[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths


Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Sam McCall via llvm-dev
Oops, vscale * n * sizeof(ty).

:)


On 12 Jun 2018, at 18:18, Graham Hunter <[hidden email]> wrote:

Hi,

Just to clarify the part below about changing the proposal, this means that a <scalable n x ty> vector:

* Has a size of vscale * sizeof(ty) when reasoning about sizes within llvm
* Potentially has a smaller size of active_vscale * sizeof(ty) depending on the current
  processor state at runtime.

For SVE, active_vscale == vscale. For RVV (and potentially others), it can be smaller based
on the VL state generated by setvl or similar.
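As a rough scalar model of the two sizes (my own sketch, not LLVM code; the helper names are made up for illustration):

```cpp
#include <cassert>
#include <cstddef>

// Allocated size of a <scalable n x ty> vector: what LLVM reasons with
// when sizing things like stack slots. vscale is a runtime quantity.
size_t allocated_size_bytes(size_t vscale, size_t n, size_t size_of_ty) {
  return vscale * n * size_of_ty;
}

// Potentially smaller "active" size, depending on processor state at
// runtime (e.g. the VL set by RVV's setvl). For SVE the two are equal.
size_t active_size_bytes(size_t active_vscale, size_t n, size_t size_of_ty) {
  return active_vscale * n * size_of_ty;
}
```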

-Graham


I am not quite so sure about turning the active vector length into
just another mask. It's true that the effects on arithmetic, load,
stores, etc. are the same as if everything executed under a mask like
<1, 1, ..., 1, 0, 0, ..., 0> with the number of ones equal to the
active vector length. However, actually materializing the masks in the
IR means the RISCV backend has to reverse-engineer what it must do
with the vl register for any given (masked or unmasked) vector
operation. The stakes for that are rather high, because (1) it applies
to pretty much every single vector operation ever, and (2) when it
fails, the codegen impact is incredibly bad.

I can see where the concern comes from; we had problems reconstructing
semantics when experimenting with search loop vectorization and often
had to fall back on default (slow) generic cases.

My main reason for proposing this was to try and ensure that the size was
consistent from the point of view of the query functions we were discussing
in the main thread. If you're fine with all size queries assuming maxvl (so
things like stack slots would always use the current configured maximum
length), then I don't think there's a problem with dropping this part of the
proposal and letting you find a better representation of active length.



_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Sam McCall via llvm-dev
Hi David,

Responses below.

-Graham

On 11 Jun 2018, at 22:19, David A. Greene <[hidden email]> wrote:

Graham Hunter <[hidden email]> writes:

========
1. Types
========

To represent a vector of unknown length a boolean `Scalable` property has been
added to the `VectorType` class, which indicates that the number of elements in
the vector is a runtime-determined integer multiple of the `NumElements` field.
Most code that deals with vectors doesn't need to know the exact length, but
does need to know relative lengths -- e.g. get a vector with the same number of
elements but a different element type, or with half or double the number of
elements.

In order to allow code to transparently support scalable vectors, we introduce
an `ElementCount` class with two members:

- `unsigned Min`: the minimum number of elements.
- `bool Scalable`: is the element count an unknown multiple of `Min`?

For non-scalable vectors (``Scalable=false``) the scale is considered to be
equal to one and thus `Min` represents the exact number of elements in the
vector.

The intent for code working with vectors is to use convenience methods and avoid
directly dealing with the number of elements. If needed, calling
`getElementCount` on a vector type instead of `getVectorNumElements` can be used
to obtain the (potentially scalable) number of elements. Overloaded division and
multiplication operators allow an ElementCount instance to be used in much the
same manner as an integer for most cases.

This mixture of compile-time and runtime quantities allows us to reason about the
relationship between different scalable vector types without knowing their
exact length.
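A minimal sketch of how such a class might look (illustrative only; the exact
implementation and operator set may differ):

```cpp
#include <cassert>

// Sketch of ElementCount: a minimum element count plus a flag saying
// whether the true count is an unknown runtime multiple (vscale) of Min.
struct ElementCount {
  unsigned Min;   // minimum number of elements
  bool Scalable;  // if true, runtime count is vscale * Min

  ElementCount(unsigned M, bool S) : Min(M), Scalable(S) {}

  // Halving or doubling the count preserves scalability, which is how
  // code asks for "a vector with half/double the number of elements".
  ElementCount operator/(unsigned RHS) const {
    return ElementCount(Min / RHS, Scalable);
  }
  ElementCount operator*(unsigned RHS) const {
    return ElementCount(Min * RHS, Scalable);
  }
  bool operator==(const ElementCount &RHS) const {
    return Min == RHS.Min && Scalable == RHS.Scalable;
  }
};
```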

How does this work in practice?  Let's say I populate a vector with a
splat.  Presumably, that gives me a "full length" vector.  Let's say the
type is <scalable 2 x double>.  How do I split the vector and get
something half the width?  What is its type?  How do I split it again
and get something a quarter of the width?  What is its type?  How do I
use half- and quarter-width vectors?  Must I resort to predication?

To split a <scalable 2 x double> in half, you'd use a shufflevector in much the
same way you would for fixed-length vector types.

e.g.
``
  %sv = call <scalable 1 x i32> @llvm.experimental.vector.stepvector.nxv1i32()
  %halfvec = shufflevector <scalable 2 x double> %fullvec, <scalable 2 x double> undef, <scalable 1 x i32> %sv
``

You can't split it any further than a <scalable 1 x <ty>>, since there may only be
one element in the actual hardware vector at runtime. The same restriction applies to
a <1 x <ty>>. This is why we have a minimum number of lanes in addition to the
scalable flag: it lets us concatenate and split vectors. Since SVE registers
always contain the same number of bytes, the number of elements per register
decreases as the element type increases in size.

If you want to extract something other than the first part of a vector, you need to add
offsets based on a calculation from vscale (e.g. adding vscale * (min_elts/2) allows you
to reach the high half of a larger register).
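For example, extracting the high half could look like the sketch below
(illustrative only: the vscale intrinsic name is assumed, and a shufflevector
with a non-constant mask is exactly the VECTOR_SHUFFLE_VAR-style operation
that isn't implemented in-tree yet):

``
  ; offset = splat(vscale * (min_elts / 2)) = splat(vscale * 1)
  %vscale = call i32 @llvm.experimental.vector.vscale.i32()
  %step = call <scalable 1 x i32> @llvm.experimental.vector.stepvector.nxv1i32()
  %ins = insertelement <scalable 1 x i32> undef, i32 %vscale, i32 0
  %offset = shufflevector <scalable 1 x i32> %ins, <scalable 1 x i32> undef, <scalable 1 x i32> zeroinitializer
  %mask = add <scalable 1 x i32> %step, %offset
  %highhalf = shufflevector <scalable 2 x double> %fullvec, <scalable 2 x double> undef, <scalable 1 x i32> %mask
``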

If you check the patch which introduces splatvector (https://reviews.llvm.org/D47775),
you can see a line which currently produces an error if changing the size of a vector
is required, and notes that VECTOR_SHUFFLE_VAR hasn't been implemented yet.

In our downstream compiler, this is an ISD alongside VECTOR_SHUFFLE which
allows a shuffle with a variable mask instead of a constant.

If people feel it would be useful, I can prepare another patch which implements these
shuffles (as an intrinsic rather than a common ISD) for review now instead of later;
I tried to keep the initial patch set small so didn't cover all cases.

In terms of using predication, that's generally not required for the legal integer types;
normal promotion via sign or zero extension works. We've tried to reuse existing
mechanisms wherever possible.

For floating point types, we do use predication to allow the use of otherwise illegal
types like <scalable 1 x double>, but that's limited to the AArch64 backend and does
not need to be represented in IR.


The split question comes into play for backward compatibility.  How
would one take a scalable vector and pass it into a NEON library?  It is
likely that some math functions, for example, will not have SVE versions
available.

I don't believe we intend to support this; instead, we would provide libraries
with SVE versions of the functions. The problem is that you don't know how
many NEON-size subvectors exist within an SVE vector at compile time.
While you could create a loop with 'vscale' number of iterations and try to
extract those subvectors, I suspect the IR would end up being quite messy
and potentially hard to recognize and optimize.

The other problem with calling non-SVE functions is that any live SVE
registers must be spilled to the stack and filled after the call, which is
likely to be quite expensive.

Is there a way to represent "double width" vectors?  In mixed-data-size
loops it is sometimes convenient to reason about double-width vectors
rather than having to split them (to legalize for the target
architecture) and keep track of their parts early on.  I guess the more
fundamental question is about how such loops should be handled.

For SVE, it's fine to generate IR with types that are 'double size' or larger,
and just leave it to legalization at SelectionDAG level to split into multiple
legal size registers.

Again, if people would like me to create a patch to illustrate legalization sooner
rather than later to help understand what's needed to support these types, let
me know.


What do insertelement and extractelement mean for scalable vectors?
Your examples showed insertelement at index zero.  How would I, say,
insertelement into the upper half of the vector?  Or any arbitrary
place?  Does insertelement at index 10 of a <scalable 2 x double> work,
assuming vscale is large enough?  It is sometimes useful to constitute a
vector out of various scalar pieces and insertelement is a convenient
way to do it.

So you can insert or extract any element known to exist (in other words, it's
within the minimum number of elements). Using a constant index outside
that range will fail, as we won't know whether the element actually exists
until we're running on a CPU.
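For example (a sketch of the rule above):

``
  ; Valid: index 1 is within the minimum element count of <scalable 2 x double>.
  %v1 = insertelement <scalable 2 x double> %v, double %d, i32 1
  ; Invalid: element 10 only exists when vscale >= 6, which we can't know at
  ; compile time, so a constant index must stay below the minimum count.
  ; %bad = insertelement <scalable 2 x double> %v, double %d, i32 10
``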

Our downstream compiler supports inserting and extracting arbitrary elements
from calculated offsets as part of our experiment on search loop vectorization,
but that generates the offsets based on a count of true bits within partitioned
predicates. I was planning on proposing new intrinsics to improve predicate use
within llvm at a later date.

We have been able to implement various types of known shuffles (like the high/low
half extract, zip, concatenation, etc) with vscale, stepvector, and the existing IR
instructions.




Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Sam McCall via llvm-dev
Graham Hunter via llvm-dev <[hidden email]> writes:

> To split a <scalable 2 x double> in half, you'd use a shufflevector in much the
> same way you would for fixed-length vector types.
>
> e.g.
> ``
> %sv = call <scalable 1 x i32> @llvm.experimental.vector.stepvector.nxv1i32()
> %halfvec = shufflevector <scalable 2 x double> %fullvec, <scalable 2 x double> undef, <scalable 1 x i32> %sv
> ``
>
> You can't split it any further than a <scalable 1 x <ty>>, since there may only be
> one element in the actual hardware vector at runtime. The same restriction applies to
> a <1 x <ty>>. This is why we have a minimum number of lanes in addition to the
> scalable flag so that we can concatenate and split vectors, since SVE registers have
> the same number of bytes and will therefore decrease the number of elements per
> register as the element type increases in size.

Right.  So let's say the hardware width is 1024.  If I have a
<scalable 2 x double> it is 1024 bits.  If I split it, it's a
<scalable 1 x double> (right?) with 512 bits.  There is no
way to create a 256-bit vector.

It's probably the case that for pure VL-agnostic code, this is ok.  Our
experience with the X1/X2, which also supported VL-agnostic code, was
that at times compiler awareness of the hardware MAXVL allowed us to
generate better code, better enough that we "cheated" regularly.  The
hardware guys loved us.  :)

I'm not at all saying that's a good idea for SVE, just recounting
experience and wondering what the implications would be for SVE and more
generally, LLVM IR.  Would the RISC-V people be interested in a
non-VL-agnostic compilation mode?

> If you want to extract something other than the first part of a vector, you need to add
> offsets based on a calculation from vscale (e.g. adding vscale * (min_elts/2) allows you
> to reach the high half of a larger register).

Sure, that makes sense.

> For floating point types, we do use predication to allow the use of otherwise illegal
> types like <scalable 1 x double>, but that's limited to the AArch64 backend and does
> not need to be represented in IR.

This is something done during or after isel?

>  The split question comes into play for backward compatibility. How
>  would one take a scalable vector and pass it into a NEON library? It is
>  likely that some math functions, for example, will not have SVE versions
>  available.
>
> I don't believe we intend to support this, but instead provide libraries with
> SVE versions of functions instead. The problem is that you don't know how
> many NEON-size subvectors exist within an SVE vector at compile time.
> While you could create a loop with 'vscale' number of iterations and try to
> extract those subvectors, I suspect the IR would end up being quite messy
> and potentially hard to recognize and optimize.

Yes, that was my concern.  The vscale loop is what I came up with as
well.  It is technically feasible, but ugly.  I'm a little concerned
about what vendors will do with this.  Not everyone is going to have the
resources to convert all of their NEON libraries, certainly not all at
once.

Just something to think about.

> The other problem with calling non-SVE functions is that any live SVE
> registers must be spilled to the stack and filled after the call, which is
> likely to be quite expensive.

Understood.

>  Is there a way to represent "double width" vectors? In mixed-data-size
>  loops it is sometimes convenient to reason about double-width vectors
>  rather than having to split them (to legalize for the target
>  architecture) and keep track of their parts early on. I guess the more
>  fundamental question is about how such loops should be handled.
>
> For SVE, it's fine to generate IR with types that are 'double size' or larger,
> and just leave it to legalization at SelectionDAG level to split into multiple
> legal size registers.

Ok, great.  If something is larger than "double size," how can it be
split, given the "split once" restriction above?

>  What do insertelement and extractelement mean for scalable vectors?
>  Your examples showed insertelement at index zero. How would I, say,
>  insertelement into the upper half of the vector? Or any arbitrary
>  place? Does insertelement at index 10 of a <scalable 2 x double> work,
>  assuming vscale is large enough? It is sometimes useful to constitute a
>  vector out of various scalar pieces and insertelement is a convenient
>  way to do it.
>
> So you can insert or extract any element known to exist (in other words, it's
> within the minimum number of elements). Using a constant index outside
> that range will fail, as we won't know whether the element actually exists
> until we're running on a cpu.

In that case to "insert" into the higher elements one would insert into
the known range and then shufflevector, I suppose.  Ok.

> Our downstream compiler supports inserting and extracting arbitrary elements
> from calculated offsets as part of our experiment on search loop vectorization,
> but that generates the offsets based on a count of true bits within partitioned
> predicates. I was planning on proposing new intrinsics to improve predicate use
> within llvm at a later date.

Ok, I look forward to seeing them!

> We have been able to implement various types of known shuffles (like the high/low
> half extract, zip, concatenation, etc) with vscale, stepvector, and the existing IR
> instructions.

Yes, I can certainly see how all of those would be implemented.  The
main case I'm thinking about is something that is "scalarized" within a
vector loop context.  I'm wondering about the best way to reconstitute a
vector from scalar pieces (or vice-versa).

Thanks for the explanations!

                          -David


Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Sam McCall via llvm-dev
On 12 June 2018 at 14:47, Graham Hunter <[hidden email]> wrote:

>
> Hi Robin,
>
> responses inline.
>
> -Graham
>
> > On 11 Jun 2018, at 16:47, Robin Kruppe <[hidden email]> wrote:
> >
> > Hi Graham,
> > Hi David,
> >
> > glad to hear other people are thinking about RVV codegen!
> >
> > On 7 June 2018 at 18:10, Graham Hunter <[hidden email]> wrote:
> >>
> >> Hi,
> >>
> >>> On 6 Jun 2018, at 17:36, David A. Greene <[hidden email]> wrote:
> >>>
> >>> Graham Hunter via llvm-dev <[hidden email]> writes:
> >>>>> Is there anything about vscale or a scalable vector that requires a
> >>>>> minimum bit width?  For example, is this legal?
> >>>>>
> >>>>> <scalable 1 x double>
> >>>>>
> >>>>> I know it won't map to an SVE type.  I'm simply curious because
> >>>>> traditionally Cray machines defined vectors in terms of
> >>>>> machine-dependent "maxvl" with an element type, so with the above vscale
> >>>>> would == maxvl.  Not that we make any such things anymore.  But maybe
> >>>>> someone else does?
> >>>>
> >>>> That's legal in IR, yes, and we believe it should be usable to represent the vectors for
> >>>> RISC-V's 'V' extension. The main problem there is that they have a dynamic vector
> >>>> length within the loop so that they can perform the last iterations of a loop within vector
> >>>> registers when there's less than a full register worth of data remaining. SVE uses
> >>>> predication (masking) to achieve the same effect.
> >
> >
> > Yes, <scalable 1 x T> should be allowed in the IR type system (even <1
> > x T> is currently allowed and unlike the scalable variant that's not
> > even useful) and such types would be the sole legal vector types in the RISCV
> > backend.
> >
> >>
> >>>> For the 'V' extension, vscale would indeed correspond to 'maxvl', and I'm hoping that a
> >>>> 'setvl' intrinsic that provides a predicate will avoid the need for modelling a change in
> >>>> dynamic vector length -- reducing the vector length is effectively equivalent to an implied
> >>>> predicate on all operations. This avoids needing to add a token operand to all existing
> >>>> instructions that work on vector types.
> >
> > Yes, vscale would be the *maximum* vector length (how many elements
> > fit into each register), not the *active* vector length (how many
> > elements are operated on in the current loop iteration).
> >
> > This has nothing to do with tokens, though. The tokens I proposed were
> > to encode the fact that even 'maxvl' varies on a function by function
> > basis. This RFC approaches the same issue differently, but it's still
> > there -- in terms of this RFC, operations on scalable vectors depend
> > on `vscale`, which is "not necessarily [constant] across functions".
> > That implies, for example, that an unmasked <scalable 4 x i32> load or
> > store (which accesses vscale * 16 bytes of memory) can't generally be
> > moved from one function to another unless it's somehow ensured that
> > both functions will have the same vscale. For that matter, the same
> > restriction applies to calls to `vscale` itself.
> >
> > The evolution of the active vector length is a separate problem and
> > one that doesn't really impact the IR type system (nor one that can
> > easily be solved by tokens).
>
> Agreed.
>
> >
> >>>
> >>> Right.  In that way the RISC V method is very much like what the old
> >>> Cray machines did with the Vector Length register.
> >>>
> >>> So in LLVM IR you would have "setvl" return a predicate and then apply
> >>> that predicate to operations using the current select method?  How does
> >>> instruction selection map that back onto a simple setvl + unpredicated
> >>> vector instructions?
> >>>
> >>> For conditional code both vector length and masking must be taken into
> >>> account.  If "setvl" returns a predicate then that predicate would have
> >>> to be combined in some way with the conditional predicate (typically via
> >>> an AND operation in an IR that directly supports predicates).  Since
> >>> LLVM IR doesn't have predicates _per_se_, would it turn into nested
> >>> selects or something?  Untangling that in instruction selection seems
> >>> difficult but perhaps I'm missing something.
> >>
> >> My idea is for the RISC-V backend to recognize when a setvl intrinsic has
> >> been used, and replace the use of its value in AND operations with an
> >> all-true value (with constant folding to remove unnecessary ANDs) then
> >> replace any masked instructions (generally loads, stores, anything else
> >> that might generate an exception or modify state that it shouldn't) with
> >> target-specific nodes that understand the dynamic vlen.
> >
> > I am not quite so sure about turning the active vector length into
> > just another mask. It's true that the effects on arithmetic, load,
> > stores, etc. are the same as if everything executed under a mask like
> > <1, 1, ..., 1, 0, 0, ..., 0> with the number of ones equal to the
> > active vector length. However, actually materializing the masks in the
> > IR means the RISCV backend has to reverse-engineer what it must do
> > with the vl register for any given (masked or unmasked) vector
> > operation. The stakes for that are rather high, because (1) it applies
> > to pretty much every single vector operation ever, and (2) when it
> > fails, the codegen impact is incredibly bad.
>
> I can see where the concern comes from; we had problems reconstructing
> semantics when experimenting with search loop vectorization and often
> had to fall back on default (slow) generic cases.
>
> My main reason for proposing this was to try and ensure that the size was
> consistent from the point of view of the query functions we were discussing
> in the main thread. If you're fine with all size queries assuming maxvl (so
> things like stack slots would always use the current configured maximum
> length), then I don't think there's a problem with dropping this part of the
> proposal and letting you find a better representation of active length.

Oh yeah, sure. The size of a vector is in terms of vscale, even from
the perspective of the RISC-V ISA. The active vector length just
modifies *operations* to produce partially-zero results and access
less memory. It's also perfectly reasonable to "mix" vectors computed
with different active vector lengths. For example, it might be useful
to create a loop-invariant vector in full before the loop and keep
using it in the loop body with varying vl. Add to that IR design
considerations and it's clear-cut to me that scalable vectors should
have *one* size and all operations should produce "full-width" (but
maybe partially-zeroed) vector results.

Partly for that reason (and also out of general hesitation to commit
to *any* model at this stage) I am lukewarm to the "active_vscale"
concept you outlined in a later email.

So, agreed, let's completely leave that part out for now. I'll come
back with specific proposals when the RVV work has matured and we have
more experience with this issue.

> > (1) The vl register affects not only loads, stores and other
> > operations with side effects, but all vector instructions, even pure
> > computation (and reg-reg moves, but that's not relevant for IR). An
> > integer vector add, for example, only computes src1[i] + src2[i] for 0
> > <= i < vl and the remaining elements of the destination register (from
> > vl upwards) are zeroed. This is very natural for strip-mined loops
> > (you'll never need those elements), but it means an unmasked IR level
> > vector add is a really bad fit for the RISC-V 'vadd' instruction.
> > Unless the backend can prove that only the first vl elements of the
> > result will ever be observed, it will have to temporarily set vl to
> > MAXVL so that the RVV instruction will actually compute the "full"
> > result. Establishing that seems like it will require at least some
> > custom data flow analysis, and it's unclear how robust it can be made.
> >
> > (2) Failing to properly use vl for some vector operation is worse than
> > e.g. materializing a mask you wouldn't otherwise need. It requires
> > that too (if the operation is masked), but more importantly it needs
> > to save vl, change it to MAXVL, and finally restore the old value.
> > That's quite expensive: besides the ca. 3 extra instructions and the
> > scratch GPR required, this save/restore dance can have other nasty
> > effects depending on uarch style. I'd have to consult the hardware
> > people to be sure, but from my understanding risks include pipeline
> > stalls and expensive roundtrips between decoupled vector and scalar
> > units.
>
> Ah, I hadn't appreciated you might need to save/restore the VL like that.
> I'd worked through a couple of small example loops and it seemed fine,
> but hadn't looked at more complicated cases.
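For reference, my reading of the semantics in (1), as a rough scalar model
(illustrative C++ only, not actual RVV behaviour taken from a spec):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Model of an unmasked RVV 'vadd' as described in (1): elements below the
// active vector length VL are computed; elements from VL up to MAXVL
// (here, the vector size) are zeroed in the destination.
std::vector<int> rvv_vadd(const std::vector<int> &Src1,
                          const std::vector<int> &Src2, size_t VL) {
  std::vector<int> Dst(Src1.size(), 0); // tail (indices >= VL) stays zero
  for (size_t I = 0; I < VL && I < Dst.size(); ++I)
    Dst[I] = Src1[I] + Src2[I];
  return Dst;
}
```

So an unmasked IR-level add only matches this instruction when vl == MAXVL,
which is why the backend would otherwise have to save vl, set it to MAXVL, and
restore it afterwards.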
>
> > To be clear: I have not yet experimented with any of this, so I'm not
> > saying this is a deal breaker. A well-engineered "demanded elements"
> > analysis may very well be good enough in practice. But since we
> > broached the subject, I wanted to mention this challenge. (I'm
> > currently side stepping it by not using built-in vector instructions
> > but instead intrinsics that treat vl as magic extra state.)
> >
> >> This could be part of lowering, or maybe a separate IR pass, rather than ISel.
> >> I *think* this will work, but if someone can come up with some IR where it
> >> wouldn't work then please let me know (e.g. global-state-changing instructions
> >> that could move out of blocks where one setvl predicate is used and into one
> >> where another is used).
> >
> > There are some operations that use vl for things other than simple
> > masking. To give one example, "speculative" loads (which silence
> > some exceptions to safely permit vectorization of some loops with
> > data-dependent exits, such as strlen) can shrink vl as a side effect.
> > I believe this can be handled by modelling all relevant operations
> > (including setvl itself) as intrinsics that have side effects or
> > read/write inaccessible memory. However, if you want to have the
> > "current" vl (or equivalent mask) around as SSA value, you need to
> > "reload" it after any operation that updates vl. That seems like it
> > could get a bit complex if you want to do it efficiently (in the
> > limit, it seems equivalent to SSA construction).
>
> Ok; the fact that there are more instructions that can change vl and that you might
> need to reload it is useful to know.
>
> SVE uses predication to achieve the same via the first-faulting/no-faulting
> load instructions and the ffr register.
>
> I think SVE having 16 predicate registers (vs. 8 for RVV and AVX-512) has led
> to us using the feature quite widely with our own experiments; I'll try looking for
> non-predicated solutions as well when we try to expand scalable vectorization
> capabilities.
>
> >> Unfortunately, I can't find a description of the instructions included in
> >> the 'V' extension in the online manual (other than setvl or configuring
> >> registers), so I can't tell if there's something I'm missing.
> >
> > I'm very sorry for that, I know how frustrating it can be. I hope the
> > above gives a clearer picture of the constraints involved. Exact
> > instructions, let alone encodings, are still in flux as Bruce said.
>
> Yes, the above information is definitely useful, even if I don't have a complete
> picture yet. Thanks.

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Sam McCall via llvm-dev
Hi David,

Responses inline.

-Graham

>> You can't split it any further than a <scalable 1 x <ty>>, since there may only be
>> one element in the actual hardware vector at runtime. The same restriction applies to
>> a <1 x <ty>>. This is why we have a minimum number of lanes in addition to the
>> scalable flag so that we can concatenate and split vectors, since SVE registers have
>> the same number of bytes and will therefore decrease the number of elements per
>> register as the element type increases in size.
>
> Right.  So let's say the hardware width is 1024.  If I have a
> <scalable 2 x double> it is 1024 bits.  If I split it, it's a
> <scalable 1 x double> (right?) with 512 bits.  There is no
> way to create a 256-bit vector.

Correct; you'd need to use predication to force smaller vectors.

> It's probably the case that for pure VL-agnostic code, this is ok.  Our
> experience with the X1/X2, which also supported VL-agnostic code, was
> that at times compiler awareness of the hardware MAXVL allowed us to
> generate better code, better enough that we "cheated" regularly.  The
> hardware guys loved us.  :)

I think if people want to take advantage of knowing the hardware vector
length for their particular machine, the existing fixed-length types will
suffice. We haven't implemented that for SVE at present, though -- a lot
of work would be required.

> I'm not at all saying that's a good idea for SVE, just recounting
> experience and wondering what the implications would be for SVE and more
> generally, LLVM IR.  Would the RISC-V people be interested in a
> non-VL-agnostic compilation mode?
>
>> If you want to extract something other than the first part of a vector, you need to add
>> offsets based on a calculation from vscale (e.g. adding vscale * (min_elts/2) allows you
>> to reach the high half of a larger register).
>
> Sure, that makes sense.
>
>> For floating point types, we do use predication to allow the use of otherwise illegal
>> types like <scalable 1 x double>, but that's limited to the AArch64 backend and does
>> not need to be represented in IR.
>
> This is something done during or after isel?

A combination of custom lowering and isel.

>> The split question comes into play for backward compatibility. How
>> would one take a scalable vector and pass it into a NEON library? It is
>> likely that some math functions, for example, will not have SVE versions
>> available.
>>
>> I don't believe we intend to support this, but instead provide libraries with
>> SVE versions of functions instead. The problem is that you don't know how
>> many NEON-size subvectors exist within an SVE vector at compile time.
>> While you could create a loop with 'vscale' number of iterations and try to
>> extract those subvectors, I suspect the IR would end up being quite messy
>> and potentially hard to recognize and optimize.
>
> Yes, that was my concern.  The vscale loop is what I came up with as
> well.  It is technically feasible, but ugly.  I'm a little concerned
> about what vendors will do with this.  Not everyone is going to have the
> resources to convert all of their NEON libraries, certainly not all at
> once.
>
> Just something to think about.

Understood; it may be that using predication (or a kernel call to limit
the maximum length to 128b) would suffice in those cases -- you can't take
advantage of the extra vector length, but since the bottom 128b of each
vector register alias with the NEON registers you could just call existing
NEON code without needing to do anything special that way. I guess it's a
trade-off as to whether you want to reuse existing tuned code or take
advantage of more powerful hardware. We haven't implemented anything like
this though.

>> The other problem with calling non-SVE functions is that any live SVE
>> registers must be spilled to the stack and filled after the call, which is
>> likely to be quite expensive.
>
> Understood.
>
>> Is there a way to represent "double width" vectors? In mixed-data-size
>> loops it is sometimes convenient to reason about double-width vectors
>> rather than having to split them (to legalize for the target
>> architecture) and keep track of their parts early on. I guess the more
>> fundamental question is about how such loops should be handled.
>>
>> For SVE, it's fine to generate IR with types that are 'double size' or larger,
>> and just leave it to legalization at SelectionDAG level to split into multiple
>> legal size registers.
>
> Ok, great.  If something is larger than "double size," how can it be
> split, given the "split once" restriction above?

There's not a 'split once' restriction, but 'n' in <scalable n x <ty>> has to
be an integer. A "double width" vector would just double 'n', so
<scalable 4 x double> would be such a type. That's not a legal type for SVE,
so it would be split into 2 <scalable 2 x double> values during legalization
(or nxv2f64, since it's at SDAG at that point).

>
>> What do insertelement and extractelement mean for scalable vectors?
>> Your examples showed insertelement at index zero. How would I, say,
>> insertelement into the upper half of the vector? Or any arbitrary
>> place? Does insertelement at index 10 of a <scalable 2 x double> work,
>> assuming vscale is large enough? It is sometimes useful to constitute a
>> vector out of various scalar pieces and insertelement is a convenient
>> way to do it.
>>
>> So you can insert or extract any element known to exist (in other words, it's
>> within the minimum number of elements). Using a constant index outside
>> that range will fail, as we won't know whether the element actually exists
>> until we're running on a cpu.
>
> In that case to "insert" into the higher elements one would insert into
> the known range and then shufflevector, I suppose.  Ok.
>
>> Our downstream compiler supports inserting and extracting arbitrary elements
>> from calculated offsets as part of our experiment on search loop vectorization,
>> but that generates the offsets based on a count of true bits within partitioned
>> predicates. I was planning on proposing new intrinsics to improve predicate use
>> within llvm at a later date.
>
> Ok, I look forward to seeing them!
>
>> We have been able to implement various types of known shuffles (like the high/low
>> half extract, zip, concatenation, etc) with vscale, stepvector, and the existing IR
>> instructions.
>
> Yes, I can certainly see how all of those would be implemented.  The
> main case I'm thinking about is something that is "scalarized" within a
> vector loop context.  I'm wondering about the best way to reconstitute a
> vector from scalar pieces (or vice-versa).

If actual scalarization is required, SVE needs to use predicates; there are
instructions that allow moving a single predicate bit forward one element at
a time, and you can then extract the element, perform whatever operation is
needed, and insert the result back into a register at the same location.

There's obviously more overhead than just moving directly from/to a fixed
number of lanes, but it works. Hopefully we won't need to use scalarization
often.

_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Sam McCall via llvm-dev
In reply to this post by Sam McCall via llvm-dev
Hi,

I've updated the RFC slightly based on the discussion within the thread, reposted below. Let me know if I've missed anything or if more clarification is needed.

Thanks,

-Graham

=============================================================
Supporting SIMD instruction sets with variable vector lengths
=============================================================

In this RFC we propose extending LLVM IR to support code-generation for variable
length vector architectures like Arm's SVE or RISC-V's 'V' extension. Our
approach is backwards compatible and should be as non-intrusive as possible; the
only change needed in other backends is how size is queried on vector types, and
it only requires a change in which function is called. We have created a set of
proof-of-concept patches to represent a simple vectorized loop in IR and
generate SVE instructions from that IR. These patches (listed in section 7 of
this RFC) can be found on Phabricator and are intended to illustrate the scope
of changes required by the general approach described in this RFC.

==========
Background
==========

The *ARMv8-A Scalable Vector Extension* (SVE) is a new vector ISA extension for
AArch64 which is intended to scale with hardware such that the same binary
running on a processor with longer vector registers can take advantage of the
increased compute power without recompilation.

As the vector length is no longer a compile-time known value, the way in which
the LLVM vectorizer generates code requires modifications such that certain
values are now runtime evaluated expressions instead of compile-time constants.

Documentation for SVE can be found at
https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a

========
Contents
========

The rest of this RFC covers the following topics:

1. Types -- a proposal to extend VectorType to be able to represent vectors that
   have a length which is a runtime-determined multiple of a known base length.

2. Size Queries - how to reason about the size of types for which the size isn't
   fully known at compile time.

3. Representing the runtime multiple of vector length in IR for use in address
   calculations and induction variable comparisons.

4. Generating 'constant' values in IR for vectors with a runtime-determined
   number of elements.

5. An explanation of splitting/concatenating scalable vectors.

6. A brief note on code generation of these new operations for AArch64.

7. An example of C code and matching IR using the proposed extensions.

8. A list of patches demonstrating the changes required to emit SVE instructions
   for a loop that has already been vectorized using the extensions described
   in this RFC.

========
1. Types
========

To represent a vector of unknown length a boolean `Scalable` property has been
added to the `VectorType` class, which indicates that the number of elements in
the vector is a runtime-determined integer multiple of the `NumElements` field.
Most code that deals with vectors doesn't need to know the exact length, but
does need to know relative lengths -- e.g. get a vector with the same number of
elements but a different element type, or with half or double the number of
elements.

In order to allow code to transparently support scalable vectors, we introduce
an `ElementCount` class with two members:

- `unsigned Min`: the minimum number of elements.
- `bool Scalable`: is the element count an unknown multiple of `Min`?

For non-scalable vectors (``Scalable=false``) the scale is considered to be
equal to one and thus `Min` represents the exact number of elements in the
vector.

The intent for code working with vectors is to use convenience methods and avoid
directly dealing with the number of elements. If needed, calling
`getElementCount` on a vector type instead of `getVectorNumElements` can be used
to obtain the (potentially scalable) number of elements. Overloaded division and
multiplication operators allow an ElementCount instance to be used in much the
same manner as an integer for most cases.

This mixture of compile-time and runtime quantities allows us to reason about
the relationship between different scalable vector types without knowing their
exact length.
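
For illustration, a minimal C++ sketch of how such an `ElementCount` class could look; the member names follow the RFC, but the operators shown are assumptions based on the description above rather than the actual proposed interface:

```cpp
#include <cassert>

// Conceptual model of the proposed ElementCount class.
struct ElementCount {
  unsigned Min;   // minimum number of elements
  bool Scalable;  // true => real count is (runtime multiple) * Min

  ElementCount(unsigned Min, bool Scalable) : Min(Min), Scalable(Scalable) {}

  // Overloaded operators let callers halve or double element counts
  // without caring whether the vector type is scalable.
  ElementCount operator*(unsigned RHS) const {
    return ElementCount(Min * RHS, Scalable);
  }
  ElementCount operator/(unsigned RHS) const {
    return ElementCount(Min / RHS, Scalable);
  }
  bool operator==(const ElementCount &RHS) const {
    return Min == RHS.Min && Scalable == RHS.Scalable;
  }
};
```

For example, halving a `<scalable 4 x i32>` count yields the count of a `<scalable 2 x i32>`, and the same code works unchanged for fixed-length `<4 x i32>`.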

The runtime multiple is not expected to change during program execution for SVE,
but it is possible. The model of scalable vectors presented in this RFC assumes
that the multiple will be constant within a function but not necessarily across
functions. As suggested in the recent RISC-V RFC, a new function attribute to
inherit the multiple across function calls will allow for function calls with
vector arguments/return values and inlining/outlining optimizations.

IR Textual Form
---------------

The textual form for a scalable vector is:

``<scalable <n> x <type>>``

where `type` is the scalar type of each element, `n` is the minimum number of
elements, and the string literal `scalable` indicates that the total number of
elements is an unknown multiple of `n`; `scalable` is just an arbitrary choice
of keyword to indicate that the vector is scalable, and could be substituted by
another. For fixed-length vectors, the `scalable` keyword is omitted, so there
is no change in the format for existing vectors.

Scalable vectors with the same `Min` value have the same number of elements, and
the same number of bytes if `Min * sizeof(type)` is the same (assuming they are
used within the same function):

``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same number of
  elements.

``<scalable 4 x i32>`` and ``<scalable 8 x i16>`` have the same number of
  bytes.

IR Bitcode Form
---------------

To serialize scalable vectors to bitcode, a new boolean field is added to the
type record. If the field is not present the type will default to a fixed-length
vector type, preserving backwards compatibility.

Alternatives Considered
-----------------------

We did consider one main alternative -- a dedicated target type, like the
x86_mmx type.

A dedicated target type would either need to extend all existing passes that
work with vectors to recognize the new type, or to duplicate all that code
in order to get reasonable code generation and autovectorization.

This hasn't been done for the x86_mmx type, and so it is only capable of
providing support for C-level intrinsics instead of being used and recognized by
passes inside llvm.

Although our current solution will need to change some of the code that creates
new VectorTypes, much of that code doesn't need to care about whether the types
are scalable or not -- they can use preexisting methods like
`getHalfElementsVectorType`. If the code is a little more complex,
`ElementCount` structs can be used instead of an `unsigned` value to represent
the number of elements.

===============
2. Size Queries
===============

This is a proposal for how to deal with querying the size of scalable types for
analysis of IR. While it has not been implemented in full, the general approach
works well for calculating offsets into structures with scalable types in a
modified version of ComputeValueVTs in our downstream compiler.

For current IR types that have a known size, all query functions return a single
integer constant. For scalable types a second integer is needed to indicate the
number of bytes/bits which need to be scaled by the runtime multiple to obtain
the actual length.

For primitive types, `getPrimitiveSizeInBits()` will function as it does today,
except that it will no longer return a size for vector types (it will return 0,
as it does for other derived types). The majority of calls to this function are
already for scalar rather than vector types.

For derived types, a function `getScalableSizePairInBits()` will be added, which
returns a pair of integers (one to indicate unscaled bits, the other for bits
that need to be scaled by the runtime multiple). For backends that do not need
to deal with scalable types the existing methods will suffice, but a debug-only
assert will be added to them to ensure they aren't used on scalable types.

Similar functionality will be added to DataLayout.

Comparisons between sizes will use the following methods, assuming that X and
Y are non-zero integers and the form is of { unscaled, scaled }.

{ X, 0 } <cmp> { Y, 0 }: Normal unscaled comparison.

{ 0, X } <cmp> { 0, Y }: Normal comparison within a function, or across
                         functions that inherit vector length. Cannot be
                         compared across non-inheriting functions.

{ X, 0 } > { 0, Y }: Cannot return true.

{ X, 0 } = { 0, Y }: Cannot return true.

{ X, 0 } < { 0, Y }: Can return true.

{ Xu, Xs } <cmp> { Yu, Ys }: Gets complicated, need to subtract common
                             terms and try the above comparisons; it
                             may not be possible to get a good answer.
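
A sketch of how such three-way comparisons could be modelled, assuming the runtime multiple is always at least one; the names and tri-state result type here are illustrative, not the proposed LLVM interface:

```cpp
#include <cassert>

// A comparison of scalable sizes can be definitely true, definitely false,
// or unknowable at compile time.
enum class SizeCmp { True, False, Unknown };

struct ScalableSize {
  unsigned Unscaled; // bits not multiplied by the runtime multiple
  unsigned Scaled;   // bits multiplied by the runtime multiple (vscale >= 1)
};

// Is X definitely smaller than Y, for every possible vscale >= 1?
SizeCmp isSmallerThan(ScalableSize X, ScalableSize Y) {
  if (X.Scaled == Y.Scaled)     // scaled terms cancel: compare unscaled
    return X.Unscaled < Y.Unscaled ? SizeCmp::True : SizeCmp::False;
  if (X.Unscaled == Y.Unscaled) // unscaled terms cancel: compare scaled
    return X.Scaled < Y.Scaled ? SizeCmp::True : SizeCmp::False;
  // { X, 0 } < { 0, Y }: true iff it already holds for vscale == 1.
  if (X.Scaled == 0 && Y.Unscaled == 0)
    return X.Unscaled < Y.Scaled ? SizeCmp::True : SizeCmp::Unknown;
  // { 0, X } < { Y, 0 }: definitely false if X >= Y, since vscale*X >= X.
  if (X.Unscaled == 0 && Y.Scaled == 0)
    return X.Scaled >= Y.Unscaled ? SizeCmp::False : SizeCmp::Unknown;
  return SizeCmp::Unknown;      // mixed terms: no general answer
}
```

This matches the rules above: `{ X, 0 } < { 0, Y }` can return true, while `{ X, 0 } > { 0, Y }` never can, because the scaled side only grows as vscale increases.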

It's worth noting that we don't expect the last case (mixed scaled and
unscaled sizes) to occur. Richard Sandiford's proposed C extensions
(http://lists.llvm.org/pipermail/cfe-dev/2018-May/057830.html) explicitly
prohibit mixing fixed-size types into sizeless structs.

I don't know if we need a 'maybe' or 'unknown' result for cases comparing scaled
vs. unscaled; I believe the gcc implementation of SVE allows for such
results, but that supports a generic polynomial length representation.

My current intention is to rely on functions that clone or copy values to
check whether they are being used to copy scalable vectors across function
boundaries without the inherit-vlen attribute, and to raise an error there
instead of requiring that each size comparison be passed the Function a type
belongs to. If there's a strong preference for moving the check into the size
comparison functions, let me know; I will start work on patches for this later
in the year if there are no major problems with the idea.

Future Work
-----------

Since we cannot determine the exact size of a scalable vector, the
existing logic for alias detection won't work when multiple accesses
share a common base pointer with different offsets.

However, SVE's predication will mean that a dynamic 'safe' vector length
can be determined at runtime, so after initial support has been added we
can work on vectorizing loops using runtime predication to avoid aliasing
problems.

Alternatives Considered
-----------------------

Marking scalable vectors as unsized doesn't work well, as many parts of
llvm dealing with loads and stores assert that 'isSized()' returns true
and make use of the size when calculating offsets.

We have considered introducing multiple helper functions instead of
using direct size queries, but that doesn't cover all cases. It may
still be a good idea to introduce them to make the purpose in a given
case more obvious, e.g. 'requiresSignExtension(Type*,Type*)'.

========================================
3. Representing Vector Length at Runtime
========================================

With a scalable vector type defined, we now need a way to represent the runtime
length in IR in order to generate addresses for consecutive vectors in memory
and determine how many elements have been processed in an iteration of a loop.

We have added an experimental `vscale` intrinsic to represent the runtime
multiple. Multiplying the result of this intrinsic by the minimum number of
elements in a vector gives the total number of elements in a scalable vector.

Fixed-Length Code
-----------------

Assuming a vector type of <4 x <ty>>
``
vector.body:
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  ;; <loop body>
  ;; Increment induction var
  %index.next = add i64 %index, 4
  ;; <check and branch>
``
Scalable Equivalent
-------------------

Assuming a vector type of <scalable 4 x <ty>>
``
vector.body:
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  ;; <loop body>
  ;; Increment induction var
  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
  %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
  ;; <check and branch>
``
===========================
4. Generating Vector Values
===========================
For constant vector values, we cannot specify all the elements as we can for
fixed-length vectors; fortunately only a small number of easily synthesized
patterns are required for autovectorization. The `zeroinitializer` constant
can be used in the same manner as fixed-length vectors for a constant zero
splat. This can then be combined with `insertelement` and `shufflevector`
to create arbitrary value splats in the same manner as fixed-length vectors.

For constants consisting of a sequence of values, an experimental `stepvector`
intrinsic has been added to represent a simple constant of the form
`<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new
start can be added, and changing the step requires multiplying by a splat.

Fixed-Length Code
-----------------
``
  ;; Splat a value
  %insert = insertelement <4 x i32> undef, i32 %value, i32 0
  %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x i32> zeroinitializer
  ;; Add a constant sequence
  %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
``
Scalable Equivalent
-------------------
``
  ;; Splat a value
  %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
  %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
  ;; Splat offset + stride (the same in this case)
  %insert2 = insertelement <scalable 4 x i32> undef, i32 2, i32 0
  %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
  ;; Create sequence for scalable vector
  %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
  %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
  %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
  ;; Add the runtime-generated sequence
  %add = add <scalable 4 x i32> %splat, %addoffset
``
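
The equivalence between the two forms can be checked in scalar code. A small illustrative C++ model (not part of the proposal) of the `splat(start) + stepvector * splat(step)` construction:

```cpp
#include <cassert>
#include <vector>

// Model of building a strided constant sequence from stepvector:
// element i gets Start + i * Step, i.e. splat(Start) + stepvector * splat(Step).
std::vector<int> sequence(int NumElems, int Start, int Step) {
  std::vector<int> V(NumElems);
  for (int I = 0; I < NumElems; ++I)
    V[I] = Start + I * Step; // stepvector supplies the I term
  return V;
}
```

With start and step both equal to 2 this reproduces the fixed-length constant `<i32 2, i32 4, i32 6, i32 8>` from the example above, for any element count.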
Future Work
-----------

Intrinsics cannot currently be used for constant folding. Our downstream
compiler (using Constants instead of intrinsics) relies quite heavily on this
for good code generation, so we will need to find new ways to recognize and
fold these values.

===========================================
5. Splitting and Combining Scalable Vectors
===========================================

Splitting and combining scalable vectors in IR is done in the same manner as
for fixed-length vectors, but with a non-constant mask for the shufflevector.

The following is an example of splitting a <scalable 4 x double> into two
separate <scalable 2 x double> values.

``
  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
  ;; Stepvector generates the element ids for first subvector
  %sv1 = call <scalable 2 x i64> @llvm.experimental.vector.stepvector.nxv2i64()
  ;; Add vscale * 2 to get the starting element for the second subvector
  %ec = mul i64 %vscale64, 2
  %ec.ins = insertelement <scalable 2 x i64> undef, i64 %ec, i32 0
  %ec.splat = shufflevector <scalable 2 x i64> %ec.ins, <scalable 2 x i64> undef, <scalable 2 x i32> zeroinitializer
  %sv2 = add <scalable 2 x i64> %ec.splat, %sv1
  ;; Perform the extracts
  %res1 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv1
  %res2 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv2
``
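
For intuition, the index vectors above can be modelled in scalar C++ (illustrative only; `VScale` stands in for the runtime multiple):

```cpp
#include <cassert>
#include <vector>

// Model of llvm.experimental.vector.stepvector: indices <0, 1, ..., n-1>.
std::vector<long> stepvector(long NumElems) {
  std::vector<long> Idx(NumElems);
  for (long I = 0; I < NumElems; ++I)
    Idx[I] = I;
  return Idx;
}

// Indices selecting the second <scalable 2 x double> half: with
// K = VScale * 2 elements per half, that is <K, K+1, ..., 2K-1>.
std::vector<long> secondHalfIndices(long VScale) {
  long K = VScale * 2;                // elements in one subvector
  std::vector<long> Idx = stepvector(K);
  for (long &V : Idx)
    V += K;                           // splat-add of vscale * 2
  return Idx;
}
```

The first extract uses the plain stepvector (indices 0..K-1) and the second uses the offset copy, mirroring `%sv1` and `%sv2` in the IR.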

==================
6. Code Generation
==================

IR splats will be converted to an experimental splatvector intrinsic in
SelectionDAGBuilder.

All three intrinsics are custom lowered and legalized in the AArch64 backend.

Two new AArch64ISD nodes have been added to represent the same concepts
at the SelectionDAG level, while splatvector maps onto the existing
AArch64ISD::DUP.

GlobalISel
----------

Since GlobalISel was enabled by default on AArch64, it was necessary to add
scalable vector support to the LowLevelType implementation. A single bit was
added to the raw_data representation for vectors and vectors of pointers.

In addition, types that only exist in destination patterns are planted in
the enumeration of available types for generated code. While this may not be
needed in future, generating an all-true 'ptrue' value was necessary to
convert a predicated instruction into an unpredicated one.

==========
7. Example
==========

The following example shows a simple C loop which assigns the array index to
the array elements matching that index. The IR shows how vscale and stepvector
are used to create the needed values and to advance the index variable in the
loop.

C Code
------

``
void IdentityArrayInit(int *a, int count) {
  for (int i = 0; i < count; ++i)
    a[i] = i;
}
``

Scalable IR Vector Body
-----------------------

``
vector.body.preheader:
  ;; Other setup
  ;; Stepvector used to create initial identity vector
  %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
  br label %vector.body

vector.body:
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]

           ;; stepvector used for index identity on entry to loop body ;;
  %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],
                                     [ %stepvector, %vector.body.preheader ]
  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
  %vscale32 = trunc i64 %vscale64 to i32
  %1 = add i64 %0, mul (i64 %vscale64, i64 4)

           ;; vscale splat used to increment identity vector ;;
  %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32 %vscale32, i32 4), i32 0
  %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
  %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
  %2 = getelementptr inbounds i32, i32* %a, i64 %0
  %3 = bitcast i32* %2 to <scalable 4 x i32>*
  store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align 4

           ;; vscale used to increment loop index
  %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
  %4 = icmp eq i64 %index.next, %n.vec
  br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
``

==========
8. Patches
==========

List of patches:

1. Extend VectorType: https://reviews.llvm.org/D32530
2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768
3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
5. SVE Calling Convention: https://reviews.llvm.org/D47771
6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
7. Add VScale intrinsic: https://reviews.llvm.org/D47773
8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
10. Initial store patterns: https://reviews.llvm.org/D47776
11. Initial addition patterns: https://reviews.llvm.org/D47777
12. Initial left-shift patterns: https://reviews.llvm.org/D47778
13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
14. Prevectorized loop unit test: https://reviews.llvm.org/D47780


Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Sam McCall via llvm-dev

Hi,

I am the main author of RV, the Region Vectorizer (github.com/cdl-saarland/rv). I want to share our standpoint as potential users of the proposed vector-length agnostic IR (RISC-V, ARM SVE).

-- support for `llvm.experimental.vector.reduce.*` intrinsics --

RV relies heavily on predicate reductions (`or` and `and` reduction) to tame divergent loops and provide a vector-length agnostic programming model on LLVM IR. I'd really like to see these adopted early on in the new VLA backends so we can fully support these targets from the start. Without these generic intrinsics, we would either need to emit target specific ones or go through the painful process of VLA-style reduction trees with loops or the like.


-- setting the vector length (MVL) --

I really like the idea of the `inherits_vlen` attribute. Absence of this attribute in a callee means we can safely stop tracking the vector length across the call boundary.

However, I think there are some issues with the `vlen token` approach.

* Why do you need an explicit vlen token if there is a one-to-one (or one-to-zero) correspondence between functions and vlen tokens?

* My main concern is that you are navigating towards a local optimum here. All is well as long as there is only one vector length per function. However, if the architecture supports changing the vector length at any point but you explicitly forbid it, programmers will complain; well, I will, for one ;-) Once you give in to that demand, you are facing the situation that multiple vector length tokens are live within the same function. This means you have to stop transformations from mixing vector operations with different vector lengths: these would otherwise incur an expensive state change at every vlen transition. However, there is no natural way to express that two SSA values (vlen tokens) must not be live at the same program point.

On 06/11/2018 05:47 PM, Robin Kruppe via llvm-dev wrote:
There are some operations that use vl for things other than simple
masking. To give one example, "speculative" loads (which silence
some exceptions to safely permit vectorization of some loops with
data-dependent exits, such as strlen) can shrink vl as a side effect.
I believe this can be handled by modelling all relevant operations
(including setvl itself) as intrinsics that have side effects or
read/write inaccessible memory. However, if you want to have the
"current" vl (or equivalent mask) around as SSA value, you need to
"reload" it after any operation that updates vl. That seems like it
could get a bit complex if you want to do it efficiently (in the
limit, it seems equivalent to SSA construction).
I think modeling the vector length as state isn't as bad as it may sound at first. In fact, how about modeling the "hard" vector length as a thread_local global variable? That way there is exactly one valid vector length value at every point (defined by the value of the thread_local global variable of that exact name). There is no need for a "demanded vlen" analysis: the global variable yields the value immediately.

The RISC-V backend can map the global directly to the vlen register. If a target does not support a re-configurable vector length (SVE), it is safe to run SSA construction during legalization and use explicit predication instead. You'd perform SSA construction only at the backend/legalization phase.
Vice versa, coming from IR targeted at LLVM SVE, you can go the other way: run a demanded-vlen analysis and encode the result explicitly in the program. vlen changes are expensive and should be rare anyway.

; explicit vlen_state modelling in RV could look like this:

@vlen_state = thread_local global token ; this gives AA a fixed point to constrain vlen-dependent operations

llvm.vla.setvl(i32 %n)                  ; implicitly writes-only %vlen_state
i32 llvm.vla.getvl()                    ; implicitly reads-only %vlen_state

llvm.vla.fadd.f64(f64, f64)           ; implicitly reads-only %vlen_state

llvm.vla.fdiv.f64(f64, f64)           : .. same

; this implements the "speculative" load mentioned in the quote above (writes %vlen_state. I suppose it also reads it first?)
<scalable 1 x f64> llvm.riscv.probe.f64(%ptr)

Relying on memory dependence also implies that arithmetic operations can be re-ordered freely as long as vlen_state does not change between them (SLP, "loop mix (CGO16)", ..).

Regarding function calls, if the callee does not have the 'inherits_vlen' attribute, the target can use a default value at function entry (max width or "undef"). Otherwise, the vector length needs to be communicated from caller to callee. However, the `vlen_state` variable already achieves that for a first implementation.

Last but not least, thank you all for working on this! I am really looking forward to playing around with vla architectures in LLVM.

Regards,

Simon


On 07/02/2018 11:53 AM, Graham Hunter via llvm-dev wrote:
Hi,

I've updated the RFC slightly based on the discussion within the thread, reposted below. Let me know if I've missed anything or if more clarification is needed.

Thanks,

-Graham

=============================================================
Supporting SIMD instruction sets with variable vector lengths
=============================================================

In this RFC we propose extending LLVM IR to support code-generation for variable
length vector architectures like Arm's SVE or RISC-V's 'V' extension. Our
approach is backwards compatible and should be as non-intrusive as possible; the
only change needed in other backends is how size is queried on vector types, and
it only requires a change in which function is called. We have created a set of
proof-of-concept patches to represent a simple vectorized loop in IR and
generate SVE instructions from that IR. These patches (listed in section 7 of
this rfc) can be found on Phabricator and are intended to illustrate the scope
of changes required by the general approach described in this RFC.

==========
Background
==========

*ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension for
AArch64 which is intended to scale with hardware such that the same binary
running on a processor with longer vector registers can take advantage of the
increased compute power without recompilation.

As the vector length is no longer a compile-time known value, the way in which
the LLVM vectorizer generates code requires modifications such that certain
values are now runtime evaluated expressions instead of compile-time constants.

Documentation for SVE can be found at
https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a

========
Contents
========

The rest of this RFC covers the following topics:

1. Types -- a proposal to extend VectorType to be able to represent vectors that
   have a length which is a runtime-determined multiple of a known base length.

2. Size Queries - how to reason about the size of types for which the size isn't
   fully known at compile time.

3. Representing the runtime multiple of vector length in IR for use in address
   calculations and induction variable comparisons.

4. Generating 'constant' values in IR for vectors with a runtime-determined
   number of elements.

5. An explanation of splitting/concatentating scalable vectors.

6. A brief note on code generation of these new operations for AArch64.

7. An example of C code and matching IR using the proposed extensions.

8. A list of patches demonstrating the changes required to emit SVE instructions
   for a loop that has already been vectorized using the extensions described
   in this RFC.

========
1. Types
========

To represent a vector of unknown length a boolean `Scalable` property has been
added to the `VectorType` class, which indicates that the number of elements in
the vector is a runtime-determined integer multiple of the `NumElements` field.
Most code that deals with vectors doesn't need to know the exact length, but
does need to know relative lengths -- e.g. get a vector with the same number of
elements but a different element type, or with half or double the number of
elements.

In order to allow code to transparently support scalable vectors, we introduce
an `ElementCount` class with two members:

- `unsigned Min`: the minimum number of elements.
- `bool Scalable`: is the element count an unknown multiple of `Min`?

For non-scalable vectors (``Scalable=false``) the scale is considered to be
equal to one and thus `Min` represents the exact number of elements in the
vector.

The intent for code working with vectors is to use convenience methods and avoid
directly dealing with the number of elements. If needed, calling
`getElementCount` on a vector type instead of `getVectorNumElements` can be used
to obtain the (potentially scalable) number of elements. Overloaded division and
multiplication operators allow an ElementCount instance to be used in much the
same manner as an integer for most cases.

This mixture of compile-time and runtime quantities allow us to reason about the
relationship between different scalable vector types without knowing their
exact length.

The runtime multiple is not expected to change during program execution for SVE,
but it is possible. The model of scalable vectors presented in this RFC assumes
that the multiple will be constant within a function but not necessarily across
functions. As suggested in the recent RISC-V RFC, a new function attribute to
inherit the multiple across function calls will allow for function calls with
vector arguments/return values and inlining/outlining optimizations.

IR Textual Form
---------------

The textual form for a scalable vector is:

``<scalable <n> x <type>>``

where `type` is the scalar type of each element, `n` is the minimum number of
elements, and the string literal `scalable` indicates that the total number of
elements is an unknown multiple of `n`. The keyword `scalable` is an arbitrary
choice for indicating that the vector is scalable and could be substituted with another.
For fixed-length vectors, the `scalable` is omitted, so there is no change in
the format for existing vectors.

Scalable vectors with the same `Min` value have the same number of elements, and
the same number of bytes if `Min * sizeof(type)` is the same (assuming they are
used within the same function):

``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same number of
  elements.

``<scalable 4 x i32>`` and ``<scalable 8 x i16>`` have the same number of
  bytes.

IR Bitcode Form
---------------

To serialize scalable vectors to bitcode, a new boolean field is added to the
type record. If the field is not present the type will default to a fixed-length
vector type, preserving backwards compatibility.

Alternatives Considered
-----------------------

We did consider one main alternative -- a dedicated target type, like the
x86_mmx type.

A dedicated target type would either need to extend all existing passes that
work with vectors to recognize the new type, or to duplicate all that code
in order to get reasonable code generation and autovectorization.

This hasn't been done for the x86_mmx type, and so it is only capable of
providing support for C-level intrinsics instead of being used and recognized by
passes inside llvm.

Although our current solution will need to change some of the code that creates
new VectorTypes, much of that code doesn't need to care about whether the types
are scalable or not -- they can use preexisting methods like
`getHalfElementsVectorType`. If the code is a little more complex,
`ElementCount` structs can be used instead of an `unsigned` value to represent
the number of elements.

===============
2. Size Queries
===============

This is a proposal for how to deal with querying the size of scalable types for
analysis of IR. While it has not been implemented in full, the general approach
works well for calculating offsets into structures with scalable types in a
modified version of ComputeValueVTs in our downstream compiler.

For current IR types that have a known size, all query functions return a single
integer constant. For scalable types a second integer is needed to indicate the
number of bytes/bits which need to be scaled by the runtime multiple to obtain
the actual length.

For primitive types, `getPrimitiveSizeInBits()` will function as it does today,
except that it will no longer return a size for vector types (it will return 0,
as it does for other derived types). The majority of calls to this function are
already for scalar rather than vector types.

For derived types, a function `getScalableSizePairInBits()` will be added, which
returns a pair of integers (one to indicate unscaled bits, the other for bits
that need to be scaled by the runtime multiple). For backends that do not need
to deal with scalable types the existing methods will suffice, but a debug-only
assert will be added to them to ensure they aren't used on scalable types.

Similar functionality will be added to DataLayout.

Comparisons between sizes will use the following methods, assuming that X and
Y are non-zero integers and sizes take the form { unscaled, scaled }.

{ X, 0 } <cmp> { Y, 0 }: Normal unscaled comparison.

{ 0, X } <cmp> { 0, Y }: Normal comparison within a function, or across
                         functions that inherit vector length. Cannot be
                         compared across non-inheriting functions.

{ X, 0 } > { 0, Y }: Cannot return true.

{ X, 0 } = { 0, Y }: Cannot return true.

{ X, 0 } < { 0, Y }: Can return true.

{ Xu, Xs } <cmp> { Yu, Ys }: Gets complicated, need to subtract common
                             terms and try the above comparisons; it
                             may not be possible to get a good answer.
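The rules above can be sketched in C++; `ScalableSize` and `knownLess` are illustrative names (not from the patches), and only the less-than case is modelled, returning no answer when the result depends on the runtime multiple:

```cpp
#include <algorithm>
#include <optional>
#include <cassert>

// Illustrative model of the { unscaled, scaled } size pairs and the
// comparison rules above; names are hypothetical.
struct ScalableSize {
  unsigned Unscaled; // bits with a compile-time-known size
  unsigned Scaled;   // bits to be multiplied by the runtime vscale
};

// Returns true/false when the ordering is known for every vscale >= 1,
// std::nullopt when the answer depends on the runtime multiple.
std::optional<bool> knownLess(ScalableSize X, ScalableSize Y) {
  // Subtract common terms first, as the text suggests.
  unsigned U = std::min(X.Unscaled, Y.Unscaled);
  unsigned S = std::min(X.Scaled, Y.Scaled);
  X.Unscaled -= U; Y.Unscaled -= U;
  X.Scaled   -= S; Y.Scaled   -= S;

  if (X.Scaled == 0 && Y.Scaled == 0)            // { X, 0 } <cmp> { Y, 0 }
    return X.Unscaled < Y.Unscaled;
  if (X.Scaled != 0 && Y.Unscaled == 0 && Y.Scaled == 0)
    return false;                                // nonzero < zero: never
  if (Y.Scaled != 0 && X.Unscaled == 0 && X.Scaled == 0)
    return true;                                 // zero < nonzero scaled: always
  if (X.Scaled == 0 && Y.Scaled != 0)            // { X, 0 } <cmp> { 0, Y }
    return X.Unscaled < Y.Scaled ? std::optional<bool>(true) // holds for all vscale >= 1
                                 : std::nullopt;             // depends on vscale
  if (Y.Scaled == 0 && X.Scaled != 0)            // { 0, X } <cmp> { Y, 0 }
    return X.Scaled >= Y.Unscaled ? std::optional<bool>(false)
                                  : std::nullopt;
  return std::nullopt;                           // mixed terms: no good answer
}
```

For instance, `{ 32, 0 } < { 0, 64 }` is known true (32 bits is smaller than 64 scaled bits even at vscale 1), while `{ 128, 0 } < { 0, 64 }` has no compile-time answer.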

It's worth noting that we don't expect the last case (mixed scaled and
unscaled sizes) to occur. Richard Sandiford's proposed C extensions
(http://lists.llvm.org/pipermail/cfe-dev/2018-May/057830.html) explicitly
prohibit mixing fixed-size types into sizeless structs.

I don't know if we need a 'maybe' or 'unknown' result for cases comparing scaled
vs. unscaled; I believe the gcc implementation of SVE allows for such
results, but that supports a generic polynomial length representation.

My current intention is to rely on functions that clone or copy values to
check whether they are being used to copy scalable vectors across function
boundaries without the inherit-vlen attribute, and to raise an error there,
instead of requiring the Function a type's size originates from to be passed
in for each comparison. If there's a strong preference for moving the check
into the size comparison functions, let me know; I will start work on patches
for this later in the year if there are no major problems with the idea.

Future Work
-----------

Since we cannot determine the exact size of a scalable vector, the
existing logic for alias detection won't work when multiple accesses
share a common base pointer with different offsets.

However, SVE's predication will mean that a dynamic 'safe' vector length
can be determined at runtime, so after initial support has been added we
can work on vectorizing loops using runtime predication to avoid aliasing
problems.

Alternatives Considered
-----------------------

Marking scalable vectors as unsized doesn't work well, as many parts of
llvm dealing with loads and stores assert that 'isSized()' returns true
and make use of the size when calculating offsets.

We have considered introducing multiple helper functions instead of
using direct size queries, but that doesn't cover all cases. It may
still be a good idea to introduce them to make the purpose in a given
case more obvious, e.g. 'requiresSignExtension(Type*,Type*)'.

========================================
3. Representing Vector Length at Runtime
========================================

With a scalable vector type defined, we now need a way to represent the runtime
length in IR in order to generate addresses for consecutive vectors in memory
and determine how many elements have been processed in an iteration of a loop.

We have added an experimental `vscale` intrinsic to represent the runtime
multiple. Multiplying the result of this intrinsic by the minimum number of
elements in a vector gives the total number of elements in a scalable vector.

Fixed-Length Code
-----------------

Assuming a vector type of <4 x <ty>>
``
vector.body:
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  ;; <loop body>
  ;; Increment induction var
  %index.next = add i64 %index, 4
  ;; <check and branch>
``
Scalable Equivalent
-------------------

Assuming a vector type of <scalable 4 x <ty>>
``
vector.body:
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  ;; <loop body>
  ;; Increment induction var
  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
  %vscale64x4 = mul i64 %vscale64, 4
  %index.next = add i64 %index, %vscale64x4
  ;; <check and branch>
``
===========================
4. Generating Vector Values
===========================
For constant vector values, we cannot specify all the elements as we can for
fixed-length vectors; fortunately only a small number of easily synthesized
patterns are required for autovectorization. The `zeroinitializer` constant
can be used in the same manner as fixed-length vectors for a constant zero
splat. This can then be combined with `insertelement` and `shufflevector`
to create arbitrary value splats in the same manner as fixed-length vectors.

For constants consisting of a sequence of values, an experimental `stepvector`
intrinsic has been added to represent a simple constant of the form
`<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new
start can be added, and changing the step requires multiplying by a splat.

Fixed-Length Code
-----------------
``
  ;; Splat a value
  %insert = insertelement <4 x i32> undef, i32 %value, i32 0
  %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x i32> zeroinitializer
  ;; Add a constant sequence
  %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
``
Scalable Equivalent
-------------------
``
  ;; Splat a value
  %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
  %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
  ;; Splat offset + stride (the same in this case)
  %insert2 = insertelement <scalable 4 x i32> undef, i32 2, i32 0
  %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
  ;; Create sequence for scalable vector
  %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
  %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
  %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
  ;; Add the runtime-generated sequence
  %add = add <scalable 4 x i32> %splat, %addoffset
``
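The semantics of the scalable sequence above can be captured in a scalar reference model; here `vl` stands in for the runtime element count (vscale * 4), and all names are illustrative:

```cpp
#include <vector>
#include <cstdint>
#include <cassert>

// Scalar reference model of the IR sequence above: build the
// stepvector <0, 1, 2, ...>, scale it by a stride splat, add an
// offset splat, then add the result to a value splat.
std::vector<int32_t> splatPlusSequence(int32_t value, int32_t stride,
                                       int32_t offset, unsigned vl) {
  std::vector<int32_t> result(vl);
  for (unsigned i = 0; i < vl; ++i) {
    int32_t step = static_cast<int32_t>(i); // stepvector element
    result[i] = value + (step * stride + offset);
  }
  return result;
}
```

In the IR example the stride and offset are both 2 (`%str_off` is reused for both), so the final result is `%value + <2, 4, 6, ...>` for however many elements the hardware provides.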
Future Work
-----------

Intrinsics cannot currently be used for constant folding. Our downstream
compiler (using Constants instead of intrinsics) relies quite heavily on this
for good code generation, so we will need to find new ways to recognize and
fold these values.

===========================================
5. Splitting and Combining Scalable Vectors
===========================================

Splitting and combining scalable vectors in IR is done in the same manner as
for fixed-length vectors, but with a non-constant mask for the shufflevector.

The following is an example of splitting a <scalable 4 x double> into two
separate <scalable 2 x double> values.

``
  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
  ;; Stepvector generates the element ids for first subvector
  %sv1 = call <scalable 2 x i64> @llvm.experimental.vector.stepvector.nxv2i64()
  ;; Add vscale * 2 to get the starting element for the second subvector
  %ec = mul i64 %vscale64, 2
  %ec.ins = insertelement <scalable 2 x i64> undef, i64 %ec, i32 0
  %ec.splat = shufflevector <scalable 2 x i64> %ec.ins, <scalable 2 x i64> undef, <scalable 2 x i32> zeroinitializer
  %sv2 = add <scalable 2 x i64> %ec.splat, %sv1
  ;; Perform the extracts
  %res1 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv1
  %res2 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv2
``
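A scalar model of the extraction (with the vector length known only at runtime, as `in.size()` here) may help clarify what `%sv1` and `%sv2` select; the function name is illustrative:

```cpp
#include <vector>
#include <utility>
#include <cassert>

// Scalar model of splitting a <scalable 4 x double> with vl elements
// at runtime into two <scalable 2 x double> halves. %sv1 corresponds
// to indices <0 .. vl/2 - 1>, %sv2 to <vl/2 .. vl - 1>.
std::pair<std::vector<double>, std::vector<double>>
splitVector(const std::vector<double> &in) {
  unsigned half = in.size() / 2; // == vscale * 2 in the IR example
  std::vector<double> lo(in.begin(), in.begin() + half);
  std::vector<double> hi(in.begin() + half, in.end());
  return {lo, hi};
}
```

Combining two halves back into one vector is the reverse operation: a shufflevector whose index vector covers <0 .. vl - 1> across both inputs.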

==================
6. Code Generation
==================

IR splats will be converted to an experimental splatvector intrinsic in
SelectionDAGBuilder.

All three intrinsics are custom lowered and legalized in the AArch64 backend.

Two new AArch64ISD nodes have been added to represent the same concepts
at the SelectionDAG level, while splatvector maps onto the existing
AArch64ISD::DUP.

GlobalISel
----------

Since GlobalISel was enabled by default on AArch64, it was necessary to add
scalable vector support to the LowLevelType implementation. A single bit was
added to the raw_data representation for vectors and vectors of pointers.
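A purely illustrative encoding of that extra bit is sketched below; the real LowLevelType packs its fields into a raw_data bitfield with a different layout, so this is a simplified model, not the actual implementation:

```cpp
#include <cstdint>
#include <cassert>

// Illustrative model of a vector LowLevelType carrying a scalable bit;
// the real LLT bitfield layout differs.
struct SimpleLLT {
  uint16_t NumElements;      // minimum element count for scalable vectors
  uint16_t ScalarSizeInBits;
  bool IsVector;
  bool IsScalable;           // the newly added bit

  bool operator==(const SimpleLLT &O) const {
    return NumElements == O.NumElements &&
           ScalarSizeInBits == O.ScalarSizeInBits &&
           IsVector == O.IsVector && IsScalable == O.IsScalable;
  }
};

// <4 x i32> and <scalable 4 x i32> must remain distinct types.
inline SimpleLLT fixedVector(uint16_t NE, uint16_t SB) {
  return {NE, SB, true, false};
}
inline SimpleLLT scalableVector(uint16_t NE, uint16_t SB) {
  return {NE, SB, true, true};
}
```

The key property is simply that fixed and scalable vectors with the same minimum element count compare unequal, so instruction selection cannot confuse them.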

In addition, types that only exist in destination patterns are added to the
enumeration of available types for generated code. While this may not be
necessary in future, generating an all-true 'ptrue' value was necessary to
convert a predicated instruction into an unpredicated one.

==========
7. Example
==========

The following example shows a simple C loop which assigns the array index to
the array elements matching that index. The IR shows how vscale and stepvector
are used to create the needed values and to advance the index variable in the
loop.

C Code
------

``
void IdentityArrayInit(int *a, int count) {
  for (int i = 0; i < count; ++i)
    a[i] = i;
}
``

Scalable IR Vector Body
-----------------------

``
vector.body.preheader:
  ;; Other setup
  ;; Stepvector used to create initial identity vector
  %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
  br label %vector.body

vector.body:
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]

           ;; stepvector used for index identity on entry to loop body ;;
  %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],
                                     [ %stepvector, %vector.body.preheader ]
  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
  %vscale32 = trunc i64 %vscale64 to i32
  %vscale64x4 = mul i64 %vscale64, 4
  %vscale32x4 = mul i32 %vscale32, 4
  %1 = add i64 %0, %vscale64x4

           ;; vscale splat used to increment identity vector ;;
  %insert = insertelement <scalable 4 x i32> undef, i32 %vscale32x4, i32 0
  %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
  %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
  %2 = getelementptr inbounds i32, i32* %a, i64 %0
  %3 = bitcast i32* %2 to <scalable 4 x i32>*
  store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align 4

           ;; vscale used to increment loop index
  %index.next = add i64 %index, %vscale64x4
  %4 = icmp eq i64 %index.next, %n.vec
  br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
``
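The control flow of the vector body can be summarized with a scalar model, where `vl` stands in for vscale * 4 (the runtime element count) and the names are illustrative; each iteration stores the current identity vector and advances by `vl`:

```cpp
#include <cstdint>
#include <cassert>

// Scalar model of the vectorized loop above: store (i, i+1, ..., i+vl-1)
// each iteration, then advance the index by vl (== vscale * 4).
void identityArrayInit(int32_t *a, int64_t count, unsigned vl) {
  int64_t i = 0;
  // Main vector loop processes only full vectors, as in the IR
  // (%n.vec is count rounded down to a multiple of vl).
  for (; i + vl <= count; i += vl)
    for (unsigned lane = 0; lane < vl; ++lane)
      a[i + lane] = static_cast<int32_t>(i + lane); // the %vec.ind7 store
  // Scalar remainder; handled outside the IR shown above.
  for (; i < count; ++i)
    a[i] = static_cast<int32_t>(i);
}
```

Note that because `vl` is unknown at compile time, the trip count of the vector loop (`%n.vec`) must also be computed at runtime from vscale.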

==========
8. Patches
==========

List of patches:

1. Extend VectorType: https://reviews.llvm.org/D32530
2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768
3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
5. SVE Calling Convention: https://reviews.llvm.org/D47771
6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
7. Add VScale intrinsic: https://reviews.llvm.org/D47773
8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
10. Initial store patterns: https://reviews.llvm.org/D47776
11. Initial addition patterns: https://reviews.llvm.org/D47777
12. Initial left-shift patterns: https://reviews.llvm.org/D47778
13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
14. Prevectorized loop unit test: https://reviews.llvm.org/D47780

_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

-- 

Simon Moll
Researcher / PhD Student

Compiler Design Lab (Prof. Hack)
Saarland University, Computer Science
Building E1.3, Room 4.31

Tel. +49 (0)681 302-57521 : [hidden email]
Fax. +49 (0)681 302-3065  : http://compilers.cs.uni-saarland.de/people/moll


Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Sam McCall via llvm-dev
Hi Simon,

Replies inline.

-Graham

> i am the main author of RV, the Region Vectorizer (github.com/cdl-saarland/rv). I want to share our standpoint as potential users of the proposed vector-length agnostic IR (RISC-V, ARM SVE).
> -- support for `llvm.experimental.vector.reduce.*` intrinsics --
> RV relies heavily on predicate reductions (`or` and `and` reduction) to tame divergent loops and provide a vector-length agnostic programming model on LLVM IR. I'd really like to see these adopted early on in the new VLA backends so we can fully support these targets from the start. Without these generic intrinsics, we would either need to emit target-specific ones or go through the painful process of VLA-style reduction trees with loops or the like.

The vector reduction intrinsics were originally created to support SVE in order to avoid loops, so we'll definitely be using them.

> -- setting the vector length (MVL) --
> I really like the idea of the `inherits_vlen` attribute. Absence of this attribute in a callee means we can safely stop tracking the vector length across the call boundary.
> However, i think there are some issues with the `vlen token` approach.
> * Why do you need an explicit vlen token if there is a 1 : 1-0 correspondence between functions and vlen tokens?

I think there's a bit of a mix-up here... my proposal doesn't feature tokens. Robin's proposal earlier in the year did, but I think we've reached a consensus that they aren't necessary.

We do need to decide where to place the appropriate checks for which function an instruction is from before allowing copying:

1. Solely within the passes that perform cross-function optimizations. Light-weight, but easier to get it wrong.

2. Within generic methods that insert instructions into blocks. Probably more code changes than method 1. May run into problems if an instruction is cloned first (and therefore has no parent to check -- looking at operands/uses may suffice though).

3. Within size queries. Probably insufficient in places where entire blocks are copied without looking at the types of each individual instruction, and also suffers from problems when cloning instructions.

My current idea is to proceed with option 2 with some additional checks where needed.

> * My main concern is that you are navigating towards a local optimum here. All is well as long as there is only one vector length per function. However, if the architecture supports changing the vector length at any point but you explicitly forbid it, programmers will complain, well, i will for one ;-) Once you give in to that demand you are facing the situation that multiple vector length tokens are live within the same function. This means you have to stop transformations from mixing vector operations with different vector lengths: these would otherwise incur an expensive state change at every vlen transition. However, there is no natural way to express that two SSA values (vlen tokens) must not be live at the same program point.

So I think we've agreed that the notion of vscale inside a function is consistent, so that all size comparisons and stack allocations will use the maximum size for that function.

However, use of setvl or predication changes the effective length inside the function. This is already the case for masked loads and stores -- although an AVX512 vector is 512 bits in size, a different amount of data can be transferred to/from memory.

Robin will be working on the best way to represent setvl, whereas SVE will just use <scalable n x i1> predicate vectors to control length.

> On 06/11/2018 05:47 PM, Robin Kruppe via llvm-dev wrote:
>> There are some operations that use vl for things other than simple
>> masking. To give one example, "speculative" loads (which silencing
>> some exceptions to safely permit vectorization of some loops with
>> data-dependent exits, such as strlen) can shrink vl as a side effect.
>> I believe this can be handled by modelling all relevant operations
>> (including setvl itself) as intrinsics that have side effects or
>> read/write inaccessible memory. However, if you want to have the
>> "current" vl (or equivalent mask) around as SSA value, you need to
>> "reload" it after any operation that updates vl. That seems like it
>> could get a bit complex if you want to do it efficiently (in the
>> limit, it seems equivalent to SSA construction).
>>
> I think modeling the vector length as state isn't as bad as it may sound at first. In fact, how about modeling the "hard" vector length as a thread_local global variable? That way there is exactly one valid vector length value at every point (defined by the value of the thread_local global variable of the exact name). There is no need for a "demanded vlen" analysis: the global variable yields the value immediately. The RISC-V backend can map the global directly to the vlen register. If a target does not support a re-configurable vector length (SVE), it is safe to run SSA construction during legalization and use explicit predication instead. You'd perform SSA construction only at the backend/legalization phase.
> Vice versa coming from IR targeted at LLVM SVE, you can go the other way, run a demanded vlen analysis, and encode it explicitly in the program. vlen changes are expensive and should be rare anyway.

This was in response to my suggestion to model setvl with predicates; I've withdrawn the idea. The vscale intrinsic is enough to represent 'maxvl', and based on the IR samples I've seen for RVV, a setvl intrinsic would return the dynamic length in order to correctly update offset/induction variables.

> ; explicit vlen_state modelling in RV could look like this:
>
> @vlen_state = thread_local global token ; this gives AA a fixed point to constraint vlen-dependent operations
>
> llvm.vla.setvl(i32 %n)                  ; implicitly writes-only %vlen_state
> i32 llvm.vla.getvl()                    ; implicitly reads-only %vlen_state
>
> llvm.vla.fadd.f64(f64, f64)           ; implicitly reads-only %vlen_state
> llvm.vla.fdiv.f64(f64, f64)           : .. same
>
> ; this implements the "speculative" load mentioned in the quote above (writes %vlen_state. I suppose it also reads it first?)
> <scalable 1 x f64> llvm.riscv.probe.f64(%ptr)

Having separate getvl and setvl intrinsics may work nicely, but I'll leave that to Robin to decide.

> By relying on memory dependence, this also implies that arithmetic operations can be re-ordered freely as long as vlen_state does not change between them (SLP, "loop mix (CGO16)", ..).
> Regarding function calls, if the callee does not have the 'inherits_vlen' attribute, the target can use a default value at function entry (max width or "undef"). Otherwise, the vector length needs to be communicated from caller to callee. However, the `vlen_state` variable already achieves that for a first implementation.

I got the impression that the RVV team wanted to be able to reconfigure registers (and therefore potentially change max vector length/number of available registers) for each function; if a call to a function is required from inside a vectorized loop then I think maxvl/vscale has to match and the callee must not reconfigure registers. I suspect there will be a complicated cost model to decide whether to change configuration or stick with a default of all registers enabled.

> Last but not least, thank you all for working on this! I am really looking forward to playing around with vla architectures in LLVM.

Glad to hear it; there is an SVE emulator[1] available so once we've managed to get some code committed you'll be able to try some of this out, at least on one of the architectures.

[1] https://developer.arm.com/products/software-development-tools/hpc/arm-instruction-emulator


Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Sam McCall via llvm-dev
In reply to this post by Sam McCall via llvm-dev
Hi Graham,

thanks again. The changes and additions all make sense to me, I just
have one minor comment about shufflevector.

On 2 July 2018 at 11:53, Graham Hunter <[hidden email]> wrote:

> Hi,
>
> I've updated the RFC slightly based on the discussion within the thread, reposted below. Let me know if I've missed anything or if more clarification is needed.
>
> Thanks,
>
> -Graham
>
> =============================================================
> Supporting SIMD instruction sets with variable vector lengths
> =============================================================
>
> In this RFC we propose extending LLVM IR to support code-generation for variable
> length vector architectures like Arm's SVE or RISC-V's 'V' extension. Our
> approach is backwards compatible and should be as non-intrusive as possible; the
> only change needed in other backends is how size is queried on vector types, and
> it only requires a change in which function is called. We have created a set of
> proof-of-concept patches to represent a simple vectorized loop in IR and
> generate SVE instructions from that IR. These patches (listed in section 7 of
> this rfc) can be found on Phabricator and are intended to illustrate the scope
> of changes required by the general approach described in this RFC.
>
> ==========
> Background
> ==========
>
> *ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension for
> AArch64 which is intended to scale with hardware such that the same binary
> running on a processor with longer vector registers can take advantage of the
> increased compute power without recompilation.
>
> As the vector length is no longer a compile-time known value, the way in which
> the LLVM vectorizer generates code requires modifications such that certain
> values are now runtime evaluated expressions instead of compile-time constants.
>
> Documentation for SVE can be found at
> https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
>
> ========
> Contents
> ========
>
> The rest of this RFC covers the following topics:
>
> 1. Types -- a proposal to extend VectorType to be able to represent vectors that
>    have a length which is a runtime-determined multiple of a known base length.
>
> 2. Size Queries - how to reason about the size of types for which the size isn't
>    fully known at compile time.
>
> 3. Representing the runtime multiple of vector length in IR for use in address
>    calculations and induction variable comparisons.
>
> 4. Generating 'constant' values in IR for vectors with a runtime-determined
>    number of elements.
>
> 5. An explanation of splitting/concatenating scalable vectors.
>
> 6. A brief note on code generation of these new operations for AArch64.
>
> 7. An example of C code and matching IR using the proposed extensions.
>
> 8. A list of patches demonstrating the changes required to emit SVE instructions
>    for a loop that has already been vectorized using the extensions described
>    in this RFC.
>
> ========
> 1. Types
> ========
>
> To represent a vector of unknown length a boolean `Scalable` property has been
> added to the `VectorType` class, which indicates that the number of elements in
> the vector is a runtime-determined integer multiple of the `NumElements` field.
> Most code that deals with vectors doesn't need to know the exact length, but
> does need to know relative lengths -- e.g. get a vector with the same number of
> elements but a different element type, or with half or double the number of
> elements.
>
> In order to allow code to transparently support scalable vectors, we introduce
> an `ElementCount` class with two members:
>
> - `unsigned Min`: the minimum number of elements.
> - `bool Scalable`: is the element count an unknown multiple of `Min`?
>
> For non-scalable vectors (``Scalable=false``) the scale is considered to be
> equal to one and thus `Min` represents the exact number of elements in the
> vector.
>
> The intent for code working with vectors is to use convenience methods and avoid
> directly dealing with the number of elements. If needed, calling
> `getElementCount` on a vector type instead of `getVectorNumElements` can be used
> to obtain the (potentially scalable) number of elements. Overloaded division and
> multiplication operators allow an ElementCount instance to be used in much the
> same manner as an integer for most cases.
>
> This mixture of compile-time and runtime quantities allow us to reason about the
> relationship between different scalable vector types without knowing their
> exact length.
>
> The runtime multiple is not expected to change during program execution for SVE,
> but it is possible. The model of scalable vectors presented in this RFC assumes
> that the multiple will be constant within a function but not necessarily across
> functions. As suggested in the recent RISC-V rfc, a new function attribute to
> inherit the multiple across function calls will allow for function calls with
> vector arguments/return values and inlining/outlining optimizations.
>
> IR Textual Form
> ---------------
>
> The textual form for a scalable vector is:
>
> ``<scalable <n> x <type>>``
>
> where `type` is the scalar type of each element, `n` is the minimum number of
> elements, and the string literal `scalable` indicates that the total number of
> elements is an unknown multiple of `n`; `scalable` is just an arbitrary choice
> for indicating that the vector is scalable, and could be substituted by another.
> For fixed-length vectors, the `scalable` is omitted, so there is no change in
> the format for existing vectors.
>
> Scalable vectors with the same `Min` value have the same number of elements, and
> the same number of bytes if `Min * sizeof(type)` is the same (assuming they are
> used within the same function):
>
> ``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same number of
>   elements.
>
> ``<scalable 4 x i32>`` and ``<scalable 8 x i16>`` have the same number of
>   bytes.
>
> IR Bitcode Form
> ---------------
>
> To serialize scalable vectors to bitcode, a new boolean field is added to the
> type record. If the field is not present the type will default to a fixed-length
> vector type, preserving backwards compatibility.
>
> Alternatives Considered
> -----------------------
>
> We did consider one main alternative -- a dedicated target type, like the
> x86_mmx type.
>
> A dedicated target type would either need to extend all existing passes that
> work with vectors to recognize the new type, or to duplicate all that code
> in order to get reasonable code generation and autovectorization.
>
> This hasn't been done for the x86_mmx type, and so it is only capable of
> providing support for C-level intrinsics instead of being used and recognized by
> passes inside llvm.
>
> Although our current solution will need to change some of the code that creates
> new VectorTypes, much of that code doesn't need to care about whether the types
> are scalable or not -- they can use preexisting methods like
> `getHalfElementsVectorType`. If the code is a little more complex,
> `ElementCount` structs can be used instead of an `unsigned` value to represent
> the number of elements.
>
> ===============
> 2. Size Queries
> ===============
>
> This is a proposal for how to deal with querying the size of scalable types for
> analysis of IR. While it has not been implemented in full, the general approach
> works well for calculating offsets into structures with scalable types in a
> modified version of ComputeValueVTs in our downstream compiler.
>
> For current IR types that have a known size, all query functions return a single
> integer constant. For scalable types a second integer is needed to indicate the
> number of bytes/bits which need to be scaled by the runtime multiple to obtain
> the actual length.
>
> For primitive types, `getPrimitiveSizeInBits()` will function as it does today,
> except that it will no longer return a size for vector types (it will return 0,
> as it does for other derived types). The majority of calls to this function are
> already for scalar rather than vector types.
>
> For derived types, a function `getScalableSizePairInBits()` will be added, which
> returns a pair of integers (one to indicate unscaled bits, the other for bits
> that need to be scaled by the runtime multiple). For backends that do not need
> to deal with scalable types the existing methods will suffice, but a debug-only
> assert will be added to them to ensure they aren't used on scalable types.
>
> Similar functionality will be added to DataLayout.
>
> Comparisons between sizes will use the following methods, assuming that X and
> Y are non-zero integers and sizes are of the form { unscaled, scaled }.
>
> { X, 0 } <cmp> { Y, 0 }: Normal unscaled comparison.
>
> { 0, X } <cmp> { 0, Y }: Normal comparison within a function, or across
>                          functions that inherit vector length. Cannot be
>                          compared across non-inheriting functions.
>
> { X, 0 } > { 0, Y }: Cannot return true.
>
> { X, 0 } = { 0, Y }: Cannot return true.
>
> { X, 0 } < { 0, Y }: Can return true.
>
> { Xu, Xs } <cmp> { Yu, Ys }: Gets complicated, need to subtract common
>                              terms and try the above comparisons; it
>                              may not be possible to get a good answer.
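To check my reading of these rules: assuming vscale >= 1, and that both sizes come from the same function (or functions that inherit vector length), a "known less than" query could be sketched like this. `knownLessThan` is a made-up name for illustration, not something in the patches.

```cpp
#include <cassert>
#include <optional>

struct ScalableSize {
  long Unscaled; // compile-time bytes
  long Scaled;   // bytes multiplied by the runtime vscale
};

// Is X strictly smaller than Y for every possible vscale >= 1?
// Returns true/false when known either way, nullopt when it depends
// on the actual vscale.
std::optional<bool> knownLessThan(ScalableSize X, ScalableSize Y) {
  long dU = Y.Unscaled - X.Unscaled; // subtract common terms
  long dS = Y.Scaled - X.Scaled;
  // Y - X as a function of vscale v is dU + dS * v; for dS >= 0 it is
  // minimized at v = 1, and for dS <= 0 it is maximized at v = 1.
  if (dS >= 0 && dU + dS > 0)
    return true;  // positive for all v >= 1
  if (dS <= 0 && dU + dS <= 0)
    return false; // never positive for any v >= 1
  return std::nullopt;
}
```

This reproduces the table: two unscaled or two scaled sizes always compare, { X, 0 } < { 0, Y } is knowable only when X < Y, and { X, 0 } > { 0, Y } can never come out true.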
>
> It's worth noting that we don't expect the last case (mixed scaled and
> unscaled sizes) to occur. Richard Sandiford's proposed C extensions
> (http://lists.llvm.org/pipermail/cfe-dev/2018-May/057830.html) explicitly
> prohibit mixing fixed-size types into sizeless structs.
>
> I don't know if we need a 'maybe' or 'unknown' result for cases comparing scaled
> vs. unscaled; I believe the gcc implementation of SVE allows for such
> results, but that supports a generic polynomial length representation.
>
> My current intention is to rely on functions that clone or copy values to
> check whether they are being used to copy scalable vectors across function
> boundaries without the inherit vlen attribute, and to raise an error there
> instead of requiring that the Function a given type size came from be
> passed in for each comparison. If there's a strong preference for moving
> the check to the size comparison functions, let me know; I will be starting
> work on patches for this later in the year if there are no major problems
> with the idea.

Seems reasonable.

> Future Work
> -----------
>
> Since we cannot determine the exact size of a scalable vector, the
> existing logic for alias detection won't work when multiple accesses
> share a common base pointer with different offsets.
>
> However, SVE's predication will mean that a dynamic 'safe' vector length
> can be determined at runtime, so after initial support has been added we
> can work on vectorizing loops using runtime predication to avoid aliasing
> problems.
>
> Alternatives Considered
> -----------------------
>
> Marking scalable vectors as unsized doesn't work well, as many parts of
> llvm dealing with loads and stores assert that 'isSized()' returns true
> and make use of the size when calculating offsets.
>
> We have considered introducing multiple helper functions instead of
> using direct size queries, but that doesn't cover all cases. It may
> still be a good idea to introduce them to make the purpose in a given
> case more obvious, e.g. 'requiresSignExtension(Type*,Type*)'.
>
> ========================================
> 3. Representing Vector Length at Runtime
> ========================================
>
> With a scalable vector type defined, we now need a way to represent the runtime
> length in IR in order to generate addresses for consecutive vectors in memory
> and determine how many elements have been processed in an iteration of a loop.
>
> We have added an experimental `vscale` intrinsic to represent the runtime
> multiple. Multiplying the result of this intrinsic by the minimum number of
> elements in a vector gives the total number of elements in a scalable vector.
>
> Fixed-Length Code
> -----------------
>
> Assuming a vector type of <4 x <ty>>
> ``
> vector.body:
>   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>   ;; <loop body>
>   ;; Increment induction var
>   %index.next = add i64 %index, 4
>   ;; <check and branch>
> ``
> Scalable Equivalent
> -------------------
>
> Assuming a vector type of <scalable 4 x <ty>>
> ``
> vector.body:
>   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>   ;; <loop body>
>   ;; Increment induction var
>   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>   %vscale64x4 = mul i64 %vscale64, 4
>   %index.next = add i64 %index, %vscale64x4
>   ;; <check and branch>
> ``
> ===========================
> 4. Generating Vector Values
> ===========================
> For constant vector values, we cannot specify all the elements as we can for
> fixed-length vectors; fortunately only a small number of easily synthesized
> patterns are required for autovectorization. The `zeroinitializer` constant
> can be used in the same manner as fixed-length vectors for a constant zero
> splat. This can then be combined with `insertelement` and `shufflevector`
> to create arbitrary value splats in the same manner as fixed-length vectors.
>
> For constants consisting of a sequence of values, an experimental `stepvector`
> intrinsic has been added to represent a simple constant of the form
> `<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new
> start can be added, and changing the step requires multiplying by a splat.
>
> Fixed-Length Code
> -----------------
> ``
>   ;; Splat a value
>   %insert = insertelement <4 x i32> undef, i32 %value, i32 0
>   %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x i32> zeroinitializer
>   ;; Add a constant sequence
>   %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
> ``
> Scalable Equivalent
> -------------------
> ``
>   ;; Splat a value
>   %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
>   %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>   ;; Splat offset + stride (the same in this case)
>   %insert2 = insertelement <scalable 4 x i32> undef, i32 2, i32 0
>   %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>   ;; Create sequence for scalable vector
>   %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
>   %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
>   %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
>   ;; Add the runtime-generated sequence
>   %add = add <scalable 4 x i32> %splat, %addoffset
> ``
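For concreteness, here's a tiny scalar model of the stepvector arithmetic above, with plain C++ loops standing in for the vector operations and `n` standing in for vscale * 4 (unknown until runtime):

```cpp
#include <vector>

// Scalar model of splat(start) + stepvector * splat(stride) for a
// vector of n lanes; stepvector's lane i holds the value i.
std::vector<int> sequence(int start, int stride, int n) {
  std::vector<int> out(n);
  for (int i = 0; i < n; ++i)
    out[i] = start + i * stride;
  return out;
}
```

With start = stride = 2 and n = 4 this reproduces the fixed-length constant `<i32 2, i32 4, i32 6, i32 8>` from the first example.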
> Future Work
> -----------
>
> Intrinsics cannot currently be used for constant folding. Our downstream
> compiler (using Constants instead of intrinsics) relies quite heavily on this
> for good code generation, so we will need to find new ways to recognize and
> fold these values.
>
> ===========================================
> 5. Splitting and Combining Scalable Vectors
> ===========================================
>
> Splitting and combining scalable vectors in IR is done in the same manner as
> for fixed-length vectors, but with a non-constant mask for the shufflevector.

It makes sense to have runtime-computed shuffle masks for some
architectures, especially those with runtime-variable vector lengths,
but lifting the current restriction that the shufflevector mask is a
constant affects all code that inspects the indices. There's a lot of such
code, and as far as I've seen a fair amount of it crucially
depends on the mask being constant. I'm not opposed to lifting the
restriction, but I want to call attention to it and double-check
everyone's okay with it because it seems like a big step and, unlike
other IR changes in this RFC, it isn't really necessary (we could also
use an intrinsic for these shuffles).

> The following is an example of splitting a <scalable 4 x double> into two
> separate <scalable 2 x double> values.
>
> ``
>   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>   ;; Stepvector generates the element ids for first subvector
>   %sv1 = call <scalable 2 x i64> @llvm.experimental.vector.stepvector.nxv2i64()
>   ;; Add vscale * 2 to get the starting element for the second subvector
>   %ec = mul i64 %vscale64, 2
>   %ec.ins = insertelement <scalable 2 x i64> undef, i64 %ec, i32 0
>   %ec.splat = shufflevector <scalable 2 x i64> %ec.ins, <scalable 2 x i64> undef, <scalable 2 x i32> zeroinitializer
>   %sv2 = add <scalable 2 x i64> %ec.splat, %sv1
>   ;; Perform the extracts
>   %res1 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv1
>   %res2 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv2
> ``
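A scalar model of the two extracts, again with plain loops; the element count 2 * vscale is just `in.size() / 2` here:

```cpp
#include <vector>

// Model of splitting one vector into low/high halves via index vectors:
// lane ids 0 .. half-1 (stepvector) select the low half, and ids
// half .. n-1 (stepvector + splat(half)) select the high half.
std::vector<double> extractHalf(const std::vector<double> &in, bool high) {
  std::size_t half = in.size() / 2; // vscale * 2 elements per result
  std::size_t base = high ? half : 0;
  std::vector<double> out(half);
  for (std::size_t i = 0; i < half; ++i)
    out[i] = in[base + i];
  return out;
}
```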
>
> ==================
> 6. Code Generation
> ==================
>
> IR splats will be converted to an experimental splatvector intrinsic in
> SelectionDAGBuilder.
>
> All three intrinsics are custom lowered and legalized in the AArch64 backend.
>
> Two new AArch64ISD nodes have been added to represent the same concepts
> at the SelectionDAG level, while splatvector maps onto the existing
> AArch64ISD::DUP.
>
> GlobalISel
> ----------
>
> Since GlobalISel was enabled by default on AArch64, it was necessary to add
> scalable vector support to the LowLevelType implementation. A single bit was
> added to the raw_data representation for vectors and vectors of pointers.
>
> In addition, types that only exist in destination patterns are planted in
> the enumeration of available types for generated code. While this may not be
> necessary in future, generating an all-true 'ptrue' value was necessary to
> convert a predicated instruction into an unpredicated one.
>
> ==========
> 7. Example
> ==========
>
> The following example shows a simple C loop which assigns the array index to
> the array elements matching that index. The IR shows how vscale and stepvector
> are used to create the needed values and to advance the index variable in the
> loop.
>
> C Code
> ------
>
> ``
> void IdentityArrayInit(int *a, int count) {
>   for (int i = 0; i < count; ++i)
>     a[i] = i;
> }
> ``
>
> Scalable IR Vector Body
> -----------------------
>
> ``
> vector.body.preheader:
>   ;; Other setup
>   ;; Stepvector used to create initial identity vector
>   %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
>   br label %vector.body
>
> vector.body:
>   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>   %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]
>
>            ;; stepvector used for index identity on entry to loop body ;;
>   %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],
>                                      [ %stepvector, %vector.body.preheader ]
>   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>   %vscale32 = trunc i64 %vscale64 to i32
>   %vscale64x4 = mul i64 %vscale64, 4
>   %1 = add i64 %0, %vscale64x4
>
>            ;; vscale splat used to increment identity vector ;;
>   %vscale32x4 = mul i32 %vscale32, 4
>   %insert = insertelement <scalable 4 x i32> undef, i32 %vscale32x4, i32 0
>   %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>   %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
>   %2 = getelementptr inbounds i32, i32* %a, i64 %0
>   %3 = bitcast i32* %2 to <scalable 4 x i32>*
>   store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align 4
>
>            ;; vscale used to increment loop index
>   %index.next = add i64 %index, %vscale64x4
>   %4 = icmp eq i64 %index.next, %n.vec
>   br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
> ``
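As a scalar model of what the vector body computes (with an arbitrary assumed vscale; the scalar remainder loop corresponds to the middle block, which the IR above elides):

```cpp
#include <vector>

// Model of the vectorized IdentityArrayInit: each vector iteration stores
// vscale * 4 lanes of the identity sequence and advances the index by
// vscale * 4, mirroring %vec.ind7/%step.add8 and %index.next.
void identityArrayInit(std::vector<int> &a, int vscale) {
  int vf = vscale * 4; // elements per vector iteration
  std::size_t i = 0;
  for (; i + vf <= a.size(); i += vf)       // vector.body
    for (int lane = 0; lane < vf; ++lane)
      a[i + lane] = static_cast<int>(i) + lane;
  for (; i < a.size(); ++i)                 // scalar remainder
    a[i] = static_cast<int>(i);
}
```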
>
> ==========
> 8. Patches
> ==========
>
> List of patches:
>
> 1. Extend VectorType: https://reviews.llvm.org/D32530
> 2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768
> 3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
> 4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
> 5. SVE Calling Convention: https://reviews.llvm.org/D47771
> 6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
> 7. Add VScale intrinsic: https://reviews.llvm.org/D47773
> 8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
> 9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
> 10. Initial store patterns: https://reviews.llvm.org/D47776
> 11. Initial addition patterns: https://reviews.llvm.org/D47777
> 12. Initial left-shift patterns: https://reviews.llvm.org/D47778
> 13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
> 14. Prevectorized loop unit test: https://reviews.llvm.org/D47780
>
_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Sam McCall via llvm-dev
In reply to this post by Sam McCall via llvm-dev
Hi Simon,

to add to what Graham said...

On 2 July 2018 at 17:08, Simon Moll via llvm-dev
<[hidden email]> wrote:
> Hi,
>
> i am the main author of RV, the Region Vectorizer
> (github.com/cdl-saarland/rv). I want to share our standpoint as potential
> users of the proposed vector-length agnostic IR (RISC-V, ARM SVE).

Thanks, this perspective is very useful, just as our discussion at EuroLLVM was.

> -- support for `llvm.experimental.vector.reduce.*` intrinsics --
>
> RV relies heavily on predicate reductions (`or` and `and` reduction) to tame
> divergent loops and provide a vector-length agnostic programming model on
> LLVM IR. I'd really like to see these adopted early on in the new VLA
> backends so we can fully support these targets from the start. Without these
> generic intrinsics, we would either need to emit target specific ones or go
> through the painful process of VLA-style reduction trees with loops or the
> like.
>
>
> -- setting the vector length (MVL) --
>
> I really like the idea of the `inherits_vlen` attribute. Absence of this
> attribute in a callee means we can safely stop tracking the vector length
> across the call boundary.
>
> However, i think there are some issues with the `vlen token` approach.
>
> * Why do you need an explicit vlen token if there is a 1 : 1-0
> correspondence between functions and vlen tokens?
>
> * My main concern is that you are navigating towards a local optimum here.
> All is well as long as there is only one vector length per function.
> However, if the architecture supports changing the vector length at any
> point but you explicitly forbid it, programmers will complain, well, i will
> for one ;-) Once you give in to that demand you are facing the situation
> that multiple vector length tokens are live within the same function. This
> means you have to stop transformations from mixing vector operations with
> different vector lengths: these would otherwise incur an expensive state
> change at every vlen transition. However, there is no natural way to express
> that two SSA values (vlen tokens) must not be live at the same program
> point.

Can you elaborate on what use cases you have in mind for this? I'm
very curious because I'm only aware of one case where changing vscale
at a specific point is desirable: when you have two independent code
regions (e.g., separate loop nests with no vector values flowing
between them) that are substantially different in their demands from
the vector unit. And even in that case, you only need a way to give
the backend the freedom to change it if beneficial, not to exert
actual control over the size of the vector registers [1]. As mentioned
in my RFC in April, I believe we can still support that case
reasonably well with an optimization pass in the backend (operating on
MIR, i.e., no IR changes).

Everything else I know of that falls under "changing vector lengths"
is better served by predication or RISC-V's "active vector length"
(vl) register. Even tricks for running code that is intended for
packed SIMD of a particular width on top of the variable-vector-length
RISC-V ISA only need to fiddle with the active vector length! And to
be clear, the active vector length is a completely separate mechanism
from the concept that's called vscale in this RFC, vlen in my previous
RFC, and MVL in the RISC-V ISA.

The restriction of 1 function : 1 vscale is not one that was adopted
lightly. In some sense, yes, it's just a local optimum and one can
think about IR designs that don't couple vscale to function
boundaries. However, there are multiple factors that make it
challenging to do in LLVM IR, and challenging to implement in the
backend too (some of them outlined in my RFC from April). After
tinkering with the problem for months, I'm fairly certain that
multiple vscales in one function is several orders of magnitude more
complex and difficult to add to LLVM IR (some different IRs would work
better), so I'd really like to understand any and all reasons
programmers might have for wanting to change vscale, and hopefully
find ways to support what they want to do without opening this
pandora's box.

[1] Software has extremely little control about vscale/MVL/etc.
anyway: on packed SIMD machines vscale = 1 is hardwired, on RISC-V you
can only change the vector unit configuration and thereby affect the
MVL in very indirect and uarch-dependent ways, and on SVE it's
implementation-defined which multiples of 128 bit are supported.

> On 06/11/2018 05:47 PM, Robin Kruppe via llvm-dev wrote:
>
> There are some operations that use vl for things other than simple
> masking. To give one example, "speculative" loads (which silencing
> some exceptions to safely permit vectorization of some loops with
> data-dependent exits, such as strlen) can shrink vl as a side effect.
> I believe this can be handled by modelling all relevant operations
> (including setvl itself) as intrinsics that have side effects or
> read/write inaccessible memory. However, if you want to have the
> "current" vl (or equivalent mask) around as SSA value, you need to
> "reload" it after any operation that updates vl. That seems like it
> could get a bit complex if you want to do it efficiently (in the
> limit, it seems equivalent to SSA construction).
>
> I think modeling the vector length as state isn't as bad as it may sound
> first. In fact, how about modeling the "hard" vector length as a

The term "hard" vector length here makes me suspect you might be
mixing up the physical size of vector registers, which is derived from
vscale in IR terms or the vector unit configuration in the RISC-V ISA,
with the _active_ vector length which is effectively just encoding a
particular kind of predication to limit processing to a subset of
lanes in the physical registers. Since everything below makes more
sense when read as referring to the active vector length, I will
assume you meant that, but just to be sure could you please clarify?

> thread_local global variable? That way there is exactly one valid vector
> length value at every point (defined by the value of the thread_local global
> variable of the exact name). There is no need for a "demanded vlen"
> analyses: the global variable yields the value immediately. The RISC-V
> backend can map the global directly to the vlen register. If a target does
> not support a re-configurable vector length (SVE), it is safe to run SSA
> construction during legalization and use explicit predication instead. You'd
> perform SSA construction only at the backend/legalization phase.
> Vice versa coming from IR targeted at LLVM SVE, you can go the other way,
> run a demanded vlen analysis, and encode it explicitly in the program. vlen
> changes are expensive and should be rare anyway.

For the active vector length, yes, modelling the architectural state
as memory read and written by intrinsics works fine and is in fact
roughly what I'm currently doing. Globals can't be of type token, but
I use the existing "reads/write hidden memory" flag on intrinsics
instead of globals anyway (which increases AA precision, and doesn't
require special casing these intrinsics in AA code).

However, these intrinsics also have some downsides that might make
different solutions better in the long run. For example, the
artificial memory accesses block some optimizations that don't bother
reasoning about memory in detail, and many operations controlled by
the vector length are so similar to the existing add, mul, etc.
instructions that we'd duplicate the vast majority of optimizations
that apply to those instructions (if we want equal optimization power,
that is).

> ; explicit vlen_state modelling in RV could look like this:
>
> @vlen_state = thread_local global token ; this gives AA a fixed point to
> constraint vlen-dependent operations
>
> llvm.vla.setvl(i32 %n)                  ; implicitly writes-only %vlen_state
> i32 llvm.vla.getvl()                    ; implicitly reads-only %vlen_state
>
> llvm.vla.fadd.f64(f64, f64)           ; implicitly reads-only %vlen_state
> llvm.vla.fdiv.f64(f64, f64)           ; .. same
>
> ; this implements the "speculative" load mentioned in the quote above
> (writes %vlen_state. I suppose it also reads it first?)
> <scalable 1 x f64> llvm.riscv.probe.f64(%ptr)
>
> By relying on memory dependence, this also implies that arithmetic
> operations can be re-ordered freely as long as vlen_state does not change
> between them (SLP, "loop mix (CGO16)", ..).
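If it helps, here is roughly how I picture that state model in C++. The names mirror the pseudo-IR above, and the min(n, MVL) rule for setvl is my assumption of RVV-like semantics, not something specified in this thread:

```cpp
#include <algorithm>

constexpr int MVL = 8; // hardware maximum vector length (assumed)

// The "vlen_state" global: one active vector length per thread.
thread_local int vlen_state = MVL;

// llvm.vla.setvl / llvm.vla.getvl equivalents.
int vla_setvl(int n) { return vlen_state = std::min(n, MVL); }
int vla_getvl() { return vlen_state; }

// An arithmetic op implicitly reads vlen_state: only the first
// vlen_state lanes are processed, the rest are untouched.
void vla_fadd(double *dst, const double *a, const double *b) {
  for (int i = 0; i < vla_getvl(); ++i)
    dst[i] = a[i] + b[i];
}
```

This also makes the reordering point concrete: two vla_fadd calls commute exactly when no vla_setvl (or probe) sits between them.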
>
> Regarding function calls, if the callee does not have the 'inherits_vlen'
> attribute, the target can use a default value at function entry (max width
> or "undef"). Otherwise, the vector length needs to be communicated from
> caller to callee. However, the `vlen_state` variable already achieves that
> for a first implementation.

As Graham said, on RISC-V a vector function call means that the caller
needs to configure the vector unit in a particular way (determined by
the callee's ABI), and the callee uses with that configuration. (And
the configuration determines the hardware vector length / vscale.)
This is a backend concern, except the backend needs to know whether a
function expects to be called in this way, or whether it can and needs
to pick a configuration for itself and set it up on function entry.
That's what motivates this attribute.

In terms of IR semantics, I would say vscale is *unspecified* on entry
into a function without the inherits_vscale (let's rename it to fit
this RFC's terminology) attribute. That means it's a program error to
assume you get the caller's vscale -- in scenarios where that's what
you want, you need to add the attribute everywhere. This corresponds
to the fact that in the absence of the attribute, the RISC-V backend
will make up a configuration for you and you'll get whatever vscale
that configuration implies, rather than the caller's.


Cheers,

Robin

> Last but not least, thank you all for working on this! I am really looking
> forward to playing around with vla architectures in LLVM.
>
> Regards,
>
> Simon
>
>
> On 07/02/2018 11:53 AM, Graham Hunter via llvm-dev wrote:
>
> Hi,
>
> I've updated the RFC slightly based on the discussion within the thread,
> reposted below. Let me know if I've missed anything or if more clarification
> is needed.
>
> Thanks,
>
> -Graham
>
> =============================================================
> Supporting SIMD instruction sets with variable vector lengths
> =============================================================
>
> In this RFC we propose extending LLVM IR to support code generation for
> variable-length vector architectures like Arm's SVE or RISC-V's 'V'
> extension. Our approach is backwards compatible and should be as
> non-intrusive as possible; the only change needed in other backends is how
> size is queried on vector types, and it only requires a change in which
> function is called. We have created a set of proof-of-concept patches to
> represent a simple vectorized loop in IR and generate SVE instructions
> from that IR. These patches (listed in section 8 of this RFC) can be found
> on Phabricator and are intended to illustrate the scope of changes
> required by the general approach described in this RFC.
>
> ==========
> Background
> ==========
>
> *ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension
> for AArch64 which is intended to scale with hardware such that the same
> binary running on a processor with longer vector registers can take
> advantage of the increased compute power without recompilation.
>
> As the vector length is no longer a compile-time known value, the way in
> which the LLVM vectorizer generates code requires modifications such that
> certain values are now runtime-evaluated expressions instead of
> compile-time constants.
>
> Documentation for SVE can be found at
> https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
>
> ========
> Contents
> ========
>
> The rest of this RFC covers the following topics:
>
> 1. Types -- a proposal to extend VectorType to be able to represent
>    vectors that have a length which is a runtime-determined multiple of a
>    known base length.
>
> 2. Size Queries -- how to reason about the size of types for which the
>    size isn't fully known at compile time.
>
> 3. Representing the runtime multiple of vector length in IR for use in
>    address calculations and induction variable comparisons.
>
> 4. Generating 'constant' values in IR for vectors with a
>    runtime-determined number of elements.
>
> 5. An explanation of splitting/concatenating scalable vectors.
>
> 6. A brief note on code generation of these new operations for AArch64.
>
> 7. An example of C code and matching IR using the proposed extensions.
>
> 8. A list of patches demonstrating the changes required to emit SVE
>    instructions for a loop that has already been vectorized using the
>    extensions described in this RFC.
>
> ========
> 1. Types
> ========
>
> To represent a vector of unknown length, a boolean `Scalable` property
> has been added to the `VectorType` class, which indicates that the number
> of elements in the vector is a runtime-determined integer multiple of the
> `NumElements` field. Most code that deals with vectors doesn't need to
> know the exact length, but does need to know relative lengths -- e.g. get
> a vector with the same number of elements but a different element type,
> or with half or double the number of elements.
>
> In order to allow code to transparently support scalable vectors, we
> introduce an `ElementCount` class with two members:
>
> - `unsigned Min`: the minimum number of elements.
> - `bool Scalable`: is the element count an unknown multiple of `Min`?
>
> For non-scalable vectors (``Scalable=false``) the scale is considered to
> be equal to one and thus `Min` represents the exact number of elements in
> the vector.
>
> The intent for code working with vectors is to use convenience methods
> and avoid directly dealing with the number of elements. If needed,
> calling `getElementCount` on a vector type instead of
> `getVectorNumElements` can be used to obtain the (potentially scalable)
> number of elements. Overloaded division and multiplication operators
> allow an ElementCount instance to be used in much the same manner as an
> integer for most cases.
>
> This mixture of compile-time and runtime quantities allows us to reason
> about the relationship between different scalable vector types without
> knowing their exact length.
>
> The runtime multiple is not expected to change during program execution
> for SVE, but it is possible. The model of scalable vectors presented in
> this RFC assumes that the multiple will be constant within a function but
> not necessarily across functions. As suggested in the recent RISC-V RFC,
> a new function attribute to inherit the multiple across function calls
> will allow for function calls with vector arguments/return values and
> inlining/outlining optimizations.
>
> IR Textual Form
> ---------------
>
> The textual form for a scalable vector is:
>
> ``<scalable <n> x <type>>``
>
> where `type` is the scalar type of each element, `n` is the minimum
> number of elements, and the string literal `scalable` indicates that the
> total number of elements is an unknown multiple of `n`; `scalable` is
> just an arbitrary choice for indicating that the vector is scalable, and
> could be substituted by another. For fixed-length vectors, the `scalable`
> is omitted, so there is no change in the format for existing vectors.
>
> Scalable vectors with the same `Min` value have the same number of
> elements, and the same number of bytes if `Min * sizeof(type)` is the
> same (assuming they are used within the same function):
>
> ``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same number of
>   elements.
>
> ``<scalable 4 x i32>`` and ``<scalable 8 x i16>`` have the same number of
>   bytes.
>
> IR Bitcode Form
> ---------------
>
> To serialize scalable vectors to bitcode, a new boolean field is added
> to the type record. If the field is not present the type will default to
> a fixed-length vector type, preserving backwards compatibility.
>
> Alternatives Considered
> -----------------------
>
> We did consider one main alternative -- a dedicated target type, like
> the x86_mmx type.
>
> A dedicated target type would either need to extend all existing passes
> that work with vectors to recognize the new type, or to duplicate all
> that code in order to get reasonable code generation and
> autovectorization.
>
> This hasn't been done for the x86_mmx type, and so it is only capable of
> providing support for C-level intrinsics instead of being used and
> recognized by passes inside llvm.
>
> Although our current solution will need to change some of the code that
> creates new VectorTypes, much of that code doesn't need to care about
> whether the types are scalable or not -- they can use preexisting
> methods like `getHalfElementsVectorType`. If the code is a little more
> complex, `ElementCount` structs can be used instead of an `unsigned`
> value to represent the number of elements.
>
> ===============
> 2. Size Queries
> ===============
>
> This is a proposal for how to deal with querying the size of scalable
> types for analysis of IR. While it has not been implemented in full, the
> general approach works well for calculating offsets into structures with
> scalable types in a modified version of ComputeValueVTs in our
> downstream compiler.
>
> For current IR types that have a known size, all query functions return
> a single integer constant. For scalable types a second integer is needed
> to indicate the number of bytes/bits which need to be scaled by the
> runtime multiple to obtain the actual length.
>
> For primitive types, `getPrimitiveSizeInBits()` will function as it does
> today, except that it will no longer return a size for vector types (it
> will return 0, as it does for other derived types). The majority of
> calls to this function are already for scalar rather than vector types.
>
> For derived types, a function `getScalableSizePairInBits()` will be
> added, which returns a pair of integers (one to indicate unscaled bits,
> the other for bits that need to be scaled by the runtime multiple). For
> backends that do not need to deal with scalable types the existing
> methods will suffice, but a debug-only assert will be added to them to
> ensure they aren't used on scalable types.
>
> Similar functionality will be added to DataLayout.
>
> Comparisons between sizes will use the following methods, assuming that
> X and Y are non-zero integers and sizes are of the form
> { unscaled, scaled }.
>
> { X, 0 } <cmp> { Y, 0 }: Normal unscaled comparison.
>
> { 0, X } <cmp> { 0, Y }: Normal comparison within a function, or across
>                          functions that inherit vector length. Cannot be
>                          compared across non-inheriting functions.
>
> { X, 0 } > { 0, Y }: Cannot return true.
>
> { X, 0 } = { 0, Y }: Cannot return true.
>
> { X, 0 } < { 0, Y }: Can return true.
>
> { Xu, Xs } <cmp> { Yu, Ys }: Gets complicated, need to subtract common
>                              terms and try the above comparisons; it
>                              may not be possible to get a good answer.
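As a rough illustration, this partial ordering could be modelled with a small C++ sketch. The names here are hypothetical (the actual implementation would live in Type/DataLayout); `nullopt` stands for "cannot tell at compile time", and the scaled component is implicitly multiplied by an unknown runtime vscale >= 1.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>

// Hypothetical model of the { unscaled, scaled } size pairs described above.
struct ScalableSize {
  unsigned Unscaled = 0; // bits with a fixed, known size
  unsigned Scaled = 0;   // bits multiplied by the runtime vscale (>= 1)
};

// Returns true/false when the relation is provable for every vscale,
// std::nullopt when the answer depends on the runtime value of vscale.
std::optional<bool> knownLessThan(ScalableSize X, ScalableSize Y) {
  // Subtract common terms so at most one side retains each component.
  unsigned U = std::min(X.Unscaled, Y.Unscaled);
  unsigned S = std::min(X.Scaled, Y.Scaled);
  X = {X.Unscaled - U, X.Scaled - S};
  Y = {Y.Unscaled - U, Y.Scaled - S};
  if (X.Scaled == 0 && Y.Scaled == 0)     // { X, 0 } <cmp> { Y, 0 }
    return X.Unscaled < Y.Unscaled;
  if (X.Unscaled == 0 && Y.Unscaled == 0) // { 0, X } <cmp> { 0, Y }
    return X.Scaled < Y.Scaled;
  if (X.Scaled == 0 && X.Unscaled < Y.Scaled)
    return true; // { X, 0 } < { 0, Y } holds for every vscale >= 1
  return std::nullopt; // mixed case: depends on the runtime vscale
}
```

For example, `{8, 0} < {0, 16}` is provably true for any vscale, while `{0, 8} < {16, 0}` cannot be answered without knowing vscale.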
>
> It's worth noting that we don't expect the last case (mixed scaled and
> unscaled sizes) to occur. Richard Sandiford's proposed C extensions
> (http://lists.llvm.org/pipermail/cfe-dev/2018-May/057830.html) explicitly
> prohibit mixing fixed-size types into sizeless structs.
>
> I don't know if we need a 'maybe' or 'unknown' result for cases comparing
> scaled
> vs. unscaled; I believe the gcc implementation of SVE allows for such
> results, but that supports a generic polynomial length representation.
>
> My current intention is to rely on functions that clone or copy values to
> check whether they are being used to copy scalable vectors across function
> boundaries without the inherit vlen attribute and raise an error there
> instead of requiring that the Function a given type's size derives from be
> passed in for each comparison. If
> there's a strong preference for moving the check to the size comparison
> function
> let me know; I will be starting work on patches for this later in the year
> if
> there are no major problems with the idea.
>
> Future Work
> -----------
>
> Since we cannot determine the exact size of a scalable vector, the
> existing logic for alias detection won't work when multiple accesses
> share a common base pointer with different offsets.
>
> However, SVE's predication will mean that a dynamic 'safe' vector length
> can be determined at runtime, so after initial support has been added we
> can work on vectorizing loops using runtime predication to avoid aliasing
> problems.
>
> Alternatives Considered
> -----------------------
>
> Marking scalable vectors as unsized doesn't work well, as many parts of
> llvm dealing with loads and stores assert that 'isSized()' returns true
> and make use of the size when calculating offsets.
>
> We have considered introducing multiple helper functions instead of
> using direct size queries, but that doesn't cover all cases. It may
> still be a good idea to introduce them to make the purpose in a given
> case more obvious, e.g. 'requiresSignExtension(Type*,Type*)'.
>
> ========================================
> 3. Representing Vector Length at Runtime
> ========================================
>
> With a scalable vector type defined, we now need a way to represent the
> runtime
> length in IR in order to generate addresses for consecutive vectors in
> memory
> and determine how many elements have been processed in an iteration of a
> loop.
>
> We have added an experimental `vscale` intrinsic to represent the runtime
> multiple. Multiplying the result of this intrinsic by the minimum number of
> elements in a vector gives the total number of elements in a scalable
> vector.
>
> Fixed-Length Code
> -----------------
>
> Assuming a vector type of <4 x <ty>>
> ``
> vector.body:
>   %index = phi i64 [ %index.next, %vector.body ], [ 0,
> %vector.body.preheader ]
>   ;; <loop body>
>   ;; Increment induction var
>   %index.next = add i64 %index, 4
>   ;; <check and branch>
> ``
> Scalable Equivalent
> -------------------
>
> Assuming a vector type of <scalable 4 x <ty>>
> ``
> vector.body:
>   %index = phi i64 [ %index.next, %vector.body ], [ 0,
> %vector.body.preheader ]
>   ;; <loop body>
>   ;; Increment induction var
>   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>   %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
>   ;; <check and branch>
> ``
> ===========================
> 4. Generating Vector Values
> ===========================
> For constant vector values, we cannot specify all the elements as we can for
> fixed-length vectors; fortunately only a small number of easily synthesized
> patterns are required for autovectorization. The `zeroinitializer` constant
> can be used in the same manner as fixed-length vectors for a constant zero
> splat. This can then be combined with `insertelement` and `shufflevector`
> to create arbitrary value splats in the same manner as fixed-length vectors.
>
> For constants consisting of a sequence of values, an experimental
> `stepvector`
> intrinsic has been added to represent a simple constant of the form
> `<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new
> start can be added, and changing the step requires multiplying by a splat.
>
> Fixed-Length Code
> -----------------
> ``
>   ;; Splat a value
>   %insert = insertelement <4 x i32> undef, i32 %value, i32 0
>   %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x i32>
> zeroinitializer
>   ;; Add a constant sequence
>   %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
> ``
> Scalable Equivalent
> -------------------
> ``
>   ;; Splat a value
>   %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
>   %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32>
> undef, <scalable 4 x i32> zeroinitializer
>   ;; Splat offset + stride (the same in this case)
>   %insert2 = insertelement <scalable 4 x i32> undef, i32 2, i32 0
>   %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x i32>
> undef, <scalable 4 x i32> zeroinitializer
>   ;; Create sequence for scalable vector
>   %stepvector = call <scalable 4 x i32>
> @llvm.experimental.vector.stepvector.nxv4i32()
>   %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
>   %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
>   ;; Add the runtime-generated sequence
>   %add = add <scalable 4 x i32> %splat, %addoffset
> ``
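A scalar C++ model of the idiom above may make the arithmetic clearer: the result is a splat of the start value plus stepvector times a stride, plus an offset. The function name is illustrative, not part of the proposal; with value 0, offset 2 and stride 2 it reproduces the fixed-length constant `<2, 4, 6, 8>` for any element count.

```cpp
#include <cassert>
#include <vector>

// Scalar model of: splat(Value) + (stepvector * Stride + Offset).
std::vector<int> splatPlusSequence(int Value, int Offset, int Stride,
                                   unsigned NumElems) {
  std::vector<int> R(NumElems);
  for (unsigned I = 0; I < NumElems; ++I)
    R[I] = Value + (static_cast<int>(I) * Stride + Offset);
  return R;
}
```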
> Future Work
> -----------
>
> Intrinsics cannot currently be used for constant folding. Our downstream
> compiler (using Constants instead of intrinsics) relies quite heavily on
> this
> for good code generation, so we will need to find new ways to recognize and
> fold these values.
>
> ===========================================
> 5. Splitting and Combining Scalable Vectors
> ===========================================
>
> Splitting and combining scalable vectors in IR is done in the same manner as
> for fixed-length vectors, but with a non-constant mask for the
> shufflevector.
>
> The following is an example of splitting a <scalable 4 x double> into two
> separate <scalable 2 x double> values.
>
> ``
>   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>   ;; Stepvector generates the element ids for first subvector
>   %sv1 = call <scalable 2 x i64>
> @llvm.experimental.vector.stepvector.nxv2i64()
>   ;; Add vscale * 2 to get the starting element for the second subvector
>   %ec = mul i64 %vscale64, 2
>   %ec.ins = insertelement <scalable 2 x i64> undef, i64 %ec, i32 0
>   %ec.splat = shufflevector <scalable 2 x i64> %ec.ins, <scalable 2 x i64> undef,
> <scalable 2 x i32> zeroinitializer
>   %sv2 = add <scalable 2 x i64> %ec.splat, %sv1
>   ;; Perform the extracts
>   %res1 = shufflevector <scalable 4 x double> %in, <scalable 4 x double>
> undef, <scalable 2 x i64> %sv1
>   %res2 = shufflevector <scalable 4 x double> %in, <scalable 4 x double>
> undef, <scalable 2 x i64> %sv2
> ``
>
> ==================
> 6. Code Generation
> ==================
>
> IR splats will be converted to an experimental splatvector intrinsic in
> SelectionDAGBuilder.
>
> All three intrinsics are custom lowered and legalized in the AArch64
> backend.
>
> Two new AArch64ISD nodes have been added to represent the same concepts
> at the SelectionDAG level, while splatvector maps onto the existing
> AArch64ISD::DUP.
>
> GlobalISel
> ----------
>
> Since GlobalISel was enabled by default on AArch64, it was necessary to add
> scalable vector support to the LowLevelType implementation. A single bit was
> added to the raw_data representation for vectors and vectors of pointers.
>
> In addition, types that only exist in destination patterns are planted in
> the enumeration of available types for generated code. While this may not be
> necessary in future, generating an all-true 'ptrue' value was necessary to
> convert a predicated instruction into an unpredicated one.
>
> ==========
> 7. Example
> ==========
>
> The following example shows a simple C loop which assigns the array index to
> the array elements matching that index. The IR shows how vscale and
> stepvector
> are used to create the needed values and to advance the index variable in
> the
> loop.
>
> C Code
> ------
>
> ``
> void IdentityArrayInit(int *a, int count) {
>   for (int i = 0; i < count; ++i)
>     a[i] = i;
> }
> ``
>
> Scalable IR Vector Body
> -----------------------
>
> ``
> vector.body.preheader:
>   ;; Other setup
>   ;; Stepvector used to create initial identity vector
>   %stepvector = call <scalable 4 x i32>
> @llvm.experimental.vector.stepvector.nxv4i32()
>   br label %vector.body
>
> vector.body:
>   %index = phi i64 [ %index.next, %vector.body ], [ 0,
> %vector.body.preheader ]
>   %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]
>
>            ;; stepvector used for index identity on entry to loop body ;;
>   %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],
>                                      [ %stepvector, %vector.body.preheader ]
>   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>   %vscale32 = trunc i64 %vscale64 to i32
>   %1 = add i64 %0, mul (i64 %vscale64, i64 4)
>
>            ;; vscale splat used to increment identity vector ;;
>   %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32 %vscale32,
> i32 4), i32 0
>   %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef,
> <scalable 4 x i32> zeroinitializer
>   %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
>   %2 = getelementptr inbounds i32, i32* %a, i64 %0
>   %3 = bitcast i32* %2 to <scalable 4 x i32>*
>   store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align 4
>
>            ;; vscale used to increment loop index
>   %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
>   %4 = icmp eq i64 %index.next, %n.vec
>   br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
> ``
>
> ==========
> 8. Patches
> ==========
>
> List of patches:
>
> 1. Extend VectorType: https://reviews.llvm.org/D32530
> 2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768
> 3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
> 4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
> 5. SVE Calling Convention: https://reviews.llvm.org/D47771
> 6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
> 7. Add VScale intrinsic: https://reviews.llvm.org/D47773
> 8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
> 9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
> 10. Initial store patterns: https://reviews.llvm.org/D47776
> 11. Initial addition patterns: https://reviews.llvm.org/D47777
> 12. Initial left-shift patterns: https://reviews.llvm.org/D47778
> 13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
> 14. Prevectorized loop unit test: https://reviews.llvm.org/D47780
>
> _______________________________________________
> LLVM Developers mailing list
> [hidden email]
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
> --
>
> Simon Moll
> Researcher / PhD Student
>
> Compiler Design Lab (Prof. Hack)
> Saarland University, Computer Science
> Building E1.3, Room 4.31
>
> Tel. +49 (0)681 302-57521 : [hidden email]
> Fax. +49 (0)681 302-3065  : http://compilers.cs.uni-saarland.de/people/moll
>
>

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Sam McCall via llvm-dev
In reply to this post by Sam McCall via llvm-dev
Hi Robin,

> On 7 Jul 2018, at 14:46, Robin Kruppe <[hidden email]> wrote:
>
> Hi Graham,
>
> thanks again. The changes and additions all make sense to me, I just
> have one minor comment about shufflevector.

Thanks, I think we're getting to the point where reviewing the patches makes sense
as a next step.

>> ===========================================
>> 5. Splitting and Combining Scalable Vectors
>> ===========================================
>>
>> Splitting and combining scalable vectors in IR is done in the same manner as
>> for fixed-length vectors, but with a non-constant mask for the shufflevector.
>
> It makes sense to have runtime-computed shuffle masks for some
> architectures, especially those with runtime-variable vector lengths,
> but lifting the current restriction that the shufflevector mask is a
> constant affects all code that inspects the indices. There's a lot of such
> code, and as far as I've seen a fair amount of that code crucially
> depends on the mask being constant. I'm not opposed to lifting the
> restriction, but I want to call attention to it and double-check
> everyone's okay with it because it seems like a big step and, unlike
> other IR changes in this RFC, it isn't really necessary (we could also
> use an intrinsic for these shuffles).

The way we implemented this was to have a second getShuffleMask function
which takes a pointer to a Value instead of a Constant; this function
returns false if the mask is either scalable or not a constant, and
otherwise returns true and sets the contents of the provided
SmallVectorImpl reference to the constant values.
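A minimal sketch of that shape, with simplified stand-in types (the real patch operates on Value*/Constant* and SmallVectorImpl<int>):

```cpp
#include <cassert>
#include <vector>

// Stand-in for a shufflevector mask operand.
struct MaskValue {
  bool IsConstant;
  bool IsScalable;
  std::vector<int> Indices; // only meaningful when constant
};

// Returns true and fills 'Result' only for a fixed-length constant mask;
// returns false for scalable or non-constant masks, so callers that
// require constant indices can bail out gracefully instead of asserting.
bool getShuffleMask(const MaskValue &Mask, std::vector<int> &Result) {
  if (!Mask.IsConstant || Mask.IsScalable)
    return false;
  Result = Mask.Indices;
  return true;
}
```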

This means all existing code for other targets doesn't need to change since
they will just call the existing method. Some common code will require changes,
but would also need to change anyway to add support for a variable shufflevector
intrinsic.

See patch https://reviews.llvm.org/D47775 -- the extra getShuffleMask is
implemented in lib/IR/Instructions.cpp and an example of its use is in
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp.

I did mention a possible VECTOR_SHUFFLE_VAR intrinsic in that patch, to be
used in the same manner as experimental_vector_splatvector as a stand-in
for an ISD node when lowering to SelectionDAG. In our downstream compiler
both of those are generic ISD nodes instead of intrinsics, and I think we'd
try to get that upstream eventually.

If another method is preferred, that's fine, but I think this minimizes the
changes required.

-Graham


Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Sam McCall via llvm-dev
In reply to this post by Sam McCall via llvm-dev
Robin Kruppe <[hidden email]> writes:

> Everything else I know of that falls under "changing vector lengths"
> is better served by predication or RISC-V's "active vector length"
> (vl) register.

Agreed.  A "vl" register is slightly more efficient in some cases
because forming predicates can be bothersome.

I also want to caution about predication in LLVM IR.  The way it's done
now is, I think, not quite kosher.  We use select to represent a
predicated operation, but select says nothing about suppressing the
evaluation of either input.  Therefore, there is nothing in the IR to
prevent code motion of Values outside the select.  Indeed, I ran into
this very problem a couple of months ago, where a legitimate (according
to the IR) code motion resulted in wrong answers in vectorized code
because what was supposed to be predicated was not.  We had to disable
the transformation to get things working.
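The hazard can be illustrated outside of IR, too: a select only chooses between two already-computed values, so an operation "predicated" through select still executes unconditionally. In this sketch (illustrative names, not LLVM code) the stand-in for a faulting or expensive operation runs even when the predicate is false:

```cpp
#include <cassert>

static int EvalCount = 0;

// Stands in for an operation that may fault or be expensive.
int mightFaultOrBeExpensive(int X) {
  ++EvalCount;
  return X * 2;
}

// select-style predication: both arms are evaluated regardless of Pred;
// the select only picks which result is kept.
int viaSelect(bool Pred, int X) {
  int T = mightFaultOrBeExpensive(X); // runs even when Pred is false
  return Pred ? T : 0;
}
```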

Another consequence of this setup is that we need special intrinsics to
convey evaluation requirements.  We have masked
load/store/gather/scatter intrinsics and will be getting masked
floating-point intrinsics (or something like them).

Years ago we had some discussion about how to represent predication as a
first-class IR construct but at the time it was considered too
difficult.  With more and more architectures turning to predication for
performance, perhaps it's time to revisit that conversation.

                               -David

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Sam McCall via llvm-dev


On 07/09/2018 08:01 AM, David A. Greene via llvm-dev wrote:

> Robin Kruppe <[hidden email]> writes:
>
>> Everything else I know of that falls under "changing vector lengths"
>> is better served by predication or RISC-V's "active vector length"
>> (vl) register.
> Agreed.  A "vl" register is slightly more efficient in some cases
> because forming predicates can be bothersome.
>
> I also want to caution about predication in LLVM IR.  The way it's done
> now is, I think, not quite kosher.  We use select to represent a
> predicated operation, but select says nothing about suppressing the
> evaluation of either input.  Therefore, there is nothing in the IR to
> prevent code motion of Values outside the select.  Indeed, I ran into
> this very problem a couple of months ago, where a legitimate (according
> to the IR) code motion resulted in wrong answers in vectorized code
> because what was supposed to be predicated was not.  We had to disable
> the transformation to get things working.
>
> Another consequence of this setup is that we need special intrinsics to
> convey evaluation requirements.  We have masked
> load/store/gather/scatter intrinsics and will be getting masked
> floating-point intrinsics (or something like them).
>
> Years ago we had some discussion about how to represent predication as a
> first-class IR construct but at the time it was considered too
> difficult.  With more and more architectures turning to predication for
> performance, perhaps it's time to revisit that conversation.
I've also been seeing an increasing need for (at least some form of)
predication support in the IR/optimizer.  At the moment, I'm mainly
concerned with what our canonical form should look like.  I think that
adding first class predication is probably overkill at the moment.

In my case, I'm mostly interested in predicated scalar loads and
stores.  We could trivially extend
@llvm.masked.load/@llvm.masked.store to handle the scalar cases, but the
more I think about it, I'm not sure this is actually a good canonical
form since it requires updating huge portions of the optimizer to handle
what is essentially a new instruction.

One idea I've been playing with is to represent predicated operations as
selects over the inputs, followed by a guaranteed non-faulting operation.  For
instance, a predicated load might look like:
%pred_addr = select i1 %cnd, i32* %actual_addr, i32* %safe_addr
%pred_load = load i32 %pred_addr
where %safe_addr is something like an empty alloca or reserved global
variable.  The key idea is that this can be pattern matched to an
actually predicated load if available in hardware, but can also be
optimized normally (i.e. prove the select condition).
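A scalar C++ sketch of the idea, with illustrative names: routing the address of a predicated-off load to a reserved safe slot makes the load itself unconditional and non-faulting, while the predicate still controls which value is kept.

```cpp
#include <cassert>
#include <cstddef>

static int SafeSlot = 0; // stand-in for an empty alloca / reserved global

int predicatedLoad(bool Pred, const int *ActualAddr) {
  const int *Addr = Pred ? ActualAddr : &SafeSlot; // select on the address
  int V = *Addr;       // always safe to execute, even if ActualAddr is bad
  return Pred ? V : 0; // discard the dummy value when masked off
}
```

Note that the load through the safe slot never faults even when the "real" address is invalid and the predicate is off.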

This basic idea can be extended to any potentially faulting instruction
by selecting a "safe" input (i.e. divide by select(pred, actual, 1)
etc..).  The obvious downside is the patterns can be broken up by the
optimizer in arbitrarily complex ways, but I wonder if that might be a
net pro rather than a con.

At the moment, this is just one possible idea.  I'm not at the point
of making any actual proposals just yet.

Philip

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Sam McCall via llvm-dev
In reply to this post by Sam McCall via llvm-dev
Hi,

Are there any objections to going ahead with this? If not, we'll try to get the patches reviewed and committed after the 7.0 branch occurs.

-Graham

> On 2 Jul 2018, at 10:53, Graham Hunter <[hidden email]> wrote:
>
> Hi,
>
> I've updated the RFC slightly based on the discussion within the thread, reposted below. Let me know if I've missed anything or if more clarification is needed.
>
> Thanks,
>
> -Graham
>
> =============================================================
> Supporting SIMD instruction sets with variable vector lengths
> =============================================================
>
> In this RFC we propose extending LLVM IR to support code-generation for variable
> length vector architectures like Arm's SVE or RISC-V's 'V' extension. Our
> approach is backwards compatible and should be as non-intrusive as possible; the
> only change needed in other backends is how size is queried on vector types, and
> it only requires a change in which function is called. We have created a set of
> proof-of-concept patches to represent a simple vectorized loop in IR and
> generate SVE instructions from that IR. These patches (listed in section 7 of
> this RFC) can be found on Phabricator and are intended to illustrate the scope
> of changes required by the general approach described in this RFC.
>
> ==========
> Background
> ==========
>
> *ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension for
> AArch64 which is intended to scale with hardware such that the same binary
> running on a processor with longer vector registers can take advantage of the
> increased compute power without recompilation.
>
> As the vector length is no longer a compile-time known value, the way in which
> the LLVM vectorizer generates code requires modifications such that certain
> values are now runtime evaluated expressions instead of compile-time constants.
>
> Documentation for SVE can be found at
> https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
>
> ========
> Contents
> ========
>
> The rest of this RFC covers the following topics:
>
> 1. Types -- a proposal to extend VectorType to be able to represent vectors that
>   have a length which is a runtime-determined multiple of a known base length.
>
> 2. Size Queries - how to reason about the size of types for which the size isn't
>   fully known at compile time.
>
> 3. Representing the runtime multiple of vector length in IR for use in address
>   calculations and induction variable comparisons.
>
> 4. Generating 'constant' values in IR for vectors with a runtime-determined
>   number of elements.
>
> 5. An explanation of splitting/concatenating scalable vectors.
>
> 6. A brief note on code generation of these new operations for AArch64.
>
> 7. An example of C code and matching IR using the proposed extensions.
>
> 8. A list of patches demonstrating the changes required to emit SVE instructions
>   for a loop that has already been vectorized using the extensions described
>   in this RFC.
>
> ========
> 1. Types
> ========
>
> To represent a vector of unknown length a boolean `Scalable` property has been
> added to the `VectorType` class, which indicates that the number of elements in
> the vector is a runtime-determined integer multiple of the `NumElements` field.
> Most code that deals with vectors doesn't need to know the exact length, but
> does need to know relative lengths -- e.g. get a vector with the same number of
> elements but a different element type, or with half or double the number of
> elements.
>
> In order to allow code to transparently support scalable vectors, we introduce
> an `ElementCount` class with two members:
>
> - `unsigned Min`: the minimum number of elements.
> - `bool Scalable`: is the element count an unknown multiple of `Min`?
>
> For non-scalable vectors (``Scalable=false``) the scale is considered to be
> equal to one and thus `Min` represents the exact number of elements in the
> vector.
>
> The intent for code working with vectors is to use convenience methods and avoid
> directly dealing with the number of elements. If needed, calling
> `getElementCount` on a vector type instead of `getVectorNumElements` can be used
> to obtain the (potentially scalable) number of elements. Overloaded division and
> multiplication operators allow an ElementCount instance to be used in much the
> same manner as an integer for most cases.
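A minimal sketch of what such a class could look like, assuming the member names from this RFC (the real class would carry more convenience methods):

```cpp
#include <cassert>

// Hypothetical minimal ElementCount, per the RFC's description.
struct ElementCount {
  unsigned Min;  // minimum number of elements
  bool Scalable; // true => actual count is a runtime multiple of Min

  // Scaling the count keeps the Scalable flag, so e.g. halving
  // <scalable 4 x i32> yields the count of <scalable 2 x i32>.
  ElementCount operator*(unsigned F) const { return {Min * F, Scalable}; }
  ElementCount operator/(unsigned F) const { return {Min / F, Scalable}; }
  bool operator==(const ElementCount &O) const {
    return Min == O.Min && Scalable == O.Scalable;
  }
};
```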
>
> This mixture of compile-time and runtime quantities allows us to reason about the
> relationship between different scalable vector types without knowing their
> exact length.
>
> The runtime multiple is not expected to change during program execution for SVE,
> but it is possible. The model of scalable vectors presented in this RFC assumes
> that the multiple will be constant within a function but not necessarily across
> functions. As suggested in the recent RISC-V rfc, a new function attribute to
> inherit the multiple across function calls will allow for function calls with
> vector arguments/return values and inlining/outlining optimizations.
>
> IR Textual Form
> ---------------
>
> The textual form for a scalable vector is:
>
> ``<scalable <n> x <type>>``
>
> where `type` is the scalar type of each element, `n` is the minimum number of
> elements, and the string literal `scalable` indicates that the total number of
> elements is an unknown multiple of `n`; `scalable` is just an arbitrary choice
> for indicating that the vector is scalable, and could be substituted by another.
> For fixed-length vectors, the `scalable` is omitted, so there is no change in
> the format for existing vectors.
>
> Scalable vectors with the same `Min` value have the same number of elements, and
> the same number of bytes if `Min * sizeof(type)` is the same (assuming they are
> used within the same function):
>
> ``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same number of
>  elements.
>
> ``<scalable 4 x i32>`` and ``<scalable 8 x i16>`` have the same number of
>  bytes.
>
> IR Bitcode Form
> ---------------
>
> To serialize scalable vectors to bitcode, a new boolean field is added to the
> type record. If the field is not present the type will default to a fixed-length
> vector type, preserving backwards compatibility.
>
> Alternatives Considered
> -----------------------
>
> We did consider one main alternative -- a dedicated target type, like the
> x86_mmx type.
>
> A dedicated target type would either need to extend all existing passes that
> work with vectors to recognize the new type, or to duplicate all that code
> in order to get reasonable code generation and autovectorization.
>
> This hasn't been done for the x86_mmx type, and so it is only capable of
> providing support for C-level intrinsics instead of being used and recognized by
> passes inside llvm.
>
> Although our current solution will need to change some of the code that creates
> new VectorTypes, much of that code doesn't need to care about whether the types
> are scalable or not -- they can use preexisting methods like
> `getHalfElementsVectorType`. If the code is a little more complex,
> `ElementCount` structs can be used instead of an `unsigned` value to represent
> the number of elements.
>
> ===============
> 2. Size Queries
> ===============
>
> This is a proposal for how to deal with querying the size of scalable types for
> analysis of IR. While it has not been implemented in full, the general approach
> works well for calculating offsets into structures with scalable types in a
> modified version of ComputeValueVTs in our downstream compiler.
>
> For current IR types that have a known size, all query functions return a single
> integer constant. For scalable types a second integer is needed to indicate the
> number of bytes/bits which need to be scaled by the runtime multiple to obtain
> the actual length.
>
> For primitive types, `getPrimitiveSizeInBits()` will function as it does today,
> except that it will no longer return a size for vector types (it will return 0,
> as it does for other derived types). The majority of calls to this function are
> already for scalar rather than vector types.
>
> For derived types, a function `getScalableSizePairInBits()` will be added, which
> returns a pair of integers (one to indicate unscaled bits, the other for bits
> that need to be scaled by the runtime multiple). For backends that do not need
> to deal with scalable types the existing methods will suffice, but a debug-only
> assert will be added to them to ensure they aren't used on scalable types.
>
> Similar functionality will be added to DataLayout.
>
> Comparisons between sizes will use the following methods, assuming that X and
> Y are non-zero integers and the form is of { unscaled, scaled }.
>
> { X, 0 } <cmp> { Y, 0 }: Normal unscaled comparison.
>
> { 0, X } <cmp> { 0, Y }: Normal comparison within a function, or across
>                         functions that inherit vector length. Cannot be
>                         compared across non-inheriting functions.
>
> { X, 0 } > { 0, Y }: Cannot return true.
>
> { X, 0 } = { 0, Y }: Cannot return true.
>
> { X, 0 } < { 0, Y }: Can return true.
>
> { Xu, Xs } <cmp> { Yu, Ys }: Gets complicated, need to subtract common
>                             terms and try the above comparisons; it
>                             may not be possible to get a good answer.
>
> It's worth noting that we don't expect the last case (mixed scaled and
> unscaled sizes) to occur. Richard Sandiford's proposed C extensions
> (http://lists.llvm.org/pipermail/cfe-dev/2018-May/057830.html) explicitly
> prohibit mixing fixed-size types into sizeless structs.
>
> I don't know if we need a 'maybe' or 'unknown' result for cases comparing scaled
> vs. unscaled; I believe the gcc implementation of SVE allows for such
> results, but that supports a generic polynomial length representation.
>
> My current intention is to rely on functions that clone or copy values to
> check whether they are being used to copy scalable vectors across function
> boundaries without the inherit vlen attribute and raise an error there instead
> of requiring that the Function a given type's size derives from be passed in
> for each comparison. If
> there's a strong preference for moving the check to the size comparison function
> let me know; I will be starting work on patches for this later in the year if
> there are no major problems with the idea.
>
> Future Work
> -----------
>
> Since we cannot determine the exact size of a scalable vector, the
> existing logic for alias detection won't work when multiple accesses
> share a common base pointer with different offsets.
>
> However, SVE's predication will mean that a dynamic 'safe' vector length
> can be determined at runtime, so after initial support has been added we
> can work on vectorizing loops using runtime predication to avoid aliasing
> problems.
>
> Alternatives Considered
> -----------------------
>
> Marking scalable vectors as unsized doesn't work well, as many parts of
> llvm dealing with loads and stores assert that 'isSized()' returns true
> and make use of the size when calculating offsets.
>
> We have considered introducing multiple helper functions instead of
> using direct size queries, but that doesn't cover all cases. It may
> still be a good idea to introduce them to make the purpose in a given
> case more obvious, e.g. 'requiresSignExtension(Type*,Type*)'.
>
> ========================================
> 3. Representing Vector Length at Runtime
> ========================================
>
> With a scalable vector type defined, we now need a way to represent the runtime
> length in IR in order to generate addresses for consecutive vectors in memory
> and determine how many elements have been processed in an iteration of a loop.
>
> We have added an experimental `vscale` intrinsic to represent the runtime
> multiple. Multiplying the result of this intrinsic by the minimum number of
> elements in a vector gives the total number of elements in a scalable vector.
>
> Fixed-Length Code
> -----------------
>
> Assuming a vector type of <4 x <ty>>
> ``
> vector.body:
>  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>  ;; <loop body>
>  ;; Increment induction var
>  %index.next = add i64 %index, 4
>  ;; <check and branch>
> ``
> Scalable Equivalent
> -------------------
>
> Assuming a vector type of <scalable 4 x <ty>>
> ``
> vector.body:
>  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>  ;; <loop body>
>  ;; Increment induction var
>  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>  %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
>  ;; <check and branch>
> ``
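In scalar terms, the loop above advances the induction variable by vscale * 4 elements per iteration, a value known only at run time. A small C++ model (the helper name is hypothetical) counts how many whole-vector iterations a given trip count allows:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical scalar model of the scalable loop above: the induction
// variable advances by vscale * 4 elements each iteration.
std::size_t countVectorIterations(std::size_t n, std::size_t vscale) {
  const std::size_t step = vscale * 4; // total elements per scalable vector
  std::size_t iters = 0;
  for (std::size_t index = 0; index + step <= n; index += step)
    ++iters; // one whole vector processed
  return iters;
}
```

The same binary performs fewer iterations on hardware with a larger vscale, which is the whole point of the scalable representation.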
> ===========================
> 4. Generating Vector Values
> ===========================
> For constant vector values, we cannot specify all the elements as we can for
> fixed-length vectors; fortunately only a small number of easily synthesized
> patterns are required for autovectorization. The `zeroinitializer` constant
> can be used in the same manner as fixed-length vectors for a constant zero
> splat. This can then be combined with `insertelement` and `shufflevector`
> to create arbitrary value splats in the same manner as fixed-length vectors.
>
> For constants consisting of a sequence of values, an experimental `stepvector`
> intrinsic has been added to represent a simple constant of the form
> `<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new
> start can be added, and changing the step requires multiplying by a splat.
>
> Fixed-Length Code
> -----------------
> ``
>  ;; Splat a value
>  %insert = insertelement <4 x i32> undef, i32 %value, i32 0
>  %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x i32> zeroinitializer
>  ;; Add a constant sequence
>  %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
> ``
> Scalable Equivalent
> -------------------
> ``
>  ;; Splat a value
>  %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
>  %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>  ;; Splat offset + stride (the same in this case)
>  %insert2 = insertelement <scalable 4 x i32> undef, i32 2, i32 0
>  %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>  ;; Create sequence for scalable vector
>  %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
>  %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
>  %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
>  ;; Add the runtime-generated sequence
>  %add = add <scalable 4 x i32> %splat, %addoffset
> ``
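In scalar terms the sequence above is just stepvector * stride + start. A short C++ model of that arithmetic (function and variable names are illustrative, not from the patches):

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Scalar model: stepvector yields <0, 1, ..., N-1>; a sequence with a given
// start and stride is built exactly as in the IR above, by multiplying the
// step vector by a stride splat and adding a start splat.
std::vector<std::int32_t> makeSequence(std::size_t numElems,
                                       std::int32_t start,
                                       std::int32_t stride) {
  std::vector<std::int32_t> seq(numElems);
  std::iota(seq.begin(), seq.end(), 0); // stepvector
  for (auto &e : seq)
    e = e * stride + start;             // mul by splat, add splat
  return seq;
}
```

With start and stride both 2, as in the IR above, this produces <2, 4, 6, 8, ...>.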
> Future Work
> -----------
>
> Intrinsics cannot currently be used for constant folding. Our downstream
> compiler (using Constants instead of intrinsics) relies quite heavily on this
> for good code generation, so we will need to find new ways to recognize and
> fold these values.
>
> ===========================================
> 5. Splitting and Combining Scalable Vectors
> ===========================================
>
> Splitting and combining scalable vectors in IR is done in the same manner as
> for fixed-length vectors, but with a non-constant mask for the shufflevector.
>
> The following is an example of splitting a <scalable 4 x double> into two
> separate <scalable 2 x double> values.
>
> ``
>  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>  ;; Stepvector generates the element ids for first subvector
>  %sv1 = call <scalable 2 x i64> @llvm.experimental.vector.stepvector.nxv2i64()
>  ;; Add vscale * 2 to get the starting element for the second subvector
>  %ec = mul i64 %vscale64, 2
>  %ec.ins = insertelement <scalable 2 x i64> undef, i64 %ec, i32 0
>  %ec.splat = shufflevector <scalable 2 x i64> %ec.ins, <scalable 2 x i64> undef, <scalable 2 x i32> zeroinitializer
>  %sv2 = add <scalable 2 x i64> %ec.splat, %sv1
>  ;; Perform the extracts
>  %res1 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv1
>  %res2 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv2
> ``
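A scalar model of the split may help: the two index vectors select the low and high halves, where each half holds vscale * 2 elements at run time. Sketch (illustrative only):

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative scalar model: sv1 selects elements [0, half) and sv2 selects
// [half, 2 * half), where half corresponds to vscale * 2 in the IR above.
std::pair<std::vector<double>, std::vector<double>>
splitVector(const std::vector<double> &in) {
  const auto half = static_cast<std::ptrdiff_t>(in.size() / 2);
  return {std::vector<double>(in.begin(), in.begin() + half),
          std::vector<double>(in.begin() + half, in.end())};
}
```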
>
> ==================
> 6. Code Generation
> ==================
>
> IR splats will be converted to an experimental splatvector intrinsic in
> SelectionDAGBuilder.
>
> All three intrinsics are custom lowered and legalized in the AArch64 backend.
>
> Two new AArch64ISD nodes have been added to represent the same concepts
> at the SelectionDAG level, while splatvector maps onto the existing
> AArch64ISD::DUP.
>
> GlobalISel
> ----------
>
> Since GlobalISel was enabled by default on AArch64, it was necessary to add
> scalable vector support to the LowLevelType implementation. A single bit was
> added to the raw_data representation for vectors and vectors of pointers.
>
> In addition, types that only exist in destination patterns are added to
> the enumeration of available types for generated code. While this may not be
> necessary in the future, generating an all-true 'ptrue' value was necessary to
> convert a predicated instruction into an unpredicated one.
>
> ==========
> 7. Example
> ==========
>
> The following example shows a simple C loop which assigns the array index to
> the array elements matching that index. The IR shows how vscale and stepvector
> are used to create the needed values and to advance the index variable in the
> loop.
>
> C Code
> ------
>
> ``
> void IdentityArrayInit(int *a, int count) {
>  for (int i = 0; i < count; ++i)
>    a[i] = i;
> }
> ``
>
> Scalable IR Vector Body
> -----------------------
>
> ``
> vector.body.preheader:
>  ;; Other setup
>  ;; Stepvector used to create initial identity vector
>  %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
>  br label %vector.body
>
> vector.body:
>  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>  %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]
>
>           ;; stepvector used for index identity on entry to loop body ;;
>  %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],
>                                     [ %stepvector, %vector.body.preheader ]
>  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>  %vscale32 = trunc i64 %vscale64 to i32
>  %1 = add i64 %0, mul (i64 %vscale64, i64 4)
>
>           ;; vscale splat used to increment identity vector ;;
>  %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32 %vscale32, i32 4), i32 0
>  %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>  %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
>  %2 = getelementptr inbounds i32, i32* %a, i64 %0
>  %3 = bitcast i32* %2 to <scalable 4 x i32>*
>  store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align 4
>
>           ;; vscale used to increment loop index
>  %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
>  %4 = icmp eq i64 %index.next, %n.vec
>  br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
> ``
>
> ==========
> 8. Patches
> ==========
>
> List of patches:
>
> 1. Extend VectorType: https://reviews.llvm.org/D32530
> 2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768
> 3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
> 4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
> 5. SVE Calling Convention: https://reviews.llvm.org/D47771
> 6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
> 7. Add VScale intrinsic: https://reviews.llvm.org/D47773
> 8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
> 9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
> 10. Initial store patterns: https://reviews.llvm.org/D47776
> 11. Initial addition patterns: https://reviews.llvm.org/D47777
> 12. Initial left-shift patterns: https://reviews.llvm.org/D47778
> 13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
> 14. Prevectorized loop unit test: https://reviews.llvm.org/D47780
>

_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Sam McCall via llvm-dev
I strongly suspect that there remain widespread concerns about the direction of this; I know I have them.

I don't think that many of the people who have that concern have had time to come back to this RFC and make progress on it, likely because of other commitments or simply the amount of churn around SVE related patches and such. That is at least why I haven't had time to return to this RFC and try to write more detailed feedback.

Certainly, I would want to see pretty clear and considered support for this change to the IR type system from Hal, Chris, Eric and/or other long time maintainers of core LLVM IR components before it moves forward, and I don't see that in this thread.

Put differently: I don't think silence is assent here. You really need some clear signal of consensus.

On Mon, Jul 30, 2018 at 2:23 AM Graham Hunter <[hidden email]> wrote:
Hi,

Are there any objections to going ahead with this? If not, we'll try to get the patches reviewed and committed after the 7.0 branch occurs.

-Graham

> On 2 Jul 2018, at 10:53, Graham Hunter <[hidden email]> wrote:
>
> Hi,
>
> I've updated the RFC slightly based on the discussion within the thread, reposted below. Let me know if I've missed anything or if more clarification is needed.
>
> Thanks,
>
> -Graham
>
> =============================================================
> Supporting SIMD instruction sets with variable vector lengths
> =============================================================
>
> In this RFC we propose extending LLVM IR to support code-generation for variable
> length vector architectures like Arm's SVE or RISC-V's 'V' extension. Our
> approach is backwards compatible and should be as non-intrusive as possible; the
> only change needed in other backends is how size is queried on vector types, and
> it only requires a change in which function is called. We have created a set of
> proof-of-concept patches to represent a simple vectorized loop in IR and
> generate SVE instructions from that IR. These patches (listed in section 8 of
> this RFC) can be found on Phabricator and are intended to illustrate the scope
> of changes required by the general approach described in this RFC.
>
> ==========
> Background
> ==========
>
> *ARMv8-A Scalable Vector Extension* (SVE) is a new vector ISA extension for
> AArch64 which is intended to scale with hardware such that the same binary
> running on a processor with longer vector registers can take advantage of the
> increased compute power without recompilation.
>
> As the vector length is no longer a compile-time known value, the way in which
> the LLVM vectorizer generates code requires modifications such that certain
> values are now runtime evaluated expressions instead of compile-time constants.
>
> Documentation for SVE can be found at
> https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
>
> ========
> Contents
> ========
>
> The rest of this RFC covers the following topics:
>
> 1. Types -- a proposal to extend VectorType to be able to represent vectors that
>   have a length which is a runtime-determined multiple of a known base length.
>
> 2. Size Queries - how to reason about the size of types for which the size isn't
>   fully known at compile time.
>
> 3. Representing the runtime multiple of vector length in IR for use in address
>   calculations and induction variable comparisons.
>
> 4. Generating 'constant' values in IR for vectors with a runtime-determined
>   number of elements.
>
> 5. An explanation of splitting/concatenating scalable vectors.
>
> 6. A brief note on code generation of these new operations for AArch64.
>
> 7. An example of C code and matching IR using the proposed extensions.
>
> 8. A list of patches demonstrating the changes required to emit SVE instructions
>   for a loop that has already been vectorized using the extensions described
>   in this RFC.
>
> ========
> 1. Types
> ========
>
> To represent a vector of unknown length a boolean `Scalable` property has been
> added to the `VectorType` class, which indicates that the number of elements in
> the vector is a runtime-determined integer multiple of the `NumElements` field.
> Most code that deals with vectors doesn't need to know the exact length, but
> does need to know relative lengths -- e.g. get a vector with the same number of
> elements but a different element type, or with half or double the number of
> elements.
>
> In order to allow code to transparently support scalable vectors, we introduce
> an `ElementCount` class with two members:
>
> - `unsigned Min`: the minimum number of elements.
> - `bool Scalable`: is the element count an unknown multiple of `Min`?
>
> For non-scalable vectors (``Scalable=false``) the scale is considered to be
> equal to one and thus `Min` represents the exact number of elements in the
> vector.
>
> The intent for code working with vectors is to use convenience methods and avoid
> directly dealing with the number of elements. If needed, `getElementCount` can
> be called on a vector type instead of `getVectorNumElements` to obtain the
> (potentially scalable) number of elements. Overloaded division and
> multiplication operators allow an ElementCount instance to be used in much the
> same manner as an integer for most cases.
>
> This mixture of compile-time and runtime quantities allows us to reason about the
> relationship between different scalable vector types without knowing their
> exact length.
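A minimal C++ sketch of the class as described above may make this concrete. It is inferred from the RFC text only; the actual in-tree interface may differ:

```cpp
#include <cassert>

// Minimal sketch of the proposed ElementCount (members as described in the
// RFC; operator set assumed for illustration).
struct ElementCount {
  unsigned Min;  // minimum number of elements
  bool Scalable; // is the count an unknown runtime multiple of Min?

  ElementCount operator*(unsigned RHS) const { return {Min * RHS, Scalable}; }
  ElementCount operator/(unsigned RHS) const { return {Min / RHS, Scalable}; }
  bool operator==(const ElementCount &RHS) const {
    return Min == RHS.Min && Scalable == RHS.Scalable;
  }
};
```

For example, halving <scalable 4 x i32> yields the element count of <scalable 2 x i32> without ever knowing the runtime multiple.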
>
> The runtime multiple is not expected to change during program execution for SVE,
> but it is possible. The model of scalable vectors presented in this RFC assumes
> that the multiple will be constant within a function but not necessarily across
> functions. As suggested in the recent RISC-V RFC, a new function attribute to
> inherit the multiple across function calls will allow for function calls with
> vector arguments/return values and inlining/outlining optimizations.
>
> IR Textual Form
> ---------------
>
> The textual form for a scalable vector is:
>
> ``<scalable <n> x <type>>``
>
> where `type` is the scalar type of each element, `n` is the minimum number of
> elements, and the string literal `scalable` indicates that the total number of
> elements is an unknown multiple of `n`; `scalable` is just an arbitrary choice
> for indicating that the vector is scalable, and could be substituted with
> another keyword. For fixed-length vectors, `scalable` is omitted, so there is
> no change in the format for existing vectors.
>
> Scalable vectors with the same `Min` value have the same number of elements, and
> the same number of bytes if `Min * sizeof(type)` is the same (assuming they are
> used within the same function):
>
> ``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same number of
>  elements.
>
> ``<scalable 4 x i32>`` and ``<scalable 8 x i16>`` have the same number of
>  bytes.
>
> IR Bitcode Form
> ---------------
>
> To serialize scalable vectors to bitcode, a new boolean field is added to the
> type record. If the field is not present the type will default to a fixed-length
> vector type, preserving backwards compatibility.
>
> Alternatives Considered
> -----------------------
>
> We did consider one main alternative -- a dedicated target type, like the
> x86_mmx type.
>
> A dedicated target type would either need to extend all existing passes that
> work with vectors to recognize the new type, or to duplicate all that code
> in order to get reasonable code generation and autovectorization.
>
> This hasn't been done for the x86_mmx type, and so it is only capable of
> providing support for C-level intrinsics instead of being used and recognized by
> passes inside llvm.
>
> Although our current solution will need to change some of the code that creates
> new VectorTypes, much of that code doesn't need to care about whether the types
> are scalable or not -- they can use preexisting methods like
> `getHalfElementsVectorType`. If the code is a little more complex,
> `ElementCount` structs can be used instead of an `unsigned` value to represent
> the number of elements.
>
> ===============
> 2. Size Queries
> ===============
>
> This is a proposal for how to deal with querying the size of scalable types for
> analysis of IR. While it has not been implemented in full, the general approach
> works well for calculating offsets into structures with scalable types in a
> modified version of ComputeValueVTs in our downstream compiler.
>
> For current IR types that have a known size, all query functions return a single
> integer constant. For scalable types a second integer is needed to indicate the
> number of bytes/bits which need to be scaled by the runtime multiple to obtain
> the actual length.
>
> For primitive types, `getPrimitiveSizeInBits()` will function as it does today,
> except that it will no longer return a size for vector types (it will return 0,
> as it does for other derived types). The majority of calls to this function are
> already for scalar rather than vector types.
>
> For derived types, a function `getScalableSizePairInBits()` will be added, which
> returns a pair of integers (one to indicate unscaled bits, the other for bits
> that need to be scaled by the runtime multiple). For backends that do not need
> to deal with scalable types the existing methods will suffice, but a debug-only
> assert will be added to them to ensure they aren't used on scalable types.
>
> Similar functionality will be added to DataLayout.
>
> Comparisons between sizes will use the following methods, assuming that X and
> Y are non-zero integers and the form is of { unscaled, scaled }.
>
> { X, 0 } <cmp> { Y, 0 }: Normal unscaled comparison.
>
> { 0, X } <cmp> { 0, Y }: Normal comparison within a function, or across
>                         functions that inherit vector length. Cannot be
>                         compared across non-inheriting functions.
>
> { X, 0 } > { 0, Y }: Cannot return true.
>
> { X, 0 } = { 0, Y }: Cannot return true.
>
> { X, 0 } < { 0, Y }: Can return true.
>
> { Xu, Xs } <cmp> { Yu, Ys }: Gets complicated, need to subtract common
>                             terms and try the above comparisons; it
>                             may not be possible to get a good answer.
>
> It's worth noting that we don't expect the last case (mixed scaled and
> unscaled sizes) to occur. Richard Sandiford's proposed C extensions
> (http://lists.llvm.org/pipermail/cfe-dev/2018-May/057830.html) explicitly
> prohibit mixing fixed-size types into sizeless structs.
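The comparison rules above follow from the fact that vscale is the same unknown value (at least 1) on both sides, so the difference between the two sizes is monotonic in vscale and its value at vscale == 1 is an extreme point. A C++ sketch of a tri-state comparison built on that observation (illustrative, not the proposed API):

```cpp
#include <cassert>
#include <cstdint>

// Tri-state result: some comparisons cannot be decided without knowing vscale.
enum class Cmp { Smaller, Equal, Larger, Unknown };

// Compare two sizes of the form unscaled + scaled * vscale, assuming both
// sides see the same runtime vscale (same function, or functions that inherit
// the vector length) and that vscale >= 1.
Cmp compareSizes(std::uint64_t LU, std::uint64_t LS,   // left {unscaled, scaled}
                 std::uint64_t RU, std::uint64_t RS) { // right {unscaled, scaled}
  if (LU == RU && LS == RS)
    return Cmp::Equal;
  // The gap (left - right) is linear in vscale with slope LS - RS, so
  // evaluating both sides at the minimum vscale of 1 bounds the gap.
  const std::uint64_t LMin = LU + LS, RMin = RU + RS;
  if (LS <= RS && LMin < RMin)
    return Cmp::Smaller; // smaller at vscale == 1 and the gap never shrinks
  if (LS >= RS && LMin > RMin)
    return Cmp::Larger;  // larger at vscale == 1 and the gap never shrinks
  return Cmp::Unknown;   // e.g. { X, 0 } vs { 0, Y } with X >= Y
}
```

This reproduces the table above: { X, 0 } vs { 0, Y } can only ever resolve to Smaller (when X < Y), never Equal or Larger.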
>


Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Sam McCall via llvm-dev
(And in case it isn't clear, I *do* think it is reasonable to push on this thread and lobby to get that feedback! You shouldn't be stalled forever, I'm just trying to point out that the focus should be on getting the feedback, not whether there are objections. You want consensus here, not silence.)

On Mon, Jul 30, 2018 at 3:34 AM Chandler Carruth <[hidden email]> wrote:
I strongly suspect that there remains widespread concern with the direction of this, I know I have them.

I don't think that many of the people who have that concern have had time to come back to this RFC and make progress on it, likely because of other commitments or simply the amount of churn around SVE related patches and such. That is at least why I haven't had time to return to this RFC and try to write more detailed feedback.

Certainly, I would want to see pretty clear and considered support for this change to the IR type system from Hal, Chris, Eric and/or other long time maintainers of core LLVM IR components before it moves forward, and I don't see that in this thread.

Put differently: I don't think silence is assent here. You really need some clear signal of consensus.

On Mon, Jul 30, 2018 at 2:23 AM Graham Hunter <[hidden email]> wrote:
Hi,

Are there any objections to going ahead with this? If not, we'll try to get the patches reviewed and committed after the 7.0 branch occurs.

-Graham

> On 2 Jul 2018, at 10:53, Graham Hunter <[hidden email]> wrote:
>
> Hi,
>
> I've updated the RFC slightly based on the discussion within the thread, reposted below. Let me know if I've missed anything or if more clarification is needed.
>
> Thanks,
>
> -Graham
>
> =============================================================
> Supporting SIMD instruction sets with variable vector lengths
> =============================================================
>
> In this RFC we propose extending LLVM IR to support code-generation for variable
> length vector architectures like Arm's SVE or RISC-V's 'V' extension. Our
> approach is backwards compatible and should be as non-intrusive as possible; the
> only change needed in other backends is how size is queried on vector types, and
> it only requires a change in which function is called. We have created a set of
> proof-of-concept patches to represent a simple vectorized loop in IR and
> generate SVE instructions from that IR. These patches (listed in section 7 of
> this RFC) can be found on Phabricator and are intended to illustrate the scope
> of changes required by the general approach described in this RFC.
>
> ==========
> Background
> ==========
>
> *ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension for
> AArch64 which is intended to scale with hardware such that the same binary
> running on a processor with longer vector registers can take advantage of the
> increased compute power without recompilation.
>
> As the vector length is no longer a compile-time known value, the way in which
> the LLVM vectorizer generates code requires modifications such that certain
> values are now runtime evaluated expressions instead of compile-time constants.
>
> Documentation for SVE can be found at
> https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
>
> ========
> Contents
> ========
>
> The rest of this RFC covers the following topics:
>
> 1. Types -- a proposal to extend VectorType to be able to represent vectors that
>   have a length which is a runtime-determined multiple of a known base length.
>
> 2. Size Queries - how to reason about the size of types for which the size isn't
>   fully known at compile time.
>
> 3. Representing the runtime multiple of vector length in IR for use in address
>   calculations and induction variable comparisons.
>
> 4. Generating 'constant' values in IR for vectors with a runtime-determined
>   number of elements.
>
> 5. An explanation of splitting/concatenating scalable vectors.
>
> 6. A brief note on code generation of these new operations for AArch64.
>
> 7. An example of C code and matching IR using the proposed extensions.
>
> 8. A list of patches demonstrating the changes required to emit SVE instructions
>   for a loop that has already been vectorized using the extensions described
>   in this RFC.
>
> ========
> 1. Types
> ========
>
> To represent a vector of unknown length a boolean `Scalable` property has been
> added to the `VectorType` class, which indicates that the number of elements in
> the vector is a runtime-determined integer multiple of the `NumElements` field.
> Most code that deals with vectors doesn't need to know the exact length, but
> does need to know relative lengths -- e.g. get a vector with the same number of
> elements but a different element type, or with half or double the number of
> elements.
>
> In order to allow code to transparently support scalable vectors, we introduce
> an `ElementCount` class with two members:
>
> - `unsigned Min`: the minimum number of elements.
> - `bool Scalable`: is the element count an unknown multiple of `Min`?
>
> For non-scalable vectors (``Scalable=false``) the scale is considered to be
> equal to one and thus `Min` represents the exact number of elements in the
> vector.
>
> The intent for code working with vectors is to use convenience methods and avoid
> directly dealing with the number of elements. If needed, calling
> `getElementCount` on a vector type instead of `getVectorNumElements` can be used
> to obtain the (potentially scalable) number of elements. Overloaded division and
> multiplication operators allow an ElementCount instance to be used in much the
> same manner as an integer for most cases.
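For illustration, a minimal sketch of such a class follows (hedged: the names and operator set here are illustrative, not the exact class proposed for LLVM):

```cpp
#include <cassert>

// Illustrative sketch of the proposed ElementCount -- not the exact
// LLVM implementation.
struct ElementCount {
  unsigned Min;   // minimum number of elements
  bool Scalable;  // true if the real count is (runtime multiple) * Min

  ElementCount(unsigned M, bool S) : Min(M), Scalable(S) {}

  // Halving/doubling the element count preserves scalability, so code
  // asking for "a vector with half the elements" works transparently
  // for both fixed-length and scalable vectors.
  ElementCount operator/(unsigned RHS) const {
    assert(Min % RHS == 0 && "Min not divisible by RHS");
    return ElementCount(Min / RHS, Scalable);
  }
  ElementCount operator*(unsigned RHS) const {
    return ElementCount(Min * RHS, Scalable);
  }
  bool operator==(const ElementCount &RHS) const {
    return Min == RHS.Min && Scalable == RHS.Scalable;
  }
};
```

For example, halving the element count of `<scalable 4 x i32>` yields the count for `<scalable 2 x i32>` without the caller needing to test the `Scalable` flag.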
>
> This mixture of compile-time and runtime quantities allows us to reason about the
> relationship between different scalable vector types without knowing their
> exact length.
>
> The runtime multiple is not expected to change during program execution for SVE,
> but it is possible. The model of scalable vectors presented in this RFC assumes
> that the multiple will be constant within a function but not necessarily across
> functions. As suggested in the recent RISC-V RFC, a new function attribute to
> inherit the multiple across function calls will allow for function calls with
> vector arguments/return values and inlining/outlining optimizations.
>
> IR Textual Form
> ---------------
>
> The textual form for a scalable vector is:
>
> ``<scalable <n> x <type>>``
>
> where `type` is the scalar type of each element, `n` is the minimum number of
> elements, and the string literal `scalable` indicates that the total number of
> elements is an unknown multiple of `n`; `scalable` is just an arbitrary choice
> for indicating that the vector is scalable, and could be substituted by another.
> For fixed-length vectors, the `scalable` is omitted, so there is no change in
> the format for existing vectors.
>
> Scalable vectors with the same `Min` value have the same number of elements, and
> the same number of bytes if `Min * sizeof(type)` is the same (assuming they are
> used within the same function):
>
> ``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same number of
>  elements.
>
> ``<scalable 4 x i32>`` and ``<scalable 8 x i16>`` have the same number of
>  bytes.
>
> IR Bitcode Form
> ---------------
>
> To serialize scalable vectors to bitcode, a new boolean field is added to the
> type record. If the field is not present the type will default to a fixed-length
> vector type, preserving backwards compatibility.
>
> Alternatives Considered
> -----------------------
>
> We did consider one main alternative -- a dedicated target type, like the
> x86_mmx type.
>
> A dedicated target type would either need to extend all existing passes that
> work with vectors to recognize the new type, or to duplicate all that code
> in order to get reasonable code generation and autovectorization.
>
> This hasn't been done for the x86_mmx type, and so it is only capable of
> providing support for C-level intrinsics instead of being used and recognized by
> passes inside llvm.
>
> Although our current solution will need to change some of the code that creates
> new VectorTypes, much of that code doesn't need to care about whether the types
> are scalable or not -- they can use preexisting methods like
> `getHalfElementsVectorType`. If the code is a little more complex,
> `ElementCount` structs can be used instead of an `unsigned` value to represent
> the number of elements.
>
> ===============
> 2. Size Queries
> ===============
>
> This is a proposal for how to deal with querying the size of scalable types for
> analysis of IR. While it has not been implemented in full, the general approach
> works well for calculating offsets into structures with scalable types in a
> modified version of ComputeValueVTs in our downstream compiler.
>
> For current IR types that have a known size, all query functions return a single
> integer constant. For scalable types a second integer is needed to indicate the
> number of bytes/bits which need to be scaled by the runtime multiple to obtain
> the actual length.
>
> For primitive types, `getPrimitiveSizeInBits()` will function as it does today,
> except that it will no longer return a size for vector types (it will return 0,
> as it does for other derived types). The majority of calls to this function are
> already for scalar rather than vector types.
>
> For derived types, a function `getScalableSizePairInBits()` will be added, which
> returns a pair of integers (one to indicate unscaled bits, the other for bits
> that need to be scaled by the runtime multiple). For backends that do not need
> to deal with scalable types the existing methods will suffice, but a debug-only
> assert will be added to them to ensure they aren't used on scalable types.
>
> Similar functionality will be added to DataLayout.
>
> Comparisons between sizes will use the following methods, assuming that X and
> Y are non-zero integers and sizes take the form { unscaled, scaled }.
>
> { X, 0 } <cmp> { Y, 0 }: Normal unscaled comparison.
>
> { 0, X } <cmp> { 0, Y }: Normal comparison within a function, or across
>                         functions that inherit vector length. Cannot be
>                         compared across non-inheriting functions.
>
> { X, 0 } > { 0, Y }: Cannot return true.
>
> { X, 0 } = { 0, Y }: Cannot return true.
>
> { X, 0 } < { 0, Y }: Can return true.
>
> { Xu, Xs } <cmp> { Yu, Ys }: Gets complicated, need to subtract common
>                             terms and try the above comparisons; it
>                             may not be possible to get a good answer.
>
> It's worth noting that we don't expect the last case (mixed scaled and
> unscaled sizes) to occur. Richard Sandiford's proposed C extensions
> (http://lists.llvm.org/pipermail/cfe-dev/2018-May/057830.html) explicitly
> prohibit mixing fixed-size types into sizeless structs.
>
> I don't know if we need a 'maybe' or 'unknown' result for cases comparing scaled
> vs. unscaled; I believe the gcc implementation of SVE allows for such
> results, but that supports a generic polynomial length representation.
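The comparison rules above can be sketched with a small three-valued helper (hypothetical names; the real interface may differ, and `Unknown` stands in for "cannot be determined at compile time"):

```cpp
#include <cassert>

// Hypothetical sketch of comparing { unscaled, scaled } size pairs
// within a single function, where the runtime multiple (vscale >= 1)
// is the same on both sides.
enum class SizeCmp { True, False, Unknown };

struct SizePair {
  unsigned Unscaled; // bits not multiplied by vscale
  unsigned Scaled;   // bits multiplied by vscale
};

// Is LHS provably smaller than RHS?
SizeCmp isSmallerThan(SizePair LHS, SizePair RHS) {
  // { X, 0 } < { Y, 0 }: normal unscaled comparison.
  if (LHS.Scaled == 0 && RHS.Scaled == 0)
    return LHS.Unscaled < RHS.Unscaled ? SizeCmp::True : SizeCmp::False;
  // { 0, X } < { 0, Y }: vscale cancels, normal comparison.
  if (LHS.Unscaled == 0 && RHS.Unscaled == 0)
    return LHS.Scaled < RHS.Scaled ? SizeCmp::True : SizeCmp::False;
  // { 0, X } < { Y, 0 }: a scaled size can never be proven smaller than
  // an unscaled one (mirror of "{ X, 0 } > { 0, Y }: Cannot return true").
  if (LHS.Unscaled == 0 && RHS.Scaled == 0)
    return SizeCmp::False;
  // { X, 0 } < { 0, Y }: provable when X < Y, since vscale >= 1 means
  // the scaled size is at least Y bits.
  if (LHS.Scaled == 0 && RHS.Unscaled == 0)
    return LHS.Unscaled < RHS.Scaled ? SizeCmp::True : SizeCmp::Unknown;
  // { Xu, Xs } <cmp> { Yu, Ys }: mixed terms, not expected to occur.
  return SizeCmp::Unknown;
}
```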
>
> My current intention is to rely on functions that clone or copy values to
> check whether they are being used to copy scalable vectors across function
> boundaries without the inherit vlen attribute, and to raise an error there
> instead of requiring the Function a given type size came from to be passed
> in for each comparison. If there's a strong preference for moving the check
> into the size comparison functions, let me know; I will start work on
> patches for this later in the year if there are no major problems with the
> idea.
>
> Future Work
> -----------
>
> Since we cannot determine the exact size of a scalable vector, the
> existing logic for alias detection won't work when multiple accesses
> share a common base pointer with different offsets.
>
> However, SVE's predication will mean that a dynamic 'safe' vector length
> can be determined at runtime, so after initial support has been added we
> can work on vectorizing loops using runtime predication to avoid aliasing
> problems.
>
> Alternatives Considered
> -----------------------
>
> Marking scalable vectors as unsized doesn't work well, as many parts of
> llvm dealing with loads and stores assert that 'isSized()' returns true
> and make use of the size when calculating offsets.
>
> We have considered introducing multiple helper functions instead of
> using direct size queries, but that doesn't cover all cases. It may
> still be a good idea to introduce them to make the purpose in a given
> case more obvious, e.g. 'requiresSignExtension(Type*,Type*)'.
>
> ========================================
> 3. Representing Vector Length at Runtime
> ========================================
>
> With a scalable vector type defined, we now need a way to represent the runtime
> length in IR in order to generate addresses for consecutive vectors in memory
> and determine how many elements have been processed in an iteration of a loop.
>
> We have added an experimental `vscale` intrinsic to represent the runtime
> multiple. Multiplying the result of this intrinsic by the minimum number of
> elements in a vector gives the total number of elements in a scalable vector.
>
> Fixed-Length Code
> -----------------
>
> Assuming a vector type of <4 x <ty>>
> ``
> vector.body:
>  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>  ;; <loop body>
>  ;; Increment induction var
>  %index.next = add i64 %index, 4
>  ;; <check and branch>
> ``
> Scalable Equivalent
> -------------------
>
> Assuming a vector type of <scalable 4 x <ty>>
> ``
> vector.body:
>  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>  ;; <loop body>
>  ;; Increment induction var
>  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>  %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
>  ;; <check and branch>
> ``
> ===========================
> 4. Generating Vector Values
> ===========================
> For constant vector values, we cannot specify all the elements as we can for
> fixed-length vectors; fortunately only a small number of easily synthesized
> patterns are required for autovectorization. The `zeroinitializer` constant
> can be used in the same manner as fixed-length vectors for a constant zero
> splat. This can then be combined with `insertelement` and `shufflevector`
> to create arbitrary value splats in the same manner as fixed-length vectors.
>
> For constants consisting of a sequence of values, an experimental `stepvector`
> intrinsic has been added to represent a simple constant of the form
> `<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new
> start can be added, and changing the step requires multiplying by a splat.
>
> Fixed-Length Code
> -----------------
> ``
>  ;; Splat a value
>  %insert = insertelement <4 x i32> undef, i32 %value, i32 0
>  %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x i32> zeroinitializer
>  ;; Add a constant sequence
>  %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
> ``
> Scalable Equivalent
> -------------------
> ``
>  ;; Splat a value
>  %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
>  %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>  ;; Splat offset + stride (the same in this case)
>  %insert2 = insertelement <scalable 4 x i32> undef, i32 2, i32 0
>  %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>  ;; Create sequence for scalable vector
>  %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
>  %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
>  %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
>  ;; Add the runtime-generated sequence
>  %add = add <scalable 4 x i32> %splat, %addoffset
> ``
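As a cross-check on the sequence above, here is a scalar model of what the scalable IR computes for a given runtime vscale (a hedged sketch, not LLVM code; `splatPlusSequence` is an illustrative name):

```cpp
#include <cassert>
#include <vector>

// Scalar model of the scalable splat + stepvector sequence above:
// result[i] = value + (start + i * stride), which matches
// %add = %splat + (%stepvector * %str_off + %str_off)
// when start == stride == 2. The vector holds vscale * 4 elements
// for a <scalable 4 x i32>.
std::vector<int> splatPlusSequence(int value, int start, int stride,
                                   unsigned vscale) {
  unsigned NumElems = vscale * 4; // <scalable 4 x i32>
  std::vector<int> Result(NumElems);
  for (unsigned i = 0; i < NumElems; ++i)
    Result[i] = value + (start + static_cast<int>(i) * stride);
  return Result;
}
```

With vscale == 1 this reproduces the fixed-length example exactly: splatting `%value` and adding `<2, 4, 6, 8>`.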
> Future Work
> -----------
>
> Intrinsics cannot currently be used for constant folding. Our downstream
> compiler (using Constants instead of intrinsics) relies quite heavily on this
> for good code generation, so we will need to find new ways to recognize and
> fold these values.
>
> ===========================================
> 5. Splitting and Combining Scalable Vectors
> ===========================================
>
> Splitting and combining scalable vectors in IR is done in the same manner as
> for fixed-length vectors, but with a non-constant mask for the shufflevector.
>
> The following is an example of splitting a <scalable 4 x double> into two
> separate <scalable 2 x double> values.
>
> ``
>  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>  ;; Stepvector generates the element ids for first subvector
>  %sv1 = call <scalable 2 x i64> @llvm.experimental.vector.stepvector.nxv2i64()
>  ;; Add vscale * 2 to get the starting element for the second subvector
>  %ec = mul i64 %vscale64, 2
>  %ec.ins = insertelement <scalable 2 x i64> undef, i64 %ec, i32 0
>  %ec.splat = shufflevector <scalable 2 x i64> %ec.ins, <scalable 2 x i64> undef, <scalable 2 x i32> zeroinitializer
>  %sv2 = add <scalable 2 x i64> %ec.splat, %sv1
>  ;; Perform the extracts
>  %res1 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv1
>  %res2 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv2
> ``
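A scalar model of this split, with vscale as a free parameter (hedged sketch; illustrative names only):

```cpp
#include <cassert>
#include <vector>

// Model of splitting a <scalable 4 x double> into two
// <scalable 2 x double> halves: the first shuffle mask is
// <0, 1, ..., 2*vscale - 1>, and the second is that sequence
// offset by 2 * vscale.
void splitScalable(const std::vector<double> &In, unsigned vscale,
                   std::vector<double> &Lo, std::vector<double> &Hi) {
  unsigned Half = vscale * 2; // elements per <scalable 2 x double>
  assert(In.size() == vscale * 4 && "input must model <scalable 4 x double>");
  Lo.assign(In.begin(), In.begin() + Half); // mask %sv1 = <0 .. Half-1>
  Hi.assign(In.begin() + Half, In.end());   // mask %sv2 = %sv1 + Half
}
```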
>
> ==================
> 6. Code Generation
> ==================
>
> IR splats will be converted to an experimental splatvector intrinsic in
> SelectionDAGBuilder.
>
> All three intrinsics are custom lowered and legalized in the AArch64 backend.
>
> Two new AArch64ISD nodes have been added to represent the same concepts
> at the SelectionDAG level, while splatvector maps onto the existing
> AArch64ISD::DUP.
>
> GlobalISel
> ----------
>
> Since GlobalISel was enabled by default on AArch64, it was necessary to add
> scalable vector support to the LowLevelType implementation. A single bit was
> added to the raw_data representation for vectors and vectors of pointers.
>
> In addition, types that only exist in destination patterns are now included in
> the enumeration of available types for generated code. While this may not be
> necessary in future, generating an all-true 'ptrue' value was necessary to
> convert a predicated instruction into an unpredicated one.
>
> ==========
> 7. Example
> ==========
>
> The following example shows a simple C loop which assigns the array index to
> the array elements matching that index. The IR shows how vscale and stepvector
> are used to create the needed values and to advance the index variable in the
> loop.
>
> C Code
> ------
>
> ``
> void IdentityArrayInit(int *a, int count) {
>  for (int i = 0; i < count; ++i)
>    a[i] = i;
> }
> ``
>
> Scalable IR Vector Body
> -----------------------
>
> ``
> vector.body.preheader:
>  ;; Other setup
>  ;; Stepvector used to create initial identity vector
>  %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
>  br label %vector.body
>
> vector.body:
>  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>  %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]
>
>           ;; stepvector used for index identity on entry to loop body ;;
>  %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],
>                                     [ %stepvector, %vector.body.preheader ]
>  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>  %vscale32 = trunc i64 %vscale64 to i32
>  %1 = add i64 %0, mul (i64 %vscale64, i64 4)
>
>           ;; vscale splat used to increment identity vector ;;
>  %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32 %vscale32, i32 4), i32 0
>  %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>  %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
>  %2 = getelementptr inbounds i32, i32* %a, i64 %0
>  %3 = bitcast i32* %2 to <scalable 4 x i32>*
>  store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align 4
>
>           ;; vscale used to increment loop index
>  %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
>  %4 = icmp eq i64 %index.next, %n.vec
>  br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
> ``
>
> ==========
> 8. Patches
> ==========
>
> List of patches:
>
> 1. Extend VectorType: https://reviews.llvm.org/D32530
> 2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768
> 3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
> 4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
> 5. SVE Calling Convention: https://reviews.llvm.org/D47771
> 6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
> 7. Add VScale intrinsic: https://reviews.llvm.org/D47773
> 8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
> 9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
> 10. Initial store patterns: https://reviews.llvm.org/D47776
> 11. Initial addition patterns: https://reviews.llvm.org/D47777
> 12. Initial left-shift patterns: https://reviews.llvm.org/D47778
> 13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
> 14. Prevectorized loop unit test: https://reviews.llvm.org/D47780
>


_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Sam McCall via llvm-dev
In reply to this post by Sam McCall via llvm-dev
Chandler Carruth wrote:

> I strongly suspect that there remains widespread concern with the
> direction of this, I know I have them.
>
> I don't think that many of the people who have that concern have had
> time to come back to this RFC and make progress on it, likely because
> of other commitments or simply the amount of churn around SVE related
> patches and such. That is at least why I haven't had time to return to
> this RFC and try to write more detailed feedback.

We believe ARM SVE will be an important architecture going forward.  As
such, it's important to us that these questions and concerns get posted
and discussed, whatever the outcome may be.  If there are objections,
alternative proposals would be helpful.

I see a lot of SVE patches on Phab that are described as "not for
review."  I don't know how helpful that is.  It would be more helpful to
have actual patches intended for review/commit.  It is difficult to know
which is which in Phab.  Could patches not intended for review either be
removed if not needed, or their subjects updated to indicate they are
not for review but for discussion purposes so that it's easier to filter
search results?

                                  -David

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Sam McCall via llvm-dev
In reply to this post by Sam McCall via llvm-dev

On 07/30/2018 05:34 AM, Chandler Carruth wrote:
I strongly suspect that there remains widespread concern with the direction of this, I know I have them.

I don't think that many of the people who have that concern have had time to come back to this RFC and make progress on it, likely because of other commitments or simply the amount of churn around SVE related patches and such. That is at least why I haven't had time to return to this RFC and try to write more detailed feedback.

Certainly, I would want to see pretty clear and considered support for this change to the IR type system from Hal, Chris, Eric and/or other long time maintainers of core LLVM IR components before it moves forward, and I don't see that in this thread.

At a high level, I'm happy with this approach. I think it will be important for LLVM to support runtime-determined vector lengths - I see the customizability and power-efficiency constraints that motivate these designs continuing to increase in importance. I'm still undecided on whether this makes vector code nicer even for fixed-vector-length architectures, but some of the design decisions that it forces, such as having explicit intrinsics for reductions and other horizontal operations, seem like the right direction regardless. I have two questions:

1.
> This is a proposal for how to deal with querying the size of scalable types for
> analysis of IR. While it has not been implemented in full,

Is this still true? The details here need to all work out, obviously, and we should make sure that any issues are identified.

2. I know that there has been some discussion around support for changing the vector length during program execution (e.g., to account for some (proposed?) RISC-V feature), perhaps even during the execution of a single function. I'm very concerned about this idea because it is not at all clear to me how to limit information transfer contaminated with the vector size from propagating between different regions. As a result, I'm concerned about trying to add this on later, and so if this is part of the plan, I think that we need to think through the details up front because it could have a major impact on the design.

Thanks again,
Hal


Put differently: I don't think silence is assent here. You really need some clear signal of consensus.

On Mon, Jul 30, 2018 at 2:23 AM Graham Hunter <[hidden email]> wrote:
Hi,

Are there any objections to going ahead with this? If not, we'll try to get the patches reviewed and committed after the 7.0 branch occurs.

-Graham

> On 2 Jul 2018, at 10:53, Graham Hunter <[hidden email]> wrote:
>
> Hi,
>
> I've updated the RFC slightly based on the discussion within the thread, reposted below. Let me know if I've missed anything or if more clarification is needed.
>
> Thanks,
>
> -Graham
>
> =============================================================
> Supporting SIMD instruction sets with variable vector lengths
> =============================================================
>
> In this RFC we propose extending LLVM IR to support code-generation for variable
> length vector architectures like Arm's SVE or RISC-V's 'V' extension. Our
> approach is backwards compatible and should be as non-intrusive as possible; the
> only change needed in other backends is how size is queried on vector types, and
> it only requires a change in which function is called. We have created a set of
> proof-of-concept patches to represent a simple vectorized loop in IR and
> generate SVE instructions from that IR. These patches (listed in section 7 of
> this rfc) can be found on Phabricator and are intended to illustrate the scope
> of changes required by the general approach described in this RFC.
>
> ==========
> Background
> ==========
>
> *ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension for
> AArch64 which is intended to scale with hardware such that the same binary
> running on a processor with longer vector registers can take advantage of the
> increased compute power without recompilation.
>
> As the vector length is no longer a compile-time known value, the way in which
> the LLVM vectorizer generates code requires modifications such that certain
> values are now runtime evaluated expressions instead of compile-time constants.
>
> Documentation for SVE can be found at
> https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
>
> ========
> Contents
> ========
>
> The rest of this RFC covers the following topics:
>
> 1. Types -- a proposal to extend VectorType to be able to represent vectors that
>   have a length which is a runtime-determined multiple of a known base length.
>
> 2. Size Queries - how to reason about the size of types for which the size isn't
>   fully known at compile time.
>
> 3. Representing the runtime multiple of vector length in IR for use in address
>   calculations and induction variable comparisons.
>
> 4. Generating 'constant' values in IR for vectors with a runtime-determined
>   number of elements.
>
> 5. An explanation of splitting/concatentating scalable vectors.
>
> 6. A brief note on code generation of these new operations for AArch64.
>
> 7. An example of C code and matching IR using the proposed extensions.
>
> 8. A list of patches demonstrating the changes required to emit SVE instructions
>   for a loop that has already been vectorized using the extensions described
>   in this RFC.
>
> ========
> 1. Types
> ========
>
> To represent a vector of unknown length a boolean `Scalable` property has been
> added to the `VectorType` class, which indicates that the number of elements in
> the vector is a runtime-determined integer multiple of the `NumElements` field.
> Most code that deals with vectors doesn't need to know the exact length, but
> does need to know relative lengths -- e.g. get a vector with the same number of
> elements but a different element type, or with half or double the number of
> elements.
>
> In order to allow code to transparently support scalable vectors, we introduce
> an `ElementCount` class with two members:
>
> - `unsigned Min`: the minimum number of elements.
> - `bool Scalable`: is the element count an unknown multiple of `Min`?
>
> For non-scalable vectors (``Scalable=false``) the scale is considered to be
> equal to one and thus `Min` represents the exact number of elements in the
> vector.
>
> The intent for code working with vectors is to use convenience methods and avoid
> directly dealing with the number of elements. If needed, calling
> `getElementCount` on a vector type instead of `getVectorNumElements` can be used
> to obtain the (potentially scalable) number of elements. Overloaded division and
> multiplication operators allow an ElementCount instance to be used in much the
> same manner as an integer for most cases.
>
> This mixture of compile-time and runtime quantities allow us to reason about the
> relationship between different scalable vector types without knowing their
> exact length.
>
> The runtime multiple is not expected to change during program execution for SVE,
> but it is possible. The model of scalable vectors presented in this RFC assumes
> that the multiple will be constant within a function but not necessarily across
> functions. As suggested in the recent RISC-V rfc, a new function attribute to
> inherit the multiple across function calls will allow for function calls with
> vector arguments/return values and inlining/outlining optimizations.
>
> IR Textual Form
> ---------------
>
> The textual form for a scalable vector is:
>
> ``<scalable <n> x <type>>``
>
> where `type` is the scalar type of each element, `n` is the minimum number of
> elements, and the string literal `scalable` indicates that the total number of
> elements is an unknown multiple of `n`; `scalable` is just an arbitrary choice
> for indicating that the vector is scalable, and could be substituted by another.
> For fixed-length vectors, the `scalable` is omitted, so there is no change in
> the format for existing vectors.
>
> Scalable vectors with the same `Min` value have the same number of elements, and
> the same number of bytes if `Min * sizeof(type)` is the same (assuming they are
> used within the same function):
>
> ``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same number of
>  elements.
>
> ``<scalable 4 x i32>`` and ``<scalable 8 x i16>`` have the same number of
>  bytes.
>
> IR Bitcode Form
> ---------------
>
> To serialize scalable vectors to bitcode, a new boolean field is added to the
> type record. If the field is not present the type will default to a fixed-length
> vector type, preserving backwards compatibility.
>
> Alternatives Considered
> -----------------------
>
> We did consider one main alternative -- a dedicated target type, like the
> x86_mmx type.
>
> A dedicated target type would either need to extend all existing passes that
> work with vectors to recognize the new type, or to duplicate all that code
> in order to get reasonable code generation and autovectorization.
>
> This hasn't been done for the x86_mmx type, and so it is only capable of
> providing support for C-level intrinsics instead of being used and recognized by
> passes inside llvm.
>
> Although our current solution will need to change some of the code that creates
> new VectorTypes, much of that code doesn't need to care about whether the types
> are scalable or not -- they can use preexisting methods like
> `getHalfElementsVectorType`. If the code is a little more complex,
> `ElementCount` structs can be used instead of an `unsigned` value to represent
> the number of elements.
>
> ===============
> 2. Size Queries
> ===============
>
> This is a proposal for how to deal with querying the size of scalable types for
> analysis of IR. While it has not been implemented in full, the general approach
> works well for calculating offsets into structures with scalable types in a
> modified version of ComputeValueVTs in our downstream compiler.
>
> For current IR types that have a known size, all query functions return a single
> integer constant. For scalable types a second integer is needed to indicate the
> number of bytes/bits which need to be scaled by the runtime multiple to obtain
> the actual length.
>
> For primitive types, `getPrimitiveSizeInBits()` will function as it does today,
> except that it will no longer return a size for vector types (it will return 0,
> as it does for other derived types). The majority of calls to this function are
> already for scalar rather than vector types.
>
> For derived types, a function `getScalableSizePairInBits()` will be added, which
> returns a pair of integers (one to indicate unscaled bits, the other for bits
> that need to be scaled by the runtime multiple). For backends that do not need
> to deal with scalable types the existing methods will suffice, but a debug-only
> assert will be added to them to ensure they aren't used on scalable types.
>
> Similar functionality will be added to DataLayout.
>
> Comparisons between sizes will use the following methods, assuming that X and
> Y are non-zero integers and the form is of { unscaled, scaled }.
>
> { X, 0 } <cmp> { Y, 0 }: Normal unscaled comparison.
>
> { 0, X } <cmp> { 0, Y }: Normal comparison within a function, or across
>                         functions that inherit vector length. Cannot be
>                         compared across non-inheriting functions.
>
> { X, 0 } > { 0, Y }: Cannot return true.
>
> { X, 0 } = { 0, Y }: Cannot return true.
>
> { X, 0 } < { 0, Y }: Can return true.
>
> { Xu, Xs } <cmp> { Yu, Ys }: Gets complicated, need to subtract common
>                             terms and try the above comparisons; it
>                             may not be possible to get a good answer.
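[Editorial illustration] The comparison rules above can be modeled as a small helper. This is a hypothetical sketch, not part of the proposal: the name `definitely_lt` and the `(unscaled, scaled)` tuple encoding are invented here. The scaled term must be multiplied by a runtime vscale >= 1 before it is a concrete size, so a comparison can be definitely true, definitely false, or undecidable (`None`).

```python
def definitely_lt(a, b):
    """Return True if a < b for every vscale >= 1, False if for none, None if it depends.

    a and b are (unscaled_bits, scaled_bits) pairs; the scaled part is
    multiplied by the runtime vscale (an unknown integer >= 1).
    """
    u = a[0] - b[0]   # difference in unscaled bits
    s = a[1] - b[1]   # difference in scaled bits
    if s == 0:
        # Pure unscaled comparison (or common scaled terms cancel).
        return u < 0
    at_one = u + s    # value of the difference when vscale == 1
    if s < 0:
        # Difference shrinks as vscale grows; the worst case is vscale == 1,
        # so { X, 0 } < { 0, Y } *can* return true (when X < Y).
        return True if at_one < 0 else None
    # s > 0: difference grows with vscale, so a definite "less than" is
    # impossible unless it already fails; { X, 0 } > { 0, Y } cannot be
    # proven, matching the table above.
    return False if at_one >= 0 else None
```

For example, `definitely_lt((4, 0), (0, 8))` is `True` (4 fixed bits vs. at least 8 scaled bits), while `definitely_lt((8, 0), (0, 8))` is `None`: at vscale 1 the sizes are equal, at larger vscales the scaled side wins.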
>
> It's worth noting that we don't expect the last case (mixed scaled and
> unscaled sizes) to occur. Richard Sandiford's proposed C extensions
> (http://lists.llvm.org/pipermail/cfe-dev/2018-May/057830.html) explicitly
> prohibit mixing fixed-size types into sizeless structs.
>
> I don't know if we need a 'maybe' or 'unknown' result for cases comparing scaled
> vs. unscaled; I believe the gcc implementation of SVE allows for such
> results, but that supports a generic polynomial length representation.
>
> My current intention is to rely on functions that clone or copy values to
> check whether they are being used to copy scalable vectors across function
> boundaries without the inherit-vlen attribute, and to raise an error there,
> rather than requiring the Function a given size came from to be passed in
> for every comparison. If there's a strong preference for moving the check
> into the size comparison functions, let me know; I will start work on
> patches for this later in the year if there are no major problems with the
> idea.
>
> Future Work
> -----------
>
> Since we cannot determine the exact size of a scalable vector, the
> existing logic for alias detection won't work when multiple accesses
> share a common base pointer with different offsets.
>
> However, SVE's predication will mean that a dynamic 'safe' vector length
> can be determined at runtime, so after initial support has been added we
> can work on vectorizing loops using runtime predication to avoid aliasing
> problems.
>
> Alternatives Considered
> -----------------------
>
> Marking scalable vectors as unsized doesn't work well, as many parts of
> llvm dealing with loads and stores assert that 'isSized()' returns true
> and make use of the size when calculating offsets.
>
> We have considered introducing multiple helper functions instead of
> using direct size queries, but that doesn't cover all cases. It may
> still be a good idea to introduce them to make the purpose in a given
> case more obvious, e.g. 'requiresSignExtension(Type*,Type*)'.
>
> ========================================
> 3. Representing Vector Length at Runtime
> ========================================
>
> With a scalable vector type defined, we now need a way to represent the runtime
> length in IR in order to generate addresses for consecutive vectors in memory
> and determine how many elements have been processed in an iteration of a loop.
>
> We have added an experimental `vscale` intrinsic to represent the runtime
> multiple. Multiplying the result of this intrinsic by the minimum number of
> elements in a vector gives the total number of elements in a scalable vector.
>
> Fixed-Length Code
> -----------------
>
> Assuming a vector type of <4 x <ty>>
> ``
> vector.body:
>  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>  ;; <loop body>
>  ;; Increment induction var
>  %index.next = add i64 %index, 4
>  ;; <check and branch>
> ``
> Scalable Equivalent
> -------------------
>
> Assuming a vector type of <scalable 4 x <ty>>
> ``
> vector.body:
>  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>  ;; <loop body>
>  ;; Increment induction var
>  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>  %vscale64.x4 = mul i64 %vscale64, 4
>  %index.next = add i64 %index, %vscale64.x4
>  ;; <check and branch>
> ``
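[Editorial illustration] The scalable induction step above can be modeled directly: each iteration advances the index by the runtime multiple times the minimum element count. The helper name and the concrete vscale values below are hypothetical, chosen only for the example.

```python
def next_index(index, vscale, min_elems=4):
    """Model of %index.next: advance by vscale * 4 elements per iteration."""
    return index + vscale * min_elems

# With 128-bit minimum vectors of i32, min_elems is 4; a 512-bit
# implementation has vscale == 4 and processes 16 elements per iteration.
```

So the same IR strides by 4 elements on a 128-bit machine (vscale 1) and by 16 elements on a 512-bit machine (vscale 4), without recompilation.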
> ===========================
> 4. Generating Vector Values
> ===========================
> For constant vector values, we cannot specify all the elements as we can for
> fixed-length vectors; fortunately only a small number of easily synthesized
> patterns are required for autovectorization. The `zeroinitializer` constant
> can be used in the same manner as fixed-length vectors for a constant zero
> splat. This can then be combined with `insertelement` and `shufflevector`
> to create arbitrary value splats in the same manner as fixed-length vectors.
>
> For constants consisting of a sequence of values, an experimental `stepvector`
> intrinsic has been added to represent a simple constant of the form
> `<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new
> start can be added, and changing the step requires multiplying by a splat.
>
> Fixed-Length Code
> -----------------
> ``
>  ;; Splat a value
>  %insert = insertelement <4 x i32> undef, i32 %value, i32 0
>  %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x i32> zeroinitializer
>  ;; Add a constant sequence
>  %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
> ``
> Scalable Equivalent
> -------------------
> ``
>  ;; Splat a value
>  %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
>  %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>  ;; Splat offset + stride (the same in this case)
>  %insert2 = insertelement <scalable 4 x i32> undef, i32 2, i32 0
>  %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>  ;; Create sequence for scalable vector
>  %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
>  %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
>  %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
>  ;; Add the runtime-generated sequence
>  %add = add <scalable 4 x i32> %splat, %addoffset
> ``
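[Editorial illustration] Elementwise, the scalable sequence above computes `value + offset + stride * i` for each lane `i` (in the example, offset and stride are both the splat of 2). A hypothetical model, with the function names and vscale values invented for illustration:

```python
def stepvector(num_elems):
    """Model of llvm.experimental.vector.stepvector: <0, 1, ..., num_elems-1>."""
    return list(range(num_elems))

def scalable_sequence(value, offset, stride, vscale, min_elems=4):
    """Model of the splat + stepvector * stride + offset pattern above."""
    n = vscale * min_elems                    # runtime element count
    return [value + offset + stride * i for i in stepvector(n)]
```

With `value=10`, `offset=2`, `stride=2` and vscale 1, this yields `[12, 14, 16, 18]`, matching the fixed-length `%splat + <2, 4, 6, 8>` example.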
> Future Work
> -----------
>
> Intrinsics cannot currently be used for constant folding. Our downstream
> compiler (using Constants instead of intrinsics) relies quite heavily on this
> for good code generation, so we will need to find new ways to recognize and
> fold these values.
>
> ===========================================
> 5. Splitting and Combining Scalable Vectors
> ===========================================
>
> Splitting and combining scalable vectors in IR is done in the same manner as
> for fixed-length vectors, but with a non-constant mask for the shufflevector.
>
> The following is an example of splitting a <scalable 4 x double> into two
> separate <scalable 2 x double> values.
>
> ``
>  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>  ;; Stepvector generates the element ids for first subvector
>  %sv1 = call <scalable 2 x i64> @llvm.experimental.vector.stepvector.nxv2i64()
>  ;; Add vscale * 2 to get the starting element for the second subvector
>  %ec = mul i64 %vscale64, 2
>  %ec.ins = insertelement <scalable 2 x i64> undef, i64 %ec, i32 0
>  %ec.splat = shufflevector <scalable 2 x i64> %ec.ins, <scalable 2 x i64> undef, <scalable 2 x i32> zeroinitializer
>  %sv2 = add <scalable 2 x i64> %ec.splat, %sv1
>  ;; Perform the extracts
>  %res1 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv1
>  %res2 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv2
> ``
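[Editorial illustration] The two shufflevector index vectors above select consecutive element ranges: the low half uses the plain stepvector, and the high half is the stepvector plus the splat of `vscale * 2`. A hypothetical model (names invented here) for an arbitrary runtime vscale:

```python
def split_indices(vscale, half_min_elems=2):
    """Model of %sv1 and %sv2: element ids for the two halves of the split."""
    n = vscale * half_min_elems        # elements per <scalable 2 x double> half
    sv1 = list(range(n))               # stepvector: low half
    sv2 = [i + n for i in sv1]         # stepvector + splat(vscale * 2): high half
    return sv1, sv2
```

For vscale 2, the halves extract elements `[0..3]` and `[4..7]` of the 8-element input.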
>
> ==================
> 6. Code Generation
> ==================
>
> IR splats will be converted to an experimental splatvector intrinsic in
> SelectionDAGBuilder.
>
> All three intrinsics are custom lowered and legalized in the AArch64 backend.
>
> Two new AArch64ISD nodes have been added to represent the same concepts
> at the SelectionDAG level, while splatvector maps onto the existing
> AArch64ISD::DUP.
>
> GlobalISel
> ----------
>
> Since GlobalISel was enabled by default on AArch64, it was necessary to add
> scalable vector support to the LowLevelType implementation. A single bit was
> added to the raw_data representation for vectors and vectors of pointers.
>
> In addition, types that only exist in destination patterns are planted in
> the enumeration of available types for generated code. While this may not be
> necessary in future, generating an all-true 'ptrue' value was necessary to
> convert a predicated instruction into an unpredicated one.
>
> ==========
> 7. Example
> ==========
>
> The following example shows a simple C loop which assigns the array index to
> the array elements matching that index. The IR shows how vscale and stepvector
> are used to create the needed values and to advance the index variable in the
> loop.
>
> C Code
> ------
>
> ``
> void IdentityArrayInit(int *a, int count) {
>  for (int i = 0; i < count; ++i)
>    a[i] = i;
> }
> ``
>
> Scalable IR Vector Body
> -----------------------
>
> ``
> vector.body.preheader:
>  ;; Other setup
>  ;; Stepvector used to create initial identity vector
>  %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
>  br label %vector.body
>
> vector.body:
>  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>  %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]
>
>           ;; stepvector used for index identity on entry to loop body ;;
>  %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],
>                                     [ %stepvector, %vector.body.preheader ]
>  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>  %vscale32 = trunc i64 %vscale64 to i32
>  %vscale64.x4 = mul i64 %vscale64, 4
>  %1 = add i64 %0, %vscale64.x4
>
>           ;; vscale splat used to increment identity vector ;;
>  %vscale32.x4 = mul i32 %vscale32, 4
>  %insert = insertelement <scalable 4 x i32> undef, i32 %vscale32.x4, i32 0
>  %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>  %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
>  %2 = getelementptr inbounds i32, i32* %a, i64 %0
>  %3 = bitcast i32* %2 to <scalable 4 x i32>*
>  store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align 4
>
>           ;; vscale used to increment loop index
>  %index.next = add i64 %index, %vscale64.x4
>  %4 = icmp eq i64 %index.next, %n.vec
>  br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
> ``
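[Editorial illustration] The semantics of the vector body above can be modeled in plain Python for a hypothetical vscale. The helper name and the scalar tail loop are my additions; the IR above shows only the main vector loop (the tail is handled after `%middle.block`).

```python
def identity_array_init(count, vscale, min_elems=4):
    """Model of the vectorized IdentityArrayInit for a given runtime vscale."""
    vl = vscale * min_elems                  # elements per vector iteration
    a = [0] * count
    n_vec = (count // vl) * vl               # main-loop trip count, as in %n.vec
    vec_ind = list(range(vl))                # initial stepvector (%vec.ind7)
    index = 0
    while index != n_vec:
        for lane, val in enumerate(vec_ind):
            a[index + lane] = val            # the scalable vector store
        vec_ind = [v + vl for v in vec_ind]  # %step.add8: add splat(vscale * 4)
        index += vl                          # %index.next
    for i in range(n_vec, count):            # scalar tail, not shown in the IR
        a[i] = i
    return a
```

For example, with `count=10` and vscale 2 the main loop runs once over 8 elements and the tail fills the remaining two, producing `a[i] == i` throughout.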
>
> ==========
> 8. Patches
> ==========
>
> List of patches:
>
> 1. Extend VectorType: https://reviews.llvm.org/D32530
> 2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768
> 3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
> 4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
> 5. SVE Calling Convention: https://reviews.llvm.org/D47771
> 6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
> 7. Add VScale intrinsic: https://reviews.llvm.org/D47773
> 8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
> 9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
> 10. Initial store patterns: https://reviews.llvm.org/D47776
> 11. Initial addition patterns: https://reviews.llvm.org/D47777
> 12. Initial left-shift patterns: https://reviews.llvm.org/D47778
> 13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
> 14. Prevectorized loop unit test: https://reviews.llvm.org/D47780
>


-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Sam McCall via llvm-dev
Hal Finkel via llvm-dev <[hidden email]> writes:

> 2. I know that there has been some discussion around support for
> changing the vector length during program execution (e.g., to account
> for some (proposed?) RISC-V feature), perhaps even during the
> execution of a single function. I'm very concerned about this idea
> because it is not at all clear to me how to limit information transfer
> contaminated with the vector size from propagating between different
> regions. As a result, I'm concerned about trying to add this on later,
> and so if this is part of the plan, I think that we need to think
> through the details up front because it could have a major impact on
> the design.

Can you elaborate a bit on your concerns?  I'm not sure how allowing
vector length changes impacts the design of this proposal.  As far as I
understand things, this proposal is about dealing with unknown vector
lengths, providing types and intrinsics where needed to support
necessary operations.  It seems to me that building support for changing
the vscale at runtime is somewhat orthogonal.  That is, anyone doing
such a thing will probably have to provide some more intrinsics to
capture the dependency chains and prevent miscompilation, but the basic
types and other intrinsics would remain the same.

What I'm going to say below is from my (narrow) perspective of machines
I've worked on.  It's not meant to cover all possibilities, things
people might do with RISC-V etc.  I intend it as a (common) example for
discussion.

Changing vector length during execution of a loop (for the last
iteration, typically) is very common for architectures without
predication.  Traditional Cray processors, for example, had a vector
length register.  The compiler had to manage updates to the vl register
just like any other register and instructions used vl as an implicit
operand.

I'm not sure exactly how the SVE proposal would address this kind of
operation.  llvm.experimental.vector.vscale is a vector length read.  I
could imagine a load intrinsic that takes a vscale value as an operand,
thus connecting the vector length to the load and the transitive closure
of its uses.  I could also imagine an intrinsic to change the vscale.
The trick would be to disallow reordering vector length reads and
writes.  None of this seems to require changes to the proposed type
system, only the addition of some (target-specific?) intrinsics.

I think it would be unlikely for anyone to need to change the vector
length during evaluation of an in-register expression.  That is, vector
length changes would normally be made only at observable points in the
program (loads, stores, etc.) and probably only at certain control-flow
boundaries (loop back-edges, function calls/returns and so on).  Thus we
wouldn't need intrinsics or other new IR for every possible operation in
LLVM, only at the boundaries.

                          -David

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Sam McCall via llvm-dev
On Mon, 30 Jul 2018 at 20:57, David A. Greene via llvm-dev
<[hidden email]> wrote:
> I'm not sure exactly how the SVE proposal would address this kind of
> operation.

SVE uses predication. The physical number of lanes doesn't have to
change to have the same effect (alignment, tails).


> I think it would be unlikely for anyone to need to change the vector
> length during evaluation of an in-register expression.

The worry here is not within each instruction but across instructions.
SVE (and, I think, RISC-V) allows the vector register size to be set dynamically.

For example, on the same machine, it may be 256 for one process and
512 for another (for example, to save power).

But the change is via a system register, so in theory, anyone can
write an inline asm in the beginning of a function and change the
vector length to whatever they want.

Worse still, people can do that inside loops, or in a tail loop,
thinking it's a good idea (or that this is a Cray machine :).

AFAIK, the interface for changing the register length will not be
exposed programmatically, so in theory, we should not worry about it.
Any inline asm hack can be considered out of scope / user error.

However, Hal's concern seems to be that, in the event of anyone
planning to add it to their APIs, we need to make sure the proposed
semantics can cope with it (do we need to update the predicates again?
what will vscale mean, then and when?).

If not, we may have to enforce that this will not come to pass in its
current form. In this case, changing it later will require *a lot*
more effort than doing it now.

So, it would be good to get a clear response from the two fronts (SVE
and RISC-V) about the future intention to expose that or not.

--
cheers,
--renato