[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths


Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev
On 1 August 2018 at 18:26, Hal Finkel <[hidden email]> wrote:

>
> On 08/01/2018 06:15 AM, Renato Golin wrote:
>> On Tue, 31 Jul 2018 at 23:46, Hal Finkel via llvm-dev
>> <[hidden email]> wrote:
>>> In some sense, if you make vscale dynamic,
>>> you've introduced dependent types into LLVM's type system, but you've
>>> done it in an implicit manner. It's not clear to me that works. If we
>>> need dependent types, then an explicit dependence seems better. (e.g.,
>>> <scalable <n> x %vscale_var x <type>>)
>> That's a shift from the current proposal and I think we can think
>> about it after the current changes. For now, both SVE and RISC-V are
>> proposing function boundaries for changes in vscale.
>
> I understand. I'm afraid that the function-boundary idea doesn't work
> reasonably.

FWIW, I don't think dependent types really help with the code motion
problems. While using an SSA value in a type would presumably enforce
that instructions mentioning that type have to be dominated by the
definition of said value, the real problem is when you _stop_ using
one vscale (and presumably start using another). For example, we want
to rule out the following:

  %vscale.1 = call i32 @change_vscale(...)
  %v.1 = load <scalable 4 x %vscale.1 x i32> ...
  %vscale.2 = call i32 @change_vscale(...)
  ; vscale changed, but we're still doing things with the old one:
  %v.2 = load <scalable 4 x %vscale.1 x i32> ...

And of course, actually introducing this notion of types mentioning
SSA values into LLVM would be an extraordinarily huge and difficult
step. I did consider something along these lines (I even had a
digression about it in drafts of my RFC, but cut it from the final
version), but I don't think it's viable.

Tying some values to the function they're in, on the other hand, even
has precedent in current LLVM: token values must be confined to one
function (intrinsics are special, of course), so most of the
interprocedural passes already have to be careful about moving certain
kinds of values between functions. It's ad hoc and requires auditing
passes, yes, but it's something we know and have some experience with.

(The similarity to tokens is strong enough that my original proposal
heavily leaned on tokens to encode the restrictions on the optimizer
that are needed for different-vscale-per-function, but I've been
persuaded that it's more trouble than it's worth, hence the "implicit"
approach of this RFC.)

>>
>>
>>> 2. How would the function-call boundary work? Does the function itself
>>> have intrinsics that change the vscale?
>> Functions may not know what their vscale is until they're actually
>> executed. They could even have different vscales for different call
>> sites.
>>
>> AFAIK, it's not up to the compiled program (ie via a function
>> attribute or an inline asm call) to change the vscale, but the
>> kernel/hardware can impose dynamic restrictions on the process. But,
>> for now, only at (binary object) function boundaries.
>
> I'm not sure if that's better or worse than the compiler putting in code
> to indicate that the vscale might change. How do vector function
> arguments work if vscale gets larger? or smaller?

I don't see any way for the OS to change a running process's vscale
without a great amount of cooperation from the program and the
compiler. In general, the kernel has nowhere near enough information
to identify spots where it's safe to fiddle with vscale -- function
call boundaries aren't safe in general, as you point out.  FWIW, in
the RISC-V vector task group we discussed migrating running processes
between cores in heterogeneous architectures (e.g. think big.LITTLE)
that may have different vector register sizes. We quickly agreed that
there's no way to make that work and dismissed the idea. The current
thinking is, if you want to migrate a process that's currently using
the vector unit, you can only migrate it between cores that have the
same kind of vector register file.

For the RISC-V backend I don't want anything to do with OS
shenanigans; I'm exclusively focused on codegen. The backend inserts
machine code in the prologue that configures the vector unit in
whatever way the backend considers best, and this configuration
determines vscale (and some other things that aren't visible to IR).
The caller saves their vector unit state before the call and restores
it after the call returns, so their vscale is not affected by the call
either.

For SVE, I could imagine a function attribute that indicates it's OK
to change vscale at this point (this would probably have to be a very
careful and deliberate decision by a programmer). The backend could
then change vscale in the prologue, either setting it to a specific
value (e.g., one requested by the attribute) or making a libcall that
asks the kernel to adjust vscale if it wants to.

In both cases, the change happens after the caller saved all their
state and before any of the callee's code runs.

That leaves arguments and return values, and more generally any vector
values that are shared (e.g., in memory) between caller and callee.
Indeed it's not possible to share any vectors between two functions
that disagree on how large a vector is (sounds obvious when you put it
that way). If you need to pass vectors in any way, caller and callee
have to agree on vscale as part of the ABI, and the callee does *not*
change vscale but "inherits" it from the caller. On SVE that's the
default ABI; on RISC-V there will be one or more non-default
"vector call" ABIs (as Bruce mentioned in an earlier email).

In IR we could represent these different ABIs through calling
convention numbers, function attributes, or a combination thereof.
With ABIs where caller and callee don't necessarily agree on vscale,
it is simply impossible to pass vector values (and while you can e.g.
pass the caller's vscale value, it probably isn't meaningful to the
callee):

- it's a verifier error if such a function takes or returns scalable
vectors directly
- a source program that e.g. tries to smuggle a vector from one
function to another through heap memory is erroneous
- the optimizer must not introduce such errors in correct input programs

The last point means, for example, that partial inlining can't pull
the computation of a vector value into the caller and pass the result
as a new argument. Such optimizations wouldn't be correct anyway,
regardless of ABI concerns: the instructions that are affected all
depend on vscale and therefore moving them to a different function
changes their behavior. Of course, this doesn't mean all
interprocedural optimizations are invalid. *Complete* inlining, for
example, is always valid.

Of course, all this applies only if caller and callee don't agree on
vscale. With suitable ABIs, all existing optimizations can be applied
without problem.
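To make the two ABI situations concrete, here is a rough IR sketch.
The attribute spellings (`inherits_vscale`, mentioned elsewhere in
this thread, and `own_vscale`) are illustrative placeholders, not
settled syntax:

```llvm
; Sketch only: attribute spellings are illustrative placeholders.

; A callee that inherits vscale shares the caller's vector register
; size, so scalable vectors can be passed and returned as usual.
define <scalable 4 x i32> @sum(<scalable 4 x i32> %a,
                               <scalable 4 x i32> %b) #0 {
  %r = add <scalable 4 x i32> %a, %b
  ret <scalable 4 x i32> %r
}

; A callee that may reconfigure the vector unit (and thus vscale) in
; its prologue must not exchange scalable vectors with its caller;
; scalable vector arguments or returns here would be a verifier error.
define void @standalone_kernel(i32* %src, i32* %dst, i64 %n) #1 {
  ; ... vectorized loop using whatever vscale the prologue set up ...
  ret void
}

attributes #0 = { "inherits_vscale" }
attributes #1 = { "own_vscale" }
```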

> So, if I have some vectorized code, and we figure out that some of it is
> cold, so we outline it, and then the kernel decides to decrease vscale
> for that function, now I have broken the application? Storing a vector
> argument in memory in that function now doesn't store as much data as it
> would have in the caller?
>
>>
>> I don't know how that works at the kernel level (how to detect those
>> boundaries? instrument every branch?) but this is what I understood
>> from the current discussion.
>
> Can we find out?
>
>>
>>
>>> If so, then it's not clear that
>>> the function-call boundary makes sense unless you prevent inlining. If
>>> you prevent inlining, when does that decision get made? Will the
>>> vectorizer need to outline loops? If so, outlining can have a real cost
>>> that's difficult to model. How do return types work?
>> The dynamic nature is not part of the program, so inlining can happen
>> as always. Given that the vectors are agnostic of size and work
>> regardless of what the kernel provides (within safety boundaries), the
>> code generation shouldn't change too much.
>>
>> We may have to create artefacts to restrict the maximum vscale (for
>> safety), but others are better equipped to answer that question.
>>
>>
>>>  1. I can definitely see the use cases for changing vscale dynamically,
>>> and so I do suspect that we'll want that support.
>> At a process/function level, yes. Within the same self-contained
>> sub-graph, I don't know.
>>
>>
>>>  2. LLVM does not have loops as first-class constructs. We only have SSA
>>> (and, thus, dominance), and when specifying restrictions on placement of
>>> things in function bodies, we need to do so in terms of these constructs
>>> that we have (which don't include loops).
>> That's why I was trying to define the "self-contained sub-graph" above
>> (there must be a better term for that). It has to do with data
>> dependencies (scalar|memory -> vector -> scalar|memory), ie. make sure
>> side-effects don't leak out.
>>
>> A loop iteration is usually such a block, but not all are and not all
>> such blocks are loops.
>>
>> Changing vscale inside a function, but outside of those blocks would
>> be "fine", as long as we made sure code movement respects those
>> boundaries and that context would be restored correctly on exceptions.
>> But that's not part of the current proposal.
>
> But I don't know how to implement that restriction without major changes
> to the code base. Such a restriction doesn't follow from use/def chains,
> and if we need a restriction that involves looking for non-SSA
> dependencies (e.g., memory dependencies), then I think that we need
> something different than the current proposal. Explicitly dependent
> types might work, something like intrinsics might work, etc.

Seconded, this is an extraordinarily difficult problem. I've spent
unreasonable amounts of time thinking about ways to model changing
vector sizes and sketching countless designs for it. Multiple times I
convinced myself some clever setup would work, and every time I later
discovered a fatal flaw. Until I settled on "only at function
boundaries", that is, and even that took a few iterations.


Cheers,
Robin

> Thanks again,
> Hal
>
>>
>> Changing vscale inside one of those blocks would be madness. :)
>>
>> cheers,
>> --renato
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev
On 1 August 2018 at 21:43, Hal Finkel <[hidden email]> wrote:

>
> On 08/01/2018 02:00 PM, Graham Hunter wrote:
>> Hi Hal,
>>
>>> On 30 Jul 2018, at 20:10, Hal Finkel <[hidden email]> wrote:
>>>
>>>
>>> On 07/30/2018 05:34 AM, Chandler Carruth wrote:
>>>> I strongly suspect that there remains widespread concern with the direction of this, I know I have them.
>>>>
>>>> I don't think that many of the people who have that concern have had time to come back to this RFC and make progress on it, likely because of other commitments or simply the amount of churn around SVE related patches and such. That is at least why I haven't had time to return to this RFC and try to write more detailed feedback.
>>>>
>>>> Certainly, I would want to see pretty clear and considered support for this change to the IR type system from Hal, Chris, Eric and/or other long time maintainers of core LLVM IR components before it moves forward, and I don't see that in this thread.
>>> At a high level, I'm happy with this approach. I think it will be important for LLVM to support runtime-determined vector lengths - I see the customizability and power-efficiency constraints that motivate these designs continuing to increase in importance. I'm still undecided on whether this makes vector code nicer even for fixed-vector-length architectures, but some of the design decisions that it forces, such as having explicit intrinsics for reductions and other horizontal operations, seem like the right direction regardless.
>> Thanks, that's good to hear.
>>
>>> 1.
>>>> This is a proposal for how to deal with querying the size of scalable types for
>>>>> analysis of IR. While it has not been implemented in full,
>>> Is this still true? The details here need to all work out, obviously, and we should make sure that any issues are identified.
>> Yes. I had hoped to get some more comments on the basic approach before progressing with the implementation, but if it makes more sense to have the implementation available to discuss then I'll start creating patches.
>
> At least on this point, I think that we'll want to have the
> implementation to help make sure there aren't important details we're
> overlooking.

+1

>>
>>> 2. I know that there has been some discussion around support for changing the vector length during program execution (e.g., to account for some (proposed?) RISC-V feature), perhaps even during the execution of a single function. I'm very concerned about this idea because it is not at all clear to me how to limit information transfer contaminated with the vector size from propagating between different regions. As a result, I'm concerned about trying to add this on later, and so if this is part of the plan, I think that we need to think through the details up front because it could have a major impact on the design.
>> I think Robin's email yesterday covered it fairly nicely; this RFC proposes that the hardware length of vectors will be consistent throughout an entire function, so we don't need to limit information inside a function, just between them. For SVE, h/w vector length will likely be consistent across the whole program as well (assuming the programmer doesn't make a prctl call to the kernel to change it) so we could drop that limit too, but I thought it best to come up with a unified approach that would work for both architectures. The 'inherits_vscale' attribute would allow us to continue optimizing across functions for SVE where desired.
>
> I think that this will likely work, although I think we want to invert
> the sense of the attribute. vscale should be inherited by default, and
> some attribute can say that this isn't so. That same attribute, I
> imagine, will also forbid scalable vector function arguments and return
> values on those functions. If we don't have inherited vscale as the
> default, we place an implicit contract on any IR transformation that
> performs outlining that it needs to scan for certain kinds of vector
> operations and add the special attribute, or just always add this
> special attribute, and that just becomes another special case, which
> will only actually manifest on certain platforms, that it's best to avoid.

It's a real relief to hear that you think this "will likely work".

Inverting the attribute seems good to me. I probably proposed not
inheriting by default because that's the default on RISC-V, but your
rationale is convincing.

>>
>> Modelling the dynamic vector length for RVV is something for Robin (or others) to tackle later, but can be thought of (at a high level) as an implicit predicate on all operations.
>
> My point is that, while there may be some sense in which the details can
> be worked out later, we need to have a good-enough understanding of how
> this will work now in order to make sure that we're not making design
> decisions now that make handling the dynamic vscale in a reasonable way
> later more difficult.

Sorry if I'm a broken record, but I believe Graham was referring to
the _active vector length_ or VL here, which has nothing to do with
vscale, dynamic or not. I described earlier why I think the former
doesn't interact with the contents of this RFC in any interesting way.
If you think otherwise, could you elaborate on why you think that?


Cheers,
Robin

> Thanks again,
> Hal
>
>>
>> -Graham
>>
>>> Thanks again,
>>> Hal
>>>
>>> --
>>> Hal Finkel
>>> Lead, Compiler Technology and Programming Languages
>>> Leadership Computing Facility
>>> Argonne National Laboratory
>>>
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev
On 31 July 2018 at 23:32, David A. Greene <[hidden email]> wrote:

> Robin Kruppe <[hidden email]> writes:
>
>> Yes, for RISC-V we definitely need vscale to vary a bit, but are fine
>> with limiting that to function boundaries. The use case is *not*
>> "changing how large vectors are" in the middle of a loop or something
>> like that, which we all agree is very dubious at best. The RISC-V
>> vector unit is just very configurable (number of registers, vector
>> element sizes, etc.) and this configuration can impact how large the
>> vector registers are. For any given vectorized loop nest we want to
>> configure the vector unit to suit that piece of code and run the loop
>> with whatever register size that configuration yields. And when that
>> loop is done, we stop using the vector unit entirely and disable it,
>> so that the next loop can use it differently, possibly with a
>> different register size. For IR modeling purposes, I propose to
>> enlarge "loop nest" to "function" but the same principle applies, it
>> just means all vectorized loops in the function will have to share a
>> configuration.
>>
>> Without getting too far into the details, does this make sense as a
>> use case?
>
> I think so.  If changing vscale has some important advantage (saving
> power?), I wonder how the compiler will deal with very large functions.
> I have seen some truly massive Fortran subroutines with hundreds of loop
> nests in them, possibly with very different iteration counts for each
> one.

Yeah, many loops with different demands on the vector unit in one
function is a problem for the "one vscale per function" approach.
Though for the record, the differences that matter here are not trip
count, but things like register pressure and the bit widths of the
vector elements.

There are some (fragile) workarounds for this problem, such as
splitting up the function. There's also the possibility of optimizing
for this case in the backend: trying to recognize when you can use
different configurations/vscales for two loops without changing
observable behavior (no vector values live between the loops, vscale
doesn't escape, etc.). In general this is of course extremely
difficult, but I hope it'll work well enough in practice to mitigate
this problem somewhat. This is just an educated guess at this point,
we'll have to wait and see how big the impact is on real applications
and real hardware (or simulations thereof).

But at the end of the day, sure, maybe we'll generate sub-optimal code
for some applications. That's still better than making the problem
intractable by being too greedy and ending up with either a broken
compiler or one that can't vary vscale at all.


Cheers,
Robin

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev

On 08/01/2018 02:59 PM, Robin Kruppe wrote:

> On 1 August 2018 at 18:26, Hal Finkel <[hidden email]> wrote:
>> On 08/01/2018 06:15 AM, Renato Golin wrote:
>>> On Tue, 31 Jul 2018 at 23:46, Hal Finkel via llvm-dev
>>> <[hidden email]> wrote:
>>>> In some sense, if you make vscale dynamic,
>>>> you've introduced dependent types into LLVM's type system, but you've
>>>> done it in an implicit manner. It's not clear to me that works. If we
>>>> need dependent types, then an explicit dependence seems better. (e.g.,
>>>> <scalable <n> x %vscale_var x <type>>)
>>> That's a shift from the current proposal and I think we can think
>>> about it after the current changes. For now, both SVE and RISC-V are
>>> proposing function boundaries for changes in vscale.
>> I understand. I'm afraid that the function-boundary idea doesn't work
>> reasonably.
> FWIW, I don't think dependent types really help with the code motion
> problems.

Good point.

...

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev

On 08/01/2018 03:09 PM, Robin Kruppe wrote:

> ...
>> I think that this will likely work, although I think we want to invert
>> the sense of the attribute. vscale should be inherited by default, and
>> some attribute can say that this isn't so. That same attribute, I
>> imagine, will also forbid scalable vector function arguments and return
>> values on those functions. If we don't have inherited vscale as the
>> default, we place an implicit contract on any IR transformation that
>> performs outlining that it needs to scan for certain kinds of vector
>> operations and add the special attribute, or just always add this
>> special attribute, and that just becomes another special case, which
>> will only actually manifest on certain platforms, that it's best to avoid.
> It's a real relief to hear that you think this "will likely work".
>
> Inverting the attribute seems good to me. I probably proposed not
> inheriting by default because that's the default on RISC-V, but your
> rationale is convincing.
>
>>> Modelling the dynamic vector length for RVV is something for Robin (or others) to tackle later, but can be thought of (at a high level) as an implicit predicate on all operations.
>> My point is that, while there may be some sense in which the details can
>> be worked out later, we need to have a good-enough understanding of how
>> this will work now in order to make sure that we're not making design
>> decisions now that make handling the dynamic vscale in a reasonable way
>> later more difficult.
> Sorry if I'm a broken record, but I believe Graham was referring to
> the _active vector length_ or VL here, which has nothing to do with
> vscale, dynamic or not. I described earlier why I think the former
> doesn't interact with the contents of this RFC in any interesting way.
> If you think otherwise, could you elaborate on why you think that?

Was it decided that this issue is equivalent to, or a subset of,
per-lane predication on loads, stores, and similar? Or is it different?

Thanks again,
Hal

>
>
> Cheers,
> Robin
>
>> Thanks again,
>> Hal
>>
>>> -Graham
>>>
>>>> Thanks again,
>>>> Hal
>>>>
>>>> --
>>>> Hal Finkel
>>>> Lead, Compiler Technology and Programming Languages
>>>> Leadership Computing Facility
>>>> Argonne National Laboratory
>>>>
>> --
>> Hal Finkel
>> Lead, Compiler Technology and Programming Languages
>> Leadership Computing Facility
>> Argonne National Laboratory
>>

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev
On 2 August 2018 at 00:25, Hal Finkel <[hidden email]> wrote:

>
> On 08/01/2018 03:09 PM, Robin Kruppe wrote:
>> ...
>>> I think that this will likely work, although I think we want to invert
>>> the sense of the attribute. vscale should be inherited by default, and
>>> some attribute can say that this isn't so. That same attribute, I
>>> imagine, will also forbid scalable vector function arguments and return
>>> values on those functions. If we don't have inherited vscale as the
>>> default, we place an implicit contract on any IR transformation that
>>> performs outlining that it needs to scan for certain kinds of vector
>>> operations and add the special attribute, or just always add this
>>> special attribute, and that just becomes another special case, which
>>> will only actually manifest on certain platforms, that it's best to avoid.
>> It's a real relief to hear that you think this "will likely work".
>>
>> Inverting the attribute seems good to me. I probably proposed not
>> inheriting by default because that's the default on RISC-V, but your
>> rationale is convincing.
>>
>>>> Modelling the dynamic vector length for RVV is something for Robin (or others) to tackle later, but can be thought of (at a high level) as an implicit predicate on all operations.
>>> My point is that, while there may be some sense in which the details can
>>> be worked out later, we need to have a good-enough understanding of how
>>> this will work now in order to make sure that we're not making design
>>> decisions now that make handling the dynamic vscale in a reasonable way
>>> later more difficult.
>> Sorry if I'm a broken record, but I believe Graham was referring to
>> the _active vector length_ or VL here, which has nothing to do with
>> vscale, dynamic or not. I described earlier why I think the former
>> doesn't interact with the contents of this RFC in any interesting way.
>> If you think otherwise, could you elaborate on why you think that?
>
> Was it decided that this issue is equivalent to, or a subset of,
> per-lane predication on loads, stores, and similar? Or is it different?

It is equivalent to a subset. If there are k lanes, vector
instructions execute under a mask that enables the first VL lanes and
disables the remaining (k - VL) lanes. The same kind of mask is
computed by SVE instructions such as whilelt. This style of
predication can be combined with a more conventional, more general
one-bit-per-lane mask, in which case the instruction executes under
the conjunction of the two masks.
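In IR terms, the composition might look something like this. This is
purely illustrative generic IR, not actual SVE or RISC-V intrinsics;
assume %steps holds the lane indices and %vl.splat holds VL broadcast
to every lane:

```llvm
; Illustrative sketch: composing a VL-derived mask with an ordinary
; per-lane predicate. %steps = <0, 1, 2, ...>, %vl.splat = VL in every lane.
%vl.mask = icmp ult <scalable 4 x i64> %steps, %vl.splat ; first VL lanes on
%mask    = and <scalable 4 x i1> %vl.mask, %user.mask    ; conjunction of the two
%v       = call <scalable 4 x i32> @llvm.masked.load.nxv4i32(
               <scalable 4 x i32>* %p, i32 4,
               <scalable 4 x i1> %mask, <scalable 4 x i32> undef)
```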


Cheers,
Robin

> Thanks again,
> Hal
>
>>
>>
>> Cheers,
>> Robin
>>
>>> Thanks again,
>>> Hal
>>>
>>>> -Graham
>>>>
>>>>> Thanks again,
>>>>> Hal
>>>>>
>>>>> --
>>>>> Hal Finkel
>>>>> Lead, Compiler Technology and Programming Languages
>>>>> Leadership Computing Facility
>>>>> Argonne National Laboratory
>>>>>
>>> --
>>> Hal Finkel
>>> Lead, Compiler Technology and Programming Languages
>>> Leadership Computing Facility
>>> Argonne National Laboratory
>>>
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev
Hi,

Just a quick question about bitcode format changes; is there anything special I should be doing for that beyond ensuring the reader can still process older bitcode files correctly?

The code in the main patch will always emit 3 records for a vector type (as opposed to the current 2), but we could omit the third field for fixed-length vectors if that's preferable.

-Graham

> On 1 Aug 2018, at 20:00, Graham Hunter via llvm-dev <[hidden email]> wrote:
>
> Hi Hal,
>
>> On 30 Jul 2018, at 20:10, Hal Finkel <[hidden email]> wrote:
>>
>>
>> On 07/30/2018 05:34 AM, Chandler Carruth wrote:
>>> I strongly suspect that there remains widespread concern with the direction of this, I know I have them.
>>>
>>> I don't think that many of the people who have that concern have had time to come back to this RFC and make progress on it, likely because of other commitments or simply the amount of churn around SVE related patches and such. That is at least why I haven't had time to return to this RFC and try to write more detailed feedback.
>>>
>>> Certainly, I would want to see pretty clear and considered support for this change to the IR type system from Hal, Chris, Eric and/or other long time maintainers of core LLVM IR components before it moves forward, and I don't see that in this thread.
>>
>> At a high level, I'm happy with this approach. I think it will be important for LLVM to support runtime-determined vector lengths - I see the customizability and power-efficiency constraints that motivate these designs continuing to increase in importance. I'm still undecided on whether this makes vector code nicer even for fixed-vector-length architectures, but some of the design decisions that it forces, such as having explicit intrinsics for reductions and other horizontal operations, seem like the right direction regardless.
>
> Thanks, that's good to hear.
>
>> 1.
>>> This is a proposal for how to deal with querying the size of scalable types for
>>>> analysis of IR. While it has not been implemented in full,
>>
>> Is this still true? The details here need to all work out, obviously, and we should make sure that any issues are identified.
>
> Yes. I had hoped to get some more comments on the basic approach before progressing with the implementation, but if it makes more sense to have the implementation available to discuss then I'll start creating patches.
>
>> 2. I know that there has been some discussion around support for changing the vector length during program execution (e.g., to account for some (proposed?) RISC-V feature), perhaps even during the execution of a single function. I'm very concerned about this idea because it is not at all clear to me how to limit information transfer contaminated with the vector size from propagating between different regions. As a result, I'm concerned about trying to add this on later, and so if this is part of the plan, I think that we need to think through the details up front because it could have a major impact on the design.
>
> I think Robin's email yesterday covered it fairly nicely; this RFC proposes that the hardware length of vectors will be consistent throughout an entire function, so we don't need to limit information inside a function, just between them. For SVE, h/w vector length will likely be consistent across the whole program as well (assuming the programmer doesn't make a prctl call to the kernel to change it) so we could drop that limit too, but I thought it best to come up with a unified approach that would work for both architectures. The 'inherits_vscale' attribute would allow us to continue optimizing across functions for SVE where desired.
>
> Modelling the dynamic vector length for RVV is something for Robin (or others) to tackle later, but can be thought of (at a high level) as an implicit predicate on all operations.
>
> -Graham
>
>>
>> Thanks again,
>> Hal
>>
>>>
>>> Put differently: I don't think silence is assent here. You really need some clear signal of consensus.
>>>
>>> On Mon, Jul 30, 2018 at 2:23 AM Graham Hunter <[hidden email]> wrote:
>>> Hi,
>>>
>>> Are there any objections to going ahead with this? If not, we'll try to get the patches reviewed and committed after the 7.0 branch occurs.
>>>
>>> -Graham
>>>
>>>> On 2 Jul 2018, at 10:53, Graham Hunter <[hidden email]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I've updated the RFC slightly based on the discussion within the thread, reposted below. Let me know if I've missed anything or if more clarification is needed.
>>>>
>>>> Thanks,
>>>>
>>>> -Graham
>>>>
>>>> =============================================================
>>>> Supporting SIMD instruction sets with variable vector lengths
>>>> =============================================================
>>>>
>>>> In this RFC we propose extending LLVM IR to support code-generation for variable
>>>> length vector architectures like Arm's SVE or RISC-V's 'V' extension. Our
>>>> approach is backwards compatible and should be as non-intrusive as possible; the
>>>> only change needed in other backends is how size is queried on vector types, and
>>>> it only requires a change in which function is called. We have created a set of
>>>> proof-of-concept patches to represent a simple vectorized loop in IR and
>>>> generate SVE instructions from that IR. These patches (listed in section 8 of
>>>> this RFC) can be found on Phabricator and are intended to illustrate the scope
>>>> of changes required by the general approach described in this RFC.
>>>>
>>>> ==========
>>>> Background
>>>> ==========
>>>>
>>>> *ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension for
>>>> AArch64 which is intended to scale with hardware such that the same binary
>>>> running on a processor with longer vector registers can take advantage of the
>>>> increased compute power without recompilation.
>>>>
>>>> As the vector length is no longer a compile-time known value, the way in which
>>>> the LLVM vectorizer generates code requires modifications such that certain
>>>> values are now runtime evaluated expressions instead of compile-time constants.
>>>>
>>>> Documentation for SVE can be found at
>>>> https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
>>>>
>>>> ========
>>>> Contents
>>>> ========
>>>>
>>>> The rest of this RFC covers the following topics:
>>>>
>>>> 1. Types -- a proposal to extend VectorType to be able to represent vectors that
>>>>  have a length which is a runtime-determined multiple of a known base length.
>>>>
>>>> 2. Size Queries - how to reason about the size of types for which the size isn't
>>>>  fully known at compile time.
>>>>
>>>> 3. Representing the runtime multiple of vector length in IR for use in address
>>>>  calculations and induction variable comparisons.
>>>>
>>>> 4. Generating 'constant' values in IR for vectors with a runtime-determined
>>>>  number of elements.
>>>>
>>>> 5. An explanation of splitting/concatenating scalable vectors.
>>>>
>>>> 6. A brief note on code generation of these new operations for AArch64.
>>>>
>>>> 7. An example of C code and matching IR using the proposed extensions.
>>>>
>>>> 8. A list of patches demonstrating the changes required to emit SVE instructions
>>>>  for a loop that has already been vectorized using the extensions described
>>>>  in this RFC.
>>>>
>>>> ========
>>>> 1. Types
>>>> ========
>>>>
>>>> To represent a vector of unknown length a boolean `Scalable` property has been
>>>> added to the `VectorType` class, which indicates that the number of elements in
>>>> the vector is a runtime-determined integer multiple of the `NumElements` field.
>>>> Most code that deals with vectors doesn't need to know the exact length, but
>>>> does need to know relative lengths -- e.g. get a vector with the same number of
>>>> elements but a different element type, or with half or double the number of
>>>> elements.
>>>>
>>>> In order to allow code to transparently support scalable vectors, we introduce
>>>> an `ElementCount` class with two members:
>>>>
>>>> - `unsigned Min`: the minimum number of elements.
>>>> - `bool Scalable`: is the element count an unknown multiple of `Min`?
>>>>
>>>> For non-scalable vectors (``Scalable=false``) the scale is considered to be
>>>> equal to one and thus `Min` represents the exact number of elements in the
>>>> vector.
>>>>
>>>> The intent for code working with vectors is to use convenience methods and avoid
>>>> directly dealing with the number of elements. If needed, calling
>>>> `getElementCount` on a vector type instead of `getVectorNumElements` can be used
>>>> to obtain the (potentially scalable) number of elements. Overloaded division and
>>>> multiplication operators allow an ElementCount instance to be used in much the
>>>> same manner as an integer for most cases.
>>>>
>>>> This mixture of compile-time and runtime quantities allows us to reason about the
>>>> relationship between different scalable vector types without knowing their
>>>> exact length.
>>>>
>>>> The runtime multiple is not expected to change during program execution for SVE,
>>>> but it is possible. The model of scalable vectors presented in this RFC assumes
>>>> that the multiple will be constant within a function but not necessarily across
>>>> functions. As suggested in the recent RISC-V RFC, a new function attribute to
>>>> inherit the multiple across function calls will allow for function calls with
>>>> vector arguments/return values and inlining/outlining optimizations.
>>>>
>>>> IR Textual Form
>>>> ---------------
>>>>
>>>> The textual form for a scalable vector is:
>>>>
>>>> ``<scalable <n> x <type>>``
>>>>
>>>> where `type` is the scalar type of each element, `n` is the minimum number of
>>>> elements, and the string literal `scalable` indicates that the total number of
>>>> elements is an unknown multiple of `n`; `scalable` is just an arbitrary choice
>>>> for indicating that the vector is scalable, and could be substituted by another.
>>>> For fixed-length vectors, the `scalable` is omitted, so there is no change in
>>>> the format for existing vectors.
>>>>
>>>> Scalable vectors with the same `Min` value have the same number of elements, and
>>>> the same number of bytes if `Min * sizeof(type)` is the same (assuming they are
>>>> used within the same function):
>>>>
>>>> ``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same number of
>>>> elements.
>>>>
>>>> ``<scalable 4 x i32>`` and ``<scalable 8 x i16>`` have the same number of
>>>> bytes.
>>>>
>>>> IR Bitcode Form
>>>> ---------------
>>>>
>>>> To serialize scalable vectors to bitcode, a new boolean field is added to the
>>>> type record. If the field is not present the type will default to a fixed-length
>>>> vector type, preserving backwards compatibility.
>>>>
>>>> Alternatives Considered
>>>> -----------------------
>>>>
>>>> We did consider one main alternative -- a dedicated target type, like the
>>>> x86_mmx type.
>>>>
>>>> A dedicated target type would either need to extend all existing passes that
>>>> work with vectors to recognize the new type, or to duplicate all that code
>>>> in order to get reasonable code generation and autovectorization.
>>>>
>>>> This hasn't been done for the x86_mmx type, and so it is only capable of
>>>> providing support for C-level intrinsics instead of being used and recognized by
>>>> passes inside LLVM.
>>>>
>>>> Although our current solution will need to change some of the code that creates
>>>> new VectorTypes, much of that code doesn't need to care about whether the types
>>>> are scalable or not -- they can use preexisting methods like
>>>> `getHalfElementsVectorType`. If the code is a little more complex,
>>>> `ElementCount` structs can be used instead of an `unsigned` value to represent
>>>> the number of elements.
>>>>
>>>> ===============
>>>> 2. Size Queries
>>>> ===============
>>>>
>>>> This is a proposal for how to deal with querying the size of scalable types for
>>>> analysis of IR. While it has not been implemented in full, the general approach
>>>> works well for calculating offsets into structures with scalable types in a
>>>> modified version of ComputeValueVTs in our downstream compiler.
>>>>
>>>> For current IR types that have a known size, all query functions return a single
>>>> integer constant. For scalable types a second integer is needed to indicate the
>>>> number of bytes/bits which need to be scaled by the runtime multiple to obtain
>>>> the actual length.
>>>>
>>>> For primitive types, `getPrimitiveSizeInBits()` will function as it does today,
>>>> except that it will no longer return a size for vector types (it will return 0,
>>>> as it does for other derived types). The majority of calls to this function are
>>>> already for scalar rather than vector types.
>>>>
>>>> For derived types, a function `getScalableSizePairInBits()` will be added, which
>>>> returns a pair of integers (one to indicate unscaled bits, the other for bits
>>>> that need to be scaled by the runtime multiple). For backends that do not need
>>>> to deal with scalable types the existing methods will suffice, but a debug-only
>>>> assert will be added to them to ensure they aren't used on scalable types.
>>>>
>>>> Similar functionality will be added to DataLayout.
>>>>
>>>> Comparisons between sizes will use the following methods, assuming that X and
>>>> Y are non-zero integers and the form is of { unscaled, scaled }.
>>>>
>>>> { X, 0 } <cmp> { Y, 0 }: Normal unscaled comparison.
>>>>
>>>> { 0, X } <cmp> { 0, Y }: Normal comparison within a function, or across
>>>>                        functions that inherit vector length. Cannot be
>>>>                        compared across non-inheriting functions.
>>>>
>>>> { X, 0 } > { 0, Y }: Cannot return true.
>>>>
>>>> { X, 0 } = { 0, Y }: Cannot return true.
>>>>
>>>> { X, 0 } < { 0, Y }: Can return true.
>>>>
>>>> { Xu, Xs } <cmp> { Yu, Ys }: Gets complicated, need to subtract common
>>>>                            terms and try the above comparisons; it
>>>>                            may not be possible to get a good answer.
>>>>
>>>> It's worth noting that we don't expect the last case (mixed scaled and
>>>> unscaled sizes) to occur. Richard Sandiford's proposed C extensions
>>>> (http://lists.llvm.org/pipermail/cfe-dev/2018-May/057830.html) explicitly
>>>> prohibit mixing fixed-size types into sizeless structs.
>>>>
>>>> I don't know if we need a 'maybe' or 'unknown' result for cases comparing scaled
>>>> vs. unscaled; I believe the gcc implementation of SVE allows for such
>>>> results, but that supports a generic polynomial length representation.
>>>>
>>>> My current intention is to have the functions that clone or copy values check
>>>> whether they are being used to copy scalable vectors across function boundaries
>>>> without the inherit-vlen attribute, and raise an error there, rather than
>>>> requiring the Function a type's size belongs to to be passed in for every
>>>> comparison. If there's a strong preference for moving the check into the size
>>>> comparison functions, let me know; I will start work on patches for this later
>>>> in the year if there are no major problems with the idea.
>>>>
>>>> Future Work
>>>> -----------
>>>>
>>>> Since we cannot determine the exact size of a scalable vector, the
>>>> existing logic for alias detection won't work when multiple accesses
>>>> share a common base pointer with different offsets.
>>>>
>>>> However, SVE's predication will mean that a dynamic 'safe' vector length
>>>> can be determined at runtime, so after initial support has been added we
>>>> can work on vectorizing loops using runtime predication to avoid aliasing
>>>> problems.
>>>>
>>>> Alternatives Considered
>>>> -----------------------
>>>>
>>>> Marking scalable vectors as unsized doesn't work well, as many parts of
>>>> LLVM dealing with loads and stores assert that 'isSized()' returns true
>>>> and make use of the size when calculating offsets.
>>>>
>>>> We have considered introducing multiple helper functions instead of
>>>> using direct size queries, but that doesn't cover all cases. It may
>>>> still be a good idea to introduce them to make the purpose in a given
>>>> case more obvious, e.g. 'requiresSignExtension(Type*,Type*)'.
>>>>
>>>> ========================================
>>>> 3. Representing Vector Length at Runtime
>>>> ========================================
>>>>
>>>> With a scalable vector type defined, we now need a way to represent the runtime
>>>> length in IR in order to generate addresses for consecutive vectors in memory
>>>> and determine how many elements have been processed in an iteration of a loop.
>>>>
>>>> We have added an experimental `vscale` intrinsic to represent the runtime
>>>> multiple. Multiplying the result of this intrinsic by the minimum number of
>>>> elements in a vector gives the total number of elements in a scalable vector.
>>>>
>>>> Fixed-Length Code
>>>> -----------------
>>>>
>>>> Assuming a vector type of <4 x <ty>>
>>>> ``
>>>> vector.body:
>>>> %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>>>> ;; <loop body>
>>>> ;; Increment induction var
>>>> %index.next = add i64 %index, 4
>>>> ;; <check and branch>
>>>> ``
>>>> Scalable Equivalent
>>>> -------------------
>>>>
>>>> Assuming a vector type of <scalable 4 x <ty>>
>>>> ``
>>>> vector.body:
>>>> %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>>>> ;; <loop body>
>>>> ;; Increment induction var
>>>> %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>>>> %vscale64.x4 = mul i64 %vscale64, 4
>>>> %index.next = add i64 %index, %vscale64.x4
>>>> ;; <check and branch>
>>>> ``
>>>> ===========================
>>>> 4. Generating Vector Values
>>>> ===========================
>>>> For constant vector values, we cannot specify all the elements as we can for
>>>> fixed-length vectors; fortunately only a small number of easily synthesized
>>>> patterns are required for autovectorization. The `zeroinitializer` constant
>>>> can be used in the same manner as fixed-length vectors for a constant zero
>>>> splat. This can then be combined with `insertelement` and `shufflevector`
>>>> to create arbitrary value splats in the same manner as fixed-length vectors.
>>>>
>>>> For constants consisting of a sequence of values, an experimental `stepvector`
>>>> intrinsic has been added to represent a simple constant of the form
>>>> `<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new
>>>> start can be added, and changing the step requires multiplying by a splat.
>>>>
>>>> Fixed-Length Code
>>>> -----------------
>>>> ``
>>>> ;; Splat a value
>>>> %insert = insertelement <4 x i32> undef, i32 %value, i32 0
>>>> %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x i32> zeroinitializer
>>>> ;; Add a constant sequence
>>>> %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
>>>> ``
>>>> Scalable Equivalent
>>>> -------------------
>>>> ``
>>>> ;; Splat a value
>>>> %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
>>>> %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>>>> ;; Splat offset + stride (the same in this case)
>>>> %insert2 = insertelement <scalable 4 x i32> undef, i32 2, i32 0
>>>> %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>>>> ;; Create sequence for scalable vector
>>>> %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
>>>> %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
>>>> %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
>>>> ;; Add the runtime-generated sequence
>>>> %add = add <scalable 4 x i32> %splat, %addoffset
>>>> ``
>>>> Future Work
>>>> -----------
>>>>
>>>> Intrinsics cannot currently be used for constant folding. Our downstream
>>>> compiler (using Constants instead of intrinsics) relies quite heavily on this
>>>> for good code generation, so we will need to find new ways to recognize and
>>>> fold these values.
>>>>
>>>> ===========================================
>>>> 5. Splitting and Combining Scalable Vectors
>>>> ===========================================
>>>>
>>>> Splitting and combining scalable vectors in IR is done in the same manner as
>>>> for fixed-length vectors, but with a non-constant mask for the shufflevector.
>>>>
>>>> The following is an example of splitting a <scalable 4 x double> into two
>>>> separate <scalable 2 x double> values.
>>>>
>>>> ``
>>>> %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>>>> ;; Stepvector generates the element ids for first subvector
>>>> %sv1 = call <scalable 2 x i64> @llvm.experimental.vector.stepvector.nxv2i64()
>>>> ;; Add vscale * 2 to get the starting element for the second subvector
>>>> %ec = mul i64 %vscale64, 2
>>>> %ec.ins = insertelement <scalable 2 x i64> undef, i64 %ec, i32 0
>>>> %ec.splat = shufflevector <scalable 2 x i64> %ec.ins, <scalable 2 x i64> undef, <scalable 2 x i32> zeroinitializer
>>>> %sv2 = add <scalable 2 x i64> %ec.splat, %sv1
>>>> ;; Perform the extracts
>>>> %res1 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv1
>>>> %res2 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv2
>>>> ``
>>>>
>>>> ==================
>>>> 6. Code Generation
>>>> ==================
>>>>
>>>> IR splats will be converted to an experimental splatvector intrinsic in
>>>> SelectionDAGBuilder.
>>>>
>>>> All three intrinsics are custom lowered and legalized in the AArch64 backend.
>>>>
>>>> Two new AArch64ISD nodes have been added to represent the same concepts
>>>> at the SelectionDAG level, while splatvector maps onto the existing
>>>> AArch64ISD::DUP.
>>>>
>>>> GlobalISel
>>>> ----------
>>>>
>>>> Since GlobalISel was enabled by default on AArch64, it was necessary to add
>>>> scalable vector support to the LowLevelType implementation. A single bit was
>>>> added to the raw_data representation for vectors and vectors of pointers.
>>>>
>>>> In addition, types that only exist in destination patterns are planted in
>>>> the enumeration of available types for generated code. While this may not be
>>>> necessary in future, generating an all-true 'ptrue' value was necessary to
>>>> convert a predicated instruction into an unpredicated one.
>>>>
>>>> ==========
>>>> 7. Example
>>>> ==========
>>>>
>>>> The following example shows a simple C loop which assigns the array index to
>>>> the array elements matching that index. The IR shows how vscale and stepvector
>>>> are used to create the needed values and to advance the index variable in the
>>>> loop.
>>>>
>>>> C Code
>>>> ------
>>>>
>>>> ``
>>>> void IdentityArrayInit(int *a, int count) {
>>>> for (int i = 0; i < count; ++i)
>>>>   a[i] = i;
>>>> }
>>>> ``
>>>>
>>>> Scalable IR Vector Body
>>>> -----------------------
>>>>
>>>> ``
>>>> vector.body.preheader:
>>>> ;; Other setup
>>>> ;; Stepvector used to create initial identity vector
>>>> %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
>>>> br label %vector.body
>>>>
>>>> vector.body:
>>>> %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>>>> %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]
>>>>
>>>>          ;; stepvector used for index identity on entry to loop body ;;
>>>> %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],
>>>>                                    [ %stepvector, %vector.body.preheader ]
>>>> %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>>>> %vscale32 = trunc i64 %vscale64 to i32
>>>> %vscale64.x4 = mul i64 %vscale64, 4
>>>> %1 = add i64 %0, %vscale64.x4
>>>>
>>>>          ;; vscale splat used to increment identity vector ;;
>>>> %vscale32.x4 = mul i32 %vscale32, 4
>>>> %insert = insertelement <scalable 4 x i32> undef, i32 %vscale32.x4, i32 0
>>>> %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>>>> %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
>>>> %2 = getelementptr inbounds i32, i32* %a, i64 %0
>>>> %3 = bitcast i32* %2 to <scalable 4 x i32>*
>>>> store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align 4
>>>>
>>>>          ;; vscale used to increment loop index ;;
>>>> %index.inc = mul i64 %vscale64, 4
>>>> %index.next = add i64 %index, %index.inc
>>>> %4 = icmp eq i64 %index.next, %n.vec
>>>> br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
>>>> ``
>>>>
>>>> ==========
>>>> 8. Patches
>>>> ==========
>>>>
>>>> List of patches:
>>>>
>>>> 1. Extend VectorType: https://reviews.llvm.org/D32530
>>>> 2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768
>>>> 3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
>>>> 4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
>>>> 5. SVE Calling Convention: https://reviews.llvm.org/D47771
>>>> 6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
>>>> 7. Add VScale intrinsic: https://reviews.llvm.org/D47773
>>>> 8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
>>>> 9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
>>>> 10. Initial store patterns: https://reviews.llvm.org/D47776
>>>> 11. Initial addition patterns: https://reviews.llvm.org/D47777
>>>> 12. Initial left-shift patterns: https://reviews.llvm.org/D47778
>>>> 13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
>>>> 14. Prevectorized loop unit test: https://reviews.llvm.org/D47780
>>>>
>>>
>>
>> --
>> Hal Finkel
>> Lead, Compiler Technology and Programming Languages
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>
> _______________________________________________
> LLVM Developers mailing list
> [hidden email]
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev


Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev

On 08/03/2018 05:31 AM, Graham Hunter wrote:
> Hi,
>
> Just a quick question about bitcode format changes; is there anything special I should be doing for that beyond ensuring the reader can still process older bitcode files correctly?
>
> The code in the main patch will always emit 3 records for a vector type (as opposed to the current 2), but we could omit the third field for fixed-length vectors if that's preferable.

Any reason not to omit the field? This can affect object-file size when
using LTO, etc.

 -Hal

>
> -Graham
>
>> On 1 Aug 2018, at 20:00, Graham Hunter via llvm-dev <[hidden email]> wrote:
>>
>> Hi Hal,
>>
>>> On 30 Jul 2018, at 20:10, Hal Finkel <[hidden email]> wrote:
>>>
>>>
>>> On 07/30/2018 05:34 AM, Chandler Carruth wrote:
>>>> I strongly suspect that there remains widespread concern with the direction of this, I know I have them.
>>>>
>>>> I don't think that many of the people who have that concern have had time to come back to this RFC and make progress on it, likely because of other commitments or simply the amount of churn around SVE related patches and such. That is at least why I haven't had time to return to this RFC and try to write more detailed feedback.
>>>>
>>>> Certainly, I would want to see pretty clear and considered support for this change to the IR type system from Hal, Chris, Eric and/or other long time maintainers of core LLVM IR components before it moves forward, and I don't see that in this thread.
>>> At a high level, I'm happy with this approach. I think it will be important for LLVM to support runtime-determined vector lengths - I see the customizability and power-efficiency constraints that motivate these designs continuing to increase in importance. I'm still undecided on whether this makes vector code nicer even for fixed-vector-length architectures, but some of the design decisions that it forces, such as having explicit intrinsics for reductions and other horizontal operations, seem like the right direction regardless.
>> Thanks, that's good to hear.
>>
>>> 1.
>>>> This is a proposal for how to deal with querying the size of scalable types for
>>>>> analysis of IR. While it has not been implemented in full,
>>> Is this still true? The details here need to all work out, obviously, and we should make sure that any issues are identified.
>> Yes. I had hoped to get some more comments on the basic approach before progressing with the implementation, but if it makes more sense to have the implementation available to discuss then I'll start creating patches.
>>
>>> 2. I know that there has been some discussion around support for changing the vector length during program execution (e.g., to account for some (proposed?) RISC-V feature), perhaps even during the execution of a single function. I'm very concerned about this idea because it is not at all clear to me how to limit information transfer contaminated with the vector size from propagating between different regions. As a result, I'm concerned about trying to add this on later, and so if this is part of the plan, I think that we need to think through the details up front because it could have a major impact on the design.
>> I think Robin's email yesterday covered it fairly nicely; this RFC proposes that the hardware length of vectors will be consistent throughout an entire function, so we don't need to limit information inside a function, just between them. For SVE, h/w vector length will likely be consistent across the whole program as well (assuming the programmer doesn't make a prctl call to the kernel to change it) so we could drop that limit too, but I thought it best to come up with a unified approach that would work for both architectures. The 'inherits_vscale' attribute would allow us to continue optimizing across functions for SVE where desired.
>>
>> Modelling the dynamic vector length for RVV is something for Robin (or others) to tackle later, but can be thought of (at a high level) as an implicit predicate on all operations.
>>
>> -Graham
>>
>>> Thanks again,
>>> Hal
>>>
>>>> Put differently: I don't think silence is assent here. You really need some clear signal of consensus.
>>>>
>>>> On Mon, Jul 30, 2018 at 2:23 AM Graham Hunter <[hidden email]> wrote:
>>>> Hi,
>>>>
>>>> Are there any objections to going ahead with this? If not, we'll try to get the patches reviewed and committed after the 7.0 branch occurs.
>>>>
>>>> -Graham
>>>>
>>>>> On 2 Jul 2018, at 10:53, Graham Hunter <[hidden email]> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I've updated the RFC slightly based on the discussion within the thread, reposted below. Let me know if I've missed anything or if more clarification is needed.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> -Graham
>>>>>
>>>>> =============================================================
>>>>> Supporting SIMD instruction sets with variable vector lengths
>>>>> =============================================================
>>>>>
>>>>> In this RFC we propose extending LLVM IR to support code-generation for variable
>>>>> length vector architectures like Arm's SVE or RISC-V's 'V' extension. Our
>>>>> approach is backwards compatible and should be as non-intrusive as possible; the
>>>>> only change needed in other backends is how size is queried on vector types, and
>>>>> it only requires a change in which function is called. We have created a set of
>>>>> proof-of-concept patches to represent a simple vectorized loop in IR and
>>>>> generate SVE instructions from that IR. These patches (listed in section 7 of
>>>>> this rfc) can be found on Phabricator and are intended to illustrate the scope
>>>>> of changes required by the general approach described in this RFC.
>>>>>
>>>>> ==========
>>>>> Background
>>>>> ==========
>>>>>
>>>>> *ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension for
>>>>> AArch64 which is intended to scale with hardware such that the same binary
>>>>> running on a processor with longer vector registers can take advantage of the
>>>>> increased compute power without recompilation.
>>>>>
>>>>> As the vector length is no longer a compile-time known value, the way in which
>>>>> the LLVM vectorizer generates code requires modifications such that certain
>>>>> values are now runtime evaluated expressions instead of compile-time constants.
>>>>>
>>>>> Documentation for SVE can be found at
>>>>> https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
>>>>>
>>>>> ========
>>>>> Contents
>>>>> ========
>>>>>
>>>>> The rest of this RFC covers the following topics:
>>>>>
>>>>> 1. Types -- a proposal to extend VectorType to be able to represent vectors that
>>>>>  have a length which is a runtime-determined multiple of a known base length.
>>>>>
>>>>> 2. Size Queries - how to reason about the size of types for which the size isn't
>>>>>  fully known at compile time.
>>>>>
>>>>> 3. Representing the runtime multiple of vector length in IR for use in address
>>>>>  calculations and induction variable comparisons.
>>>>>
>>>>> 4. Generating 'constant' values in IR for vectors with a runtime-determined
>>>>>  number of elements.
>>>>>
>>>>> 5. An explanation of splitting/concatentating scalable vectors.
>>>>>
>>>>> 6. A brief note on code generation of these new operations for AArch64.
>>>>>
>>>>> 7. An example of C code and matching IR using the proposed extensions.
>>>>>
>>>>> 8. A list of patches demonstrating the changes required to emit SVE instructions
>>>>>  for a loop that has already been vectorized using the extensions described
>>>>>  in this RFC.
>>>>>
>>>>> ========
>>>>> 1. Types
>>>>> ========
>>>>>
>>>>> To represent a vector of unknown length a boolean `Scalable` property has been
>>>>> added to the `VectorType` class, which indicates that the number of elements in
>>>>> the vector is a runtime-determined integer multiple of the `NumElements` field.
>>>>> Most code that deals with vectors doesn't need to know the exact length, but
>>>>> does need to know relative lengths -- e.g. get a vector with the same number of
>>>>> elements but a different element type, or with half or double the number of
>>>>> elements.
>>>>>
>>>>> In order to allow code to transparently support scalable vectors, we introduce
>>>>> an `ElementCount` class with two members:
>>>>>
>>>>> - `unsigned Min`: the minimum number of elements.
>>>>> - `bool Scalable`: is the element count an unknown multiple of `Min`?
>>>>>
>>>>> For non-scalable vectors (``Scalable=false``) the scale is considered to be
>>>>> equal to one and thus `Min` represents the exact number of elements in the
>>>>> vector.
>>>>>
>>>>> The intent for code working with vectors is to use convenience methods and avoid
>>>>> directly dealing with the number of elements. If needed, calling
>>>>> `getElementCount` on a vector type instead of `getVectorNumElements` can be used
>>>>> to obtain the (potentially scalable) number of elements. Overloaded division and
>>>>> multiplication operators allow an ElementCount instance to be used in much the
>>>>> same manner as an integer for most cases.
>>>>>
>>>>> This mixture of compile-time and runtime quantities allow us to reason about the
>>>>> relationship between different scalable vector types without knowing their
>>>>> exact length.
>>>>>
>>>>> The runtime multiple is not expected to change during program execution for SVE,
>>>>> but it is possible. The model of scalable vectors presented in this RFC assumes
>>>>> that the multiple will be constant within a function but not necessarily across
>>>>> functions. As suggested in the recent RISC-V rfc, a new function attribute to
>>>>> inherit the multiple across function calls will allow for function calls with
>>>>> vector arguments/return values and inlining/outlining optimizations.
>>>>>
>>>>> IR Textual Form
>>>>> ---------------
>>>>>
>>>>> The textual form for a scalable vector is:
>>>>>
>>>>> ``<scalable <n> x <type>>``
>>>>>
>>>>> where `type` is the scalar type of each element, `n` is the minimum number of
>>>>> elements, and the string literal `scalable` indicates that the total number of
>>>>> elements is an unknown multiple of `n`; `scalable` is just an arbitrary choice
>>>>> for indicating that the vector is scalable, and could be substituted by another.
>>>>> For fixed-length vectors, `scalable` is omitted, so there is no change in
>>>>> the format for existing vectors.
>>>>>
>>>>> Scalable vectors with the same `Min` value have the same number of elements, and
>>>>> the same number of bytes if `Min * sizeof(type)` is the same (assuming they are
>>>>> used within the same function):
>>>>>
>>>>> ``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same number of
>>>>> elements.
>>>>>
>>>>> ``<scalable 4 x i32>`` and ``<scalable 8 x i16>`` have the same number of
>>>>> bytes.
>>>>>
>>>>> IR Bitcode Form
>>>>> ---------------
>>>>>
>>>>> To serialize scalable vectors to bitcode, a new boolean field is added to the
>>>>> type record. If the field is not present the type will default to a fixed-length
>>>>> vector type, preserving backwards compatibility.
>>>>>
>>>>> Alternatives Considered
>>>>> -----------------------
>>>>>
>>>>> We did consider one main alternative -- a dedicated target type, like the
>>>>> x86_mmx type.
>>>>>
>>>>> A dedicated target type would either need to extend all existing passes that
>>>>> work with vectors to recognize the new type, or to duplicate all that code
>>>>> in order to get reasonable code generation and autovectorization.
>>>>>
>>>>> This hasn't been done for the x86_mmx type, and so it is only capable of
>>>>> providing support for C-level intrinsics instead of being used and recognized by
>>>>> passes inside llvm.
>>>>>
>>>>> Although our current solution will need to change some of the code that creates
>>>>> new VectorTypes, much of that code doesn't need to care about whether the types
>>>>> are scalable or not -- they can use preexisting methods like
>>>>> `getHalfElementsVectorType`. If the code is a little more complex,
>>>>> `ElementCount` structs can be used instead of an `unsigned` value to represent
>>>>> the number of elements.
>>>>>
>>>>> ===============
>>>>> 2. Size Queries
>>>>> ===============
>>>>>
>>>>> This is a proposal for how to deal with querying the size of scalable types for
>>>>> analysis of IR. While it has not been implemented in full, the general approach
>>>>> works well for calculating offsets into structures with scalable types in a
>>>>> modified version of ComputeValueVTs in our downstream compiler.
>>>>>
>>>>> For current IR types that have a known size, all query functions return a single
>>>>> integer constant. For scalable types a second integer is needed to indicate the
>>>>> number of bytes/bits which need to be scaled by the runtime multiple to obtain
>>>>> the actual length.
>>>>>
>>>>> For primitive types, `getPrimitiveSizeInBits()` will function as it does today,
>>>>> except that it will no longer return a size for vector types (it will return 0,
>>>>> as it does for other derived types). The majority of calls to this function are
>>>>> already for scalar rather than vector types.
>>>>>
>>>>> For derived types, a function `getScalableSizePairInBits()` will be added, which
>>>>> returns a pair of integers (one to indicate unscaled bits, the other for bits
>>>>> that need to be scaled by the runtime multiple). For backends that do not need
>>>>> to deal with scalable types the existing methods will suffice, but a debug-only
>>>>> assert will be added to them to ensure they aren't used on scalable types.
>>>>>
>>>>> Similar functionality will be added to DataLayout.
>>>>>
>>>>> Comparisons between sizes will use the following methods, assuming that X and
>>>>> Y are non-zero integers and the form is of { unscaled, scaled }.
>>>>>
>>>>> { X, 0 } <cmp> { Y, 0 }: Normal unscaled comparison.
>>>>>
>>>>> { 0, X } <cmp> { 0, Y }: Normal comparison within a function, or across
>>>>>                        functions that inherit vector length. Cannot be
>>>>>                        compared across non-inheriting functions.
>>>>>
>>>>> { X, 0 } > { 0, Y }: Cannot return true.
>>>>>
>>>>> { X, 0 } = { 0, Y }: Cannot return true.
>>>>>
>>>>> { X, 0 } < { 0, Y }: Can return true.
>>>>>
>>>>> { Xu, Xs } <cmp> { Yu, Ys }: Gets complicated, need to subtract common
>>>>>                            terms and try the above comparisons; it
>>>>>                            may not be possible to get a good answer.
>>>>>
>>>>> It's worth noting that we don't expect the last case (mixed scaled and
>>>>> unscaled sizes) to occur. Richard Sandiford's proposed C extensions
>>>>> (http://lists.llvm.org/pipermail/cfe-dev/2018-May/057830.html) explicitly
>>>>> prohibit mixing fixed-size types into sizeless structs.
>>>>>
>>>>> I don't know if we need a 'maybe' or 'unknown' result for cases comparing scaled
>>>>> vs. unscaled; I believe the gcc implementation of SVE allows for such
>>>>> results, but that supports a generic polynomial length representation.
>>>>>
>>>>> My current intention is to rely on functions that clone or copy values to
>>>>> check whether they are being used to copy scalable vectors across function
>>>>> boundaries without the inherit-vlen attribute, and to raise an error there
>>>>> instead of requiring the Function a given type size came from to be passed
>>>>> in for each comparison. If there's a strong preference for moving the check
>>>>> into the size comparison functions, let me know; I will start work on patches
>>>>> for this later in the year if there are no major problems with the idea.
>>>>>
>>>>> Future Work
>>>>> -----------
>>>>>
>>>>> Since we cannot determine the exact size of a scalable vector, the
>>>>> existing logic for alias detection won't work when multiple accesses
>>>>> share a common base pointer with different offsets.
>>>>>
>>>>> However, SVE's predication will mean that a dynamic 'safe' vector length
>>>>> can be determined at runtime, so after initial support has been added we
>>>>> can work on vectorizing loops using runtime predication to avoid aliasing
>>>>> problems.
>>>>>
>>>>> Alternatives Considered
>>>>> -----------------------
>>>>>
>>>>> Marking scalable vectors as unsized doesn't work well, as many parts of
>>>>> llvm dealing with loads and stores assert that 'isSized()' returns true
>>>>> and make use of the size when calculating offsets.
>>>>>
>>>>> We have considered introducing multiple helper functions instead of
>>>>> using direct size queries, but that doesn't cover all cases. It may
>>>>> still be a good idea to introduce them to make the purpose in a given
>>>>> case more obvious, e.g. 'requiresSignExtension(Type*,Type*)'.
>>>>>
>>>>> ========================================
>>>>> 3. Representing Vector Length at Runtime
>>>>> ========================================
>>>>>
>>>>> With a scalable vector type defined, we now need a way to represent the runtime
>>>>> length in IR in order to generate addresses for consecutive vectors in memory
>>>>> and determine how many elements have been processed in an iteration of a loop.
>>>>>
>>>>> We have added an experimental `vscale` intrinsic to represent the runtime
>>>>> multiple. Multiplying the result of this intrinsic by the minimum number of
>>>>> elements in a vector gives the total number of elements in a scalable vector.
>>>>>
>>>>> Fixed-Length Code
>>>>> -----------------
>>>>>
>>>>> Assuming a vector type of <4 x <ty>>
>>>>> ``
>>>>> vector.body:
>>>>> %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>>>>> ;; <loop body>
>>>>> ;; Increment induction var
>>>>> %index.next = add i64 %index, 4
>>>>> ;; <check and branch>
>>>>> ``
>>>>> Scalable Equivalent
>>>>> -------------------
>>>>>
>>>>> Assuming a vector type of <scalable 4 x <ty>>
>>>>> ``
>>>>> vector.body:
>>>>> %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>>>>> ;; <loop body>
>>>>> ;; Increment induction var
>>>>> %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>>>>> %vscale64.x4 = mul i64 %vscale64, 4
>>>>> %index.next = add i64 %index, %vscale64.x4
>>>>> ;; <check and branch>
>>>>> ``
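
The induction-variable arithmetic above can be modelled in scalar code. The following C++ sketch (purely illustrative, not part of the proposal) shows the index advancing by `vscale * 4` elements per iteration, whatever `vscale` turns out to be at runtime:

```cpp
#include <cstddef>

// Model of the scalable loop above: each vector iteration processes
// vscale * 4 elements, where vscale is only known at runtime.
// Returns the number of full vector iterations for n elements; the
// remainder would be handled by a scalar tail or by predication.
std::size_t vectorIterations(std::size_t n, std::size_t vscale) {
  std::size_t index = 0, iters = 0;
  while (index + vscale * 4 <= n) {
    index += vscale * 4;  // %index.next = add i64 %index, (vscale * 4)
    ++iters;
  }
  return iters;
}
```

The same IR thus serves a 128-bit implementation (`vscale == 1`, 256 iterations for 1024 i32 elements) and a 512-bit one (`vscale == 4`, 64 iterations) without recompilation.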
>>>>> ===========================
>>>>> 4. Generating Vector Values
>>>>> ===========================
>>>>> For constant vector values, we cannot specify all the elements as we can for
>>>>> fixed-length vectors; fortunately only a small number of easily synthesized
>>>>> patterns are required for autovectorization. The `zeroinitializer` constant
>>>>> can be used in the same manner as fixed-length vectors for a constant zero
>>>>> splat. This can then be combined with `insertelement` and `shufflevector`
>>>>> to create arbitrary value splats in the same manner as fixed-length vectors.
>>>>>
>>>>> For constants consisting of a sequence of values, an experimental `stepvector`
>>>>> intrinsic has been added to represent a simple constant of the form
>>>>> `<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new
>>>>> start can be added, and changing the step requires multiplying by a splat.
>>>>>
>>>>> Fixed-Length Code
>>>>> -----------------
>>>>> ``
>>>>> ;; Splat a value
>>>>> %insert = insertelement <4 x i32> undef, i32 %value, i32 0
>>>>> %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x i32> zeroinitializer
>>>>> ;; Add a constant sequence
>>>>> %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
>>>>> ``
>>>>> Scalable Equivalent
>>>>> -------------------
>>>>> ``
>>>>> ;; Splat a value
>>>>> %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
>>>>> %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>>>>> ;; Splat offset + stride (the same in this case)
>>>>> %insert2 = insertelement <scalable 4 x i32> undef, i32 2, i32 0
>>>>> %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>>>>> ;; Create sequence for scalable vector
>>>>> %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
>>>>> %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
>>>>> %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
>>>>> ;; Add the runtime-generated sequence
>>>>> %add = add <scalable 4 x i32> %splat, %addoffset
>>>>> ``
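
A scalar model of that recipe (illustrative C++, not the proposed IR) makes the algebra explicit: the result is `splat(start) + stepvector() * splat(stride)`, evaluated lane by lane:

```cpp
#include <cstddef>
#include <vector>

// Model of building <start, start+stride, start+2*stride, ...> from the
// stepvector intrinsic: splat(start) + stepvector() * splat(stride).
// numElems stands in for vscale * Min, which is only known at runtime.
std::vector<int> sequence(int start, int stride, std::size_t numElems) {
  std::vector<int> v(numElems);
  for (std::size_t i = 0; i < numElems; ++i)
    v[i] = start + static_cast<int>(i) * stride;  // stepvector[i] == i
  return v;
}
```

For `vscale == 1` and `Min == 4` this reproduces the fixed-length constant `<i32 2, i32 4, i32 6, i32 8>` from the earlier example.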
>>>>> Future Work
>>>>> -----------
>>>>>
>>>>> Intrinsics cannot currently be used for constant folding. Our downstream
>>>>> compiler (using Constants instead of intrinsics) relies quite heavily on this
>>>>> for good code generation, so we will need to find new ways to recognize and
>>>>> fold these values.
>>>>>
>>>>> ===========================================
>>>>> 5. Splitting and Combining Scalable Vectors
>>>>> ===========================================
>>>>>
>>>>> Splitting and combining scalable vectors in IR is done in the same manner as
>>>>> for fixed-length vectors, but with a non-constant mask for the shufflevector.
>>>>>
>>>>> The following is an example of splitting a <scalable 4 x double> into two
>>>>> separate <scalable 2 x double> values.
>>>>>
>>>>> ``
>>>>> %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>>>>> ;; Stepvector generates the element ids for first subvector
>>>>> %sv1 = call <scalable 2 x i64> @llvm.experimental.vector.stepvector.nxv2i64()
>>>>> ;; Add vscale * 2 to get the starting element for the second subvector
>>>>> %ec = mul i64 %vscale64, 2
>>>>> %ec.ins = insertelement <scalable 2 x i64> undef, i64 %ec, i32 0
>>>>> %ec.splat = shufflevector <scalable 2 x i64> %ec.ins, <scalable 2 x i64> undef, <scalable 2 x i32> zeroinitializer
>>>>> %sv2 = add <scalable 2 x i64> %ec.splat, %sv1
>>>>> ;; Perform the extracts
>>>>> %res1 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv1
>>>>> %res2 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv2
>>>>> ``
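
The index arithmetic in that example can be modelled in scalar code (illustrative only): the first extract uses indices `0 .. 2*vscale-1`, and the second uses the same indices shifted up by `2 * vscale`:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Model of splitting a <scalable 4 x double> into two <scalable 2 x double>
// halves: the shuffle masks are stepvector() and stepvector() + 2 * vscale.
std::pair<std::vector<double>, std::vector<double>>
splitVector(const std::vector<double> &in, std::size_t vscale) {
  std::size_t half = 2 * vscale;  // elements in one <scalable 2 x double>
  std::vector<double> lo, hi;
  for (std::size_t i = 0; i < half; ++i) {
    lo.push_back(in[i]);         // %sv1[i] == i
    hi.push_back(in[half + i]);  // %sv2[i] == 2 * vscale + i
  }
  return {lo, hi};
}
```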
>>>>>
>>>>> ==================
>>>>> 6. Code Generation
>>>>> ==================
>>>>>
>>>>> IR splats will be converted to an experimental splatvector intrinsic in
>>>>> SelectionDAGBuilder.
>>>>>
>>>>> All three intrinsics are custom lowered and legalized in the AArch64 backend.
>>>>>
>>>>> Two new AArch64ISD nodes have been added to represent the same concepts
>>>>> at the SelectionDAG level, while splatvector maps onto the existing
>>>>> AArch64ISD::DUP.
>>>>>
>>>>> GlobalISel
>>>>> ----------
>>>>>
>>>>> Since GlobalISel was enabled by default on AArch64, it was necessary to add
>>>>> scalable vector support to the LowLevelType implementation. A single bit was
>>>>> added to the raw_data representation for vectors and vectors of pointers.
>>>>>
>>>>> In addition, types that only exist in destination patterns are planted in
>>>>> the enumeration of available types for generated code. While this may not be
>>>>> necessary in future, generating an all-true 'ptrue' value was necessary to
>>>>> convert a predicated instruction into an unpredicated one.
>>>>>
>>>>> ==========
>>>>> 7. Example
>>>>> ==========
>>>>>
>>>>> The following example shows a simple C loop which assigns the array index to
>>>>> the array elements matching that index. The IR shows how vscale and stepvector
>>>>> are used to create the needed values and to advance the index variable in the
>>>>> loop.
>>>>>
>>>>> C Code
>>>>> ------
>>>>>
>>>>> ``
>>>>> void IdentityArrayInit(int *a, int count) {
>>>>> for (int i = 0; i < count; ++i)
>>>>>   a[i] = i;
>>>>> }
>>>>> ``
>>>>>
>>>>> Scalable IR Vector Body
>>>>> -----------------------
>>>>>
>>>>> ``
>>>>> vector.body.preheader:
>>>>> ;; Other setup
>>>>> ;; Stepvector used to create initial identity vector
>>>>> %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
>>>>> br label %vector.body
>>>>>
>>>>> vector.body:
>>>>> %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>>>>> %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]
>>>>>
>>>>>          ;; stepvector used for index identity on entry to loop body ;;
>>>>> %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],
>>>>>                                    [ %stepvector, %vector.body.preheader ]
>>>>> %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>>>>> %vscale32 = trunc i64 %vscale64 to i32
>>>>> %vl64 = mul i64 %vscale64, 4
>>>>> %1 = add i64 %0, %vl64
>>>>>
>>>>>          ;; vscale splat used to increment identity vector ;;
>>>>> %vscale32.x4 = mul i32 %vscale32, 4
>>>>> %insert = insertelement <scalable 4 x i32> undef, i32 %vscale32.x4, i32 0
>>>>> %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>>>>> %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
>>>>> %2 = getelementptr inbounds i32, i32* %a, i64 %0
>>>>> %3 = bitcast i32* %2 to <scalable 4 x i32>*
>>>>> store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align 4
>>>>>
>>>>>          ;; vscale used to increment loop index
>>>>> %vl.step = mul i64 %vscale64, 4
>>>>> %index.next = add i64 %index, %vl.step
>>>>> %4 = icmp eq i64 %index.next, %n.vec
>>>>> br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
>>>>> ``
>>>>>
>>>>> ==========
>>>>> 8. Patches
>>>>> ==========
>>>>>
>>>>> List of patches:
>>>>>
>>>>> 1. Extend VectorType: https://reviews.llvm.org/D32530
>>>>> 2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768
>>>>> 3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
>>>>> 4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
>>>>> 5. SVE Calling Convention: https://reviews.llvm.org/D47771
>>>>> 6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
>>>>> 7. Add VScale intrinsic: https://reviews.llvm.org/D47773
>>>>> 8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
>>>>> 9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
>>>>> 10. Initial store patterns: https://reviews.llvm.org/D47776
>>>>> 11. Initial addition patterns: https://reviews.llvm.org/D47777
>>>>> 12. Initial left-shift patterns: https://reviews.llvm.org/D47778
>>>>> 13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
>>>>> 14. Prevectorized loop unit test: https://reviews.llvm.org/D47780
>>>>>
>>> --
>>> Hal Finkel
>>> Lead, Compiler Technology and Programming Languages
>>> Leadership Computing Facility
>>> Argonne National Laboratory
>>>
>> _______________________________________________
>> LLVM Developers mailing list
>> [hidden email]
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev
In reply to this post by Nicholas Krause via llvm-dev
Hi,

Sorry for the delay, but I now have an initial implementation of size queries
for scalable types on phabricator:

https://reviews.llvm.org/D53137 and https://reviews.llvm.org/D53138

This isn't complete (I haven't used the DataLayout version outside of the tests),
but I'd like to get some feedback before making further changes.

Some notes/questions:

  1. I've changed Type::getPrimitiveSizeInBits to return 0 for vector types
     and changed all uses of it with vector types to use
     Type::getScalableSizeInBits instead, following the design in the RFC.

     While this shows where getPrimitiveSizeInBits is used with vector types,
     I think it would be better for now to change it back to avoid breaking
     existing targets and put an assert in it to ensure that it's only used
     on non-scalable vectors. We can revisit the decision later once
     scalable vector support is more mature. Thoughts?

  2. There are two implementations of ScalableSize; one in Type.h, and one
     in DataLayout.h. I'd prefer to only have one, but the former reports
     sizes as 32bits while the latter uses 64bits.

     I think changing all size queries to use 64bits is the best way to
     resolve it -- are there any significant problems with that approach
     aside from lots of code churn?

     It would also be possible to use templates and typedefs, but I figure
     unifying size reporting would be better.

  3. I have only implemented 'strict' comparisons for now, which leads to
     some possibly-surprising results; {X, 0} compared with {0, X} will
     return false for both '==' and '<' comparisons, but true for '<='.

     I think that supporting 'maybe' results from overloaded operators
     would be a bad idea, so if/when we find cases where they are needed
     then I think new functions should be written to cover those cases
     and only used where it matters. For simple things like stopping
     casts between scalable and non-scalable vectors the strict
     comparisons should suffice.
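
   If it helps, the strict semantics can be sketched like this (hypothetical
   code, not the patch itself), treating "true" as "holds for every
   vscale >= 1", with a size represented as Unscaled + Scaled * vscale:

```cpp
#include <cstdint>

// Hypothetical sketch of strict size comparisons of the form
// { Unscaled, Scaled }: real size = Unscaled + Scaled * vscale.
// Each operator returns true only if the relation holds for every
// possible vscale >= 1; otherwise it conservatively returns false.
struct ScalableSize {
  std::uint64_t Unscaled, Scaled;

  bool operator==(const ScalableSize &O) const {
    return Unscaled == O.Unscaled && Scaled == O.Scaled;
  }
  bool operator<=(const ScalableSize &O) const {
    // If the scaled term doesn't grow faster on the left-hand side,
    // the worst case is vscale == 1.
    return Scaled <= O.Scaled && Unscaled + Scaled <= O.Unscaled + O.Scaled;
  }
  bool operator<(const ScalableSize &O) const {
    return Scaled <= O.Scaled && Unscaled + Scaled < O.Unscaled + O.Scaled;
  }
};
```

   This reproduces the possibly-surprising results mentioned above:
   {X, 0} and {0, X} coincide at vscale == 1 but diverge for larger
   values, so '==' and '<' are both false while '<=' is true.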

  4. Alignment for structs containing scalable types is tricky. For now,
     I've added an assert to force all structs containing scalable vectors
     to be packed.

     It won't be possible to calculate correct offsets at compile time if
     the minimum size of a struct member isn't a multiple of the required
     alignment for the subsequent element(s).

     Instead, a runtime calculation will be required. This could arise in
     SVE if a predicate register (minimum 2 bytes) were used followed by
     an aligned data vector -- it could be aligned, but it could also
     require adding up to 14 bytes of padding to reach minimum alignment
     for data vectors.

     The proposed ACLE does allow creating sizeless structs with both
     predicate and data registers so we can't forbid such structs, but it
     makes no guarantees about alignment -- it's implementation defined.

     Do any of the other architectures with scalable vectors have any
     particular requirements for this?
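
   For concreteness, the runtime offset calculation could look like the
   sketch below, under the assumption of a predicate occupying
   2 * vscale bytes followed by a data vector requiring 16-byte (minimum
   data vector size) alignment; the numbers are illustrative, not ABI
   guarantees:

```cpp
#include <cstdint>

// Round x up to the next multiple of align (align a power of two).
std::uint64_t alignUp(std::uint64_t x, std::uint64_t align) {
  return (x + align - 1) & ~(align - 1);
}

// Byte offset of the data vector within a sizeless struct
// { predicate, data } for a given runtime vscale; the padding inserted
// is the offset minus the predicate's 2 * vscale bytes.
std::uint64_t dataOffset(std::uint64_t vscale) {
  std::uint64_t predBytes = 2 * vscale;
  return alignUp(predBytes, 16);
}
```

   At vscale == 1 the 2-byte predicate needs the full 14 bytes of
   padding mentioned above; at vscale == 8 the 16-byte predicate needs
   none, which is why the offset can't be a compile-time constant.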

  5. The comparison operators contain all cases within them. Would it be
     preferable to just keep the initial case (scalable terms equal and
     likely zero) in the header for inlining, and move all other cases
     into another function elsewhere to reduce code bloat a bit?

  6. Any suggestions for better names?

  7. Would it be beneficial to put the RFC in a phabricator review to make
     it easier to see changes?

  8. I will be at the devmeeting next week, so if anyone wants to chat
     about scalable vector support that would be very welcome.


-Graham

> On 1 Aug 2018, at 20:43, Hal Finkel <[hidden email]> wrote:
>
>
> On 08/01/2018 02:00 PM, Graham Hunter wrote:
>> Hi Hal,
>>
>>> On 30 Jul 2018, at 20:10, Hal Finkel <[hidden email]> wrote:
>>>
>>>
>>> On 07/30/2018 05:34 AM, Chandler Carruth wrote:
>>>> I strongly suspect that there remains widespread concern with the direction of this, I know I have them.
>>>>
>>>> I don't think that many of the people who have that concern have had time to come back to this RFC and make progress on it, likely because of other commitments or simply the amount of churn around SVE related patches and such. That is at least why I haven't had time to return to this RFC and try to write more detailed feedback.
>>>>
>>>> Certainly, I would want to see pretty clear and considered support for this change to the IR type system from Hal, Chris, Eric and/or other long time maintainers of core LLVM IR components before it moves forward, and I don't see that in this thread.
>>> At a high level, I'm happy with this approach. I think it will be important for LLVM to support runtime-determined vector lengths - I see the customizability and power-efficiency constraints that motivate these designs continuing to increase in importance. I'm still undecided on whether this makes vector code nicer even for fixed-vector-length architectures, but some of the design decisions that it forces, such as having explicit intrinsics for reductions and other horizontal operations, seem like the right direction regardless.
>> Thanks, that's good to hear.
>>
>>> 1.
>>>> This is a proposal for how to deal with querying the size of scalable types for
>>>>> analysis of IR. While it has not been implemented in full,
>>> Is this still true? The details here need to all work out, obviously, and we should make sure that any issues are identified.
>> Yes. I had hoped to get some more comments on the basic approach before progressing with the implementation, but if it makes more sense to have the implementation available to discuss then I'll start creating patches.
>
> At least on this point, I think that we'll want to have the
> implementation to help make sure there aren't important details we're
> overlooking.
>
>>
>>> 2. I know that there has been some discussion around support for changing the vector length during program execution (e.g., to account for some (proposed?) RISC-V feature), perhaps even during the execution of a single function. I'm very concerned about this idea because it is not at all clear to me how to limit information transfer contaminated with the vector size from propagating between different regions. As a result, I'm concerned about trying to add this on later, and so if this is part of the plan, I think that we need to think through the details up front because it could have a major impact on the design.
>> I think Robin's email yesterday covered it fairly nicely; this RFC proposes that the hardware length of vectors will be consistent throughout an entire function, so we don't need to limit information inside a function, just between them. For SVE, h/w vector length will likely be consistent across the whole program as well (assuming the programmer doesn't make a prctl call to the kernel to change it) so we could drop that limit too, but I thought it best to come up with a unified approach that would work for both architectures. The 'inherits_vscale' attribute would allow us to continue optimizing across functions for SVE where desired.
>
> I think that this will likely work, although I think we want to invert
> the sense of the attribute. vscale should be inherited by default, and
> some attribute can say that this isn't so. That same attribute, I
> imagine, will also forbid scalable vector function arguments and return
> values on those functions. If we don't have inherited vscale as the
> default, we place an implicit contract on any IR transformation that
> performs outlining: it needs to scan for certain kinds of vector
> operations and add the special attribute, or else always add the
> attribute. That becomes yet another special case, one which only
> actually manifests on certain platforms, and it's best avoided.
>
>>
>> Modelling the dynamic vector length for RVV is something for Robin (or others) to tackle later, but can be though of (at a high level) as an implicit predicate on all operations.
>
> My point is that, while there may be some sense in which the details can
> be worked out later, we need to have a good-enough understanding of how
> this will work now in order to make sure that we're not making design
> decisions now that make handling the dynamic vscale in a reasonable way
> later more difficult.
>
> Thanks again,
> Hal
>
>>
>> -Graham
>>
>>> Thanks again,
>>> Hal
>>>
>>> --
>>> Hal Finkel
>>> Lead, Compiler Technology and Programming Languages
>>> Leadership Computing Facility
>>> Argonne National Laboratory
>>>
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory


Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev
Hi Graham,

On Thu, 11 Oct 2018 at 15:14, Graham Hunter <[hidden email]> wrote:
>   1. I've changed Type::getPrimitiveSizeInBits to return 0 for vector types
>      and changed all uses of it with vector types to use
>      Type::getScalableSizeInBits instead, following the design in the RFC.
>
>      While this shows where getPrimitiveSizeInBits is used with vector types,
>      I think it would be better for now to change it back to avoid breaking
>      existing targets and put an assert in it to ensure that it's only used
>      on non-scalable vectors. We can revisit the decision later once
>      scalable vector support is more mature. Thoughts?

Another solution would be to make it return ScalableSize.Unscaled. At
least in a transition period.


>   2. There are two implementations of ScalableSize; one in Type.h, and one
>      in DataLayout.h. I'd prefer to only have one, but the former reports
>      sizes as 32bits while the latter uses 64bits.
>
>      I think changing all size queries to use 64bits is the best way to
>      resolve it -- are there any significant problems with that approach
>      aside from lots of code churn?
>
>      It would also be possible to use templates and typedefs, but I figure
>      unifying size reporting would be better.

Agreed.


>   3. I have only implemented 'strict' comparisons for now, which leads to
>      some possibly-surprising results; {X, 0} compared with {0, X} will
>      return false for both '==' and '<' comparisons, but true for '<='.
>
>      I think that supporting 'maybe' results from overloaded operators
>      would be a bad idea, so if/when we find cases where they are needed
>      then I think new functions should be written to cover those cases
>      and only used where it matters. For simple things like stopping
>      casts between scalable and non-scalable vectors the strict
>      comparisons should suffice.

How do you differentiate between maybe and certain?

Asserts making sure you never compare scalable with non-scalable in
the wrong way would be heavy handed, but are the only sure way to
avoid this pitfall.

A handler to make those comparisons safe (for example, returning
safety breach via argument pointer) would be lighter, but require big
code changes and won't work with overloaded operators.


>   4. Alignment for structs containing scalable types is tricky. For now,
>      I've added an assert to force all structs containing scalable vectors
>      to be packed.

I take it by "alignment" you mean element size (== structure size),
not structure alignment, which, IIUC, only depends on the ABI.

I remember vaguely that scalable vectors' alignment in memory is the
same as the unit vector's, and the unit vector is known at compile
time, just not the multiplicity.

Did I get that wrong?


>      It won't be possible to calculate correct offsets at compile time if
>      the minimum size of a struct member isn't a multiple of the required
>      alignment for the subsequent element(s).

I assume this would be either an ABI decision or an extension to the
standard, but we can re-use C99's VLA concepts, only here it's the
element size that is unknown, not just the element count.

This would keep the costs of unknown offsets until runtime to a minimal.

It would also make sure undefined behaviour while accessing
out-of-bounds offsets in a structure with SVE types break consistently
and early. :)

cheers,
--renato

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev
Hi Renato,

Thanks for taking a look.

> On 11 Oct 2018, at 15:57, Renato Golin <[hidden email]> wrote:
>
> Hi Graham,
>
> On Thu, 11 Oct 2018 at 15:14, Graham Hunter <[hidden email]> wrote:
>>  1. I've changed Type::getPrimitiveSizeInBits to return 0 for vector types
>>     and changed all uses of it with vector types to use
>>     Type::getScalableSizeInBits instead, following the design in the RFC.
>>
>>     While this shows where getPrimitiveSizeInBits is used with vector types,
>>     I think it would be better for now to change it back to avoid breaking
>>     existing targets and put an assert in it to ensure that it's only used
>>     on non-scalable vectors. We can revisit the decision later once
>>     scalable vector support is more mature. Thoughts?
>
> Another solution would be to make it return ScalableSize.Unscaled. At
> least in a transition period.

True, though there are places in the code that expect a size of 0 to mean
"this is a pointer", so using scalable vectors with that could lead to
incorrect code being generated instead of an obvious ICE.

>>  2. There are two implementations of ScalableSize; one in Type.h, and one
>>     in DataLayout.h. I'd prefer to only have one, but the former reports
>>     sizes as 32bits while the latter uses 64bits.
>>
>>     I think changing all size queries to use 64bits is the best way to
>>     resolve it -- are there any significant problems with that approach
>>     aside from lots of code churn?
>>
>>     It would also be possible to use templates and typedefs, but I figure
>>     unifying size reporting would be better.
>
> Agreed.
>
>
>>  3. I have only implemented 'strict' comparisons for now, which leads to
>>     some possibly-surprising results; {X, 0} compared with {0, X} will
>>     return false for both '==' and '<' comparisons, but true for '<='.
>>
>>     I think that supporting 'maybe' results from overloaded operators
>>     would be a bad idea, so if/when we find cases where they are needed
>>     then I think new functions should be written to cover those cases
>>     and only used where it matters. For simple things like stopping
>>     casts between scalable and non-scalable vectors the strict
>>     comparisons should suffice.
>
> How do you differentiate between maybe and certain?

This work is biased towards 'true' being valid if and only if the condition
holds for all possible values of vscale. This does mean that returning a
'false' in some cases may be incorrect, since the result could be true for
some (but not all) vscale values.

I don't know if 'maybe' results are useful on their own yet.

> Asserts making sure you never compare scalable with non-scalable in
> the wrong way would be heavy handed, but are the only sure way to
> avoid this pitfall.
>
> A handler to make those comparisons safe (for example, returning
> safety breach via argument pointer) would be lighter, but require big
> code changes and won't work with overloaded operators.

My initial intention was for most existing code (especially in target
specific code for targets without scalable vectors) to continue using
the unscaled-only interfaces; there's also common code which is guarded
by a check for scalar types before querying size. I haven't counted up
all the cases that would need to change, but the majority will be fine
as is.

Do you think that implementing the comparisons without operator overloading
would be preferable? I know that APInt does this, so it wouldn't be
unprecedented in the codebase -- I was just trying to fit the existing code
without changing too much, but maybe that's the wrong approach.

Either passing in a pointer as you suggest, or returning an 'ErrorOr<bool>'
as a result would allow appropriate boolean results through and require
the calling code to handle 'maybes' (which could just mean bailing out of
whatever transformation was about to be performed).

I'll take a look through some uses of DataLayout to see how well that would
work.

>>  4. Alignment for structs containing scalable types is tricky. For now,
>>     I've added an assert to force all structs containing scalable vectors
>>     to be packed.
>
> I take it by "alignment" you mean element size (== structure size),
> not structure alignment, which IIUC, only depends on the ABI.

I mean alignment of elements within a struct, which does indeed determine
structure size.

> I remember vaguely that scalable vectors' alignment in memory is the
> same as the unit vector's, and the unit vector is known at compile
> time, just not the multiplicity.
>
> Did I get that wrong?

That's correct, but data vectors (Z registers) and predicate vectors
(P registers) have different unit vector sizes: 128bits vs 16bits,
respectively.

We could insist that predicate vectors take up the same space as data
vectors, but that will waste some space.

>>     It won't be possible to calculate correct offsets at compile time if
>>     the minimum size of a struct member isn't a multiple of the required
>>     alignment for the subsequent element(s).
>
> I assume this would be either an ABI decision or an extension to the
> standard, but we can re-use C99's VLA concepts, only here it's the
> element size that is unknown, not just the element count.
>
> This would keep the costs of unknown offsets until runtime to a minimal.

Sure, it's something to handle at the ABI level, so I'd like to know if
RVV or NEC's vector architecture have any special requirements here.

I would hope that sufficient advice to the programmer would keep this from
being a common problem, with predicate vectors always placed after data
vectors, but we do need to make sure it will work the other way round.

-Graham

>
> It would also make sure undefined behaviour while accessing
> out-of-bounds offsets in a structure with SVE types break consistently
> and early. :)
>
> cheers,
> --renato

_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev
On Mon, 15 Oct 2018 at 12:04, Graham Hunter <[hidden email]> wrote:
> > Another solution would be to make it return ScalableSize.Unscaled. At
> > least in a transition period.
>
> True, though there are places in the code that expect a size of 0 to mean
> "this is a pointer", so using scalable vectors with that could lead to
> incorrect code being generated instead of an obvious ICE.

I see.


> This work is biased towards 'true' being valid if and only if the condition
> holds for all possible values of vscale. This does mean that returning a
> 'false' in some cases may be incorrect, since the result could be true for
> some (but not all) vscale values.

I wonder if there's any case where returning a wrong 'false' would be
problematic.

I can't think of anything, so I agree with your approach. :)


> Do you think that implementing the comparisons without operator overloading
> would be preferable? I know that APInt does this, so it wouldn't be
> unprecedented in the codebase -- I was just trying to fit the existing code
> without changing too much, but maybe that's the wrong approach.

No, I think keeping the code changes to a minimum is important.

And the problems will only arise for scalable vs. non-scalable comparisons,
which don't exist today, so I'm not expecting anything current to break.


> I'll take a look through some uses of DataLayout to see how well that would
> work.

Thanks! If we can solve that in a simple way, good. If not, I don't
see it as a big deal, for now.


> That's correct, but data vectors (Z registers) and predicate vectors
> (P registers) have different unit vector sizes: 128bits vs 16bits,
> respectively.

Ah, I see. I imagine P regs will need padding to the maximum alignment
in (almost?) all cases.

cheers,
--renato

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev
In reply to this post by Nicholas Krause via llvm-dev
Hi,


Following various discussions at the recent devmeeting, I've posted an RFC for
scalable vector IR type alone on phabricator: https://reviews.llvm.org/D53695

There's a couple of changes, and I posted that as a separate revision on top
of the previous text so changes are visible.

The main differences are:

  - Size comparisons between unscaled and scaled vector types are considered
    invalid for now, and will assert.

  - Scalable vector types cannot be members of StructTypes or ArrayTypes. If
    these are needed at the C level (e.g. the SVE ACLE C intrinsics), then
    clang must perform lowering to a pointer + vscale-based arithmetic instead
    of creating aggregates in IR.

I will update the IR type patch and size query patch soon.

-Graham


> On 11 Oct 2018, at 15:14, Graham Hunter via llvm-dev <[hidden email]> wrote:
>
> Hi,
>
> Sorry for the delay, but I now have an initial implementation of size queries
> for scalable types on phabricator:
>
> https://reviews.llvm.org/D53137 and https://reviews.llvm.org/D53138
>
> This isn't complete (I haven't used the DataLayout version outside of the tests),
> but I'd like to get some feedback before making further changes.
>
> Some notes/questions:
>
>  1. I've changed Type::getPrimitiveSizeInBits to return 0 for vector types
>     and changed all uses of it with vector types to use
>     Type::getScalableSizeInBits instead, following the design in the RFC.
>
>     While this shows where getPrimitiveSizeInBits is used with vector types,
>     I think it would be better for now to change it back to avoid breaking
>     existing targets and put an assert in it to ensure that it's only used
>     on non-scalable vectors. We can revisit the decision later once
>     scalable vector support is more mature. Thoughts?
>
>  2. There are two implementations of ScalableSize; one in Type.h, and one
>     in DataLayout.h. I'd prefer to only have one, but the former reports
>     sizes as 32bits while the latter uses 64bits.
>
>     I think changing all size queries to use 64bits is the best way to
>     resolve it -- are there any significant problems with that approach
>     aside from lots of code churn?
>
>     It would also be possible to use templates and typedefs, but I figure
>     unifying size reporting would be better.
>
>  3. I have only implemented 'strict' comparisons for now, which leads to
>     some possibly-surprising results; {X, 0} compared with {0, X} will
>     return false for both '==' and '<' comparisons, but true for '<='.
>
>     I think that supporting 'maybe' results from overloaded operators
>     would be a bad idea, so if/when we find cases where they are needed
>     then I think new functions should be written to cover those cases
>     and only used where it matters. For simple things like stopping
>     casts between scalable and non-scalable vectors the strict
>     comparisons should suffice.
>
>  4. Alignment for structs containing scalable types is tricky. For now,
>     I've added an assert to force all structs containing scalable vectors
>     to be packed.
>
>     It won't be possible to calculate correct offsets at compile time if
>     the minimum size of a struct member isn't a multiple of the required
>     alignment for the subsequent element(s).
>
>     Instead, a runtime calculation will be required. This could arise in
>     SVE if a predicate register (minimum 2 bytes) were used followed by
>     an aligned data vector -- it could be aligned, but it could also
>     require adding up to 14 bytes of padding to reach minimum alignment
>     for data vectors.
>
>     The proposed ACLE does allow creating sizeless structs with both
>     predicate and data registers so we can't forbid such structs, but it
>     makes no guarantees about alignment -- it's implementation defined.
>
>     Do any of the other architectures with scalable vectors have any
>     particular requirements for this?
>
>  5. The comparison operators contain all cases within them. Would it be
>     preferable to just keep the initial case (scalable terms equal and
>     likely zero) in the header for inlining, and move all other cases
>     into another function elsewhere to reduce code bloat a bit?
>
>  6. Any suggestions for better names?
>
>  7. Would it be beneficial to put the RFC in a phabricator review to make
>     it easier to see changes?
>
>  8. I will be at the devmeeting next week, so if anyone wants to chat
>     about scalable vector support that would be very welcome.
>
>
> -Graham
>
>> On 1 Aug 2018, at 20:43, Hal Finkel <[hidden email]> wrote:
>>
>>
>> On 08/01/2018 02:00 PM, Graham Hunter wrote:
>>> Hi Hal,
>>>
>>>> On 30 Jul 2018, at 20:10, Hal Finkel <[hidden email]> wrote:
>>>>
>>>>
>>>> On 07/30/2018 05:34 AM, Chandler Carruth wrote:
>>>>> I strongly suspect that there remains widespread concern with the direction of this, I know I have them.
>>>>>
>>>>> I don't think that many of the people who have that concern have had time to come back to this RFC and make progress on it, likely because of other commitments or simply the amount of churn around SVE related patches and such. That is at least why I haven't had time to return to this RFC and try to write more detailed feedback.
>>>>>
>>>>> Certainly, I would want to see pretty clear and considered support for this change to the IR type system from Hal, Chris, Eric and/or other long time maintainers of core LLVM IR components before it moves forward, and I don't see that in this thread.
>>>> At a high level, I'm happy with this approach. I think it will be important for LLVM to support runtime-determined vector lengths - I see the customizability and power-efficiency constraints that motivate these designs continuing to increase in importance. I'm still undecided on whether this makes vector code nicer even for fixed-vector-length architectures, but some of the design decisions that it forces, such as having explicit intrinsics for reductions and other horizontal operations, seem like the right direction regardless.
>>> Thanks, that's good to hear.
>>>
>>>> 1.
>>>>> This is a proposal for how to deal with querying the size of scalable types for
>>>>>> analysis of IR. While it has not been implemented in full,
>>>> Is this still true? The details here need to all work out, obviously, and we should make sure that any issues are identified.
>>> Yes. I had hoped to get some more comments on the basic approach before progressing with the implementation, but if it makes more sense to have the implementation available to discuss then I'll start creating patches.
>>
>> At least on this point, I think that we'll want to have the
>> implementation to help make sure there aren't important details we're
>> overlooking.
>>
>>>
>>>> 2. I know that there has been some discussion around support for changing the vector length during program execution (e.g., to account for some (proposed?) RISC-V feature), perhaps even during the execution of a single function. I'm very concerned about this idea because it is not at all clear to me how to limit information transfer contaminated with the vector size from propagating between different regions. As a result, I'm concerned about trying to add this on later, and so if this is part of the plan, I think that we need to think through the details up front because it could have a major impact on the design.
>>> I think Robin's email yesterday covered it fairly nicely; this RFC proposes that the hardware length of vectors will be consistent throughout an entire function, so we don't need to limit information inside a function, just between them. For SVE, h/w vector length will likely be consistent across the whole program as well (assuming the programmer doesn't make a prctl call to the kernel to change it) so we could drop that limit too, but I thought it best to come up with a unified approach that would work for both architectures. The 'inherits_vscale' attribute would allow us to continue optimizing across functions for SVE where desired.
>>
>> I think that this will likely work, although I think we want to invert
>> the sense of the attribute. vscale should be inherited by default, and
>> some attribute can say that this isn't so. That same attribute, I
>> imagine, will also forbid scalable vector function arguments and return
>> values on those functions. If we don't have inherited vscale as the
>> default, we place an implicit contract on any IR transformation that
>> performs outlining that it needs to scan for certain kinds of vector
>> operations and add the special attribute, or just always add this
>> special attribute, and that just becomes another special case, which
>> will only actually manifest on certain platforms, that it's best to avoid.
>>
>>>
>>> Modelling the dynamic vector length for RVV is something for Robin (or others) to tackle later, but can be though of (at a high level) as an implicit predicate on all operations.
>>
>> My point is that, while there may be some sense in which the details can
>> be worked out later, we need to have a good-enough understanding of how
>> this will work now in order to make sure that we're not making design
>> decisions now that make handling the dynamic vscale in a reasonable way
>> later more difficult.
>>
>> Thanks again,
>> Hal
>>
>>>
>>> -Graham
>>>
>>>> Thanks again,
>>>> Hal
>>>>
>>>> --
>>>> Hal Finkel
>>>> Lead, Compiler Technology and Programming Languages
>>>> Leadership Computing Facility
>>>> Argonne National Laboratory
>>>>
>>
>> --
>> Hal Finkel
>> Lead, Compiler Technology and Programming Languages
>> Leadership Computing Facility
>> Argonne National Laboratory
>


Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev
On Thu, 25 Oct 2018 at 14:09, Graham Hunter <[hidden email]> wrote:
>   - Size comparisons between unscaled and scaled vector types are considered
>     invalid for now, and will assert.

Sounds like a safe approach.


>   - Scalable vector types cannot be members of StructTypes or ArrayTypes. If
>     these are needed at the C level (e.g. the SVE ACLE C intrinsics), then
>     clang must perform lowering to a pointer + vscale-based arithmetic instead
>     of creating aggregates in IR.

Perhaps also mark them as noalias, given that the front-end has full
control over its lifetime?


> I will update the IR type patch and size query patch soon.

I'll have a look, thanks!
--renato

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev
In reply to this post by Nicholas Krause via llvm-dev
Hi Graham,

I am working on a custom target that is considering a scalable vector type representation in its programming language. While collecting information about this, I came across your RFC. I have a question: I think one of the fundamental issues is that we do not know the memory layout of the type at compile time, and I am not sure whether the RFC covers this issue. Conservatively, I imagined using the memory layout of the biggest type which the scalable vector type can support. I may have missed some of the discussion about this; if so, please let me know.

Thanks,
JinGu Kang


From: llvm-dev <[hidden email]> on behalf of Graham Hunter via llvm-dev <[hidden email]>
Sent: 05 June 2018 14:15
To: Chris Lattner; Hal Finkel; Jones, Joel; [hidden email]; Renato Golin; Kristof Beyls; Amara Emerson; Florian Hahn; Sander De Smalen; Robin Kruppe; [hidden email]; [hidden email]; Sjoerd Meijer; Sam Parker
Cc: nd
Subject: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
 
Hi,

Now that Sander has committed enough MC support for SVE, here's an updated
RFC for variable length vector support with a set of 14 patches (listed at the end)
to demonstrate code generation for SVE using the extensions proposed in the RFC.

I have some ideas about how to support RISC-V's upcoming extension alongside
SVE; I'll send an email with some additional comments on Robin's RFC later.

Feedback and questions welcome.

-Graham

=============================================================
Supporting SIMD instruction sets with variable vector lengths
=============================================================

In this RFC we propose extending LLVM IR to support code-generation for variable
length vector architectures like Arm's SVE or RISC-V's 'V' extension. Our
approach is backwards compatible and should be as non-intrusive as possible; the
only change needed in other backends is how size is queried on vector types, and
it only requires a change in which function is called. We have created a set of
proof-of-concept patches to represent a simple vectorized loop in IR and
generate SVE instructions from that IR. These patches (listed in section 7 of
this rfc) can be found on Phabricator and are intended to illustrate the scope
of changes required by the general approach described in this RFC.

==========
Background
==========

*ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension for
AArch64 which is intended to scale with hardware such that the same binary
running on a processor with longer vector registers can take advantage of the
increased compute power without recompilation.

As the vector length is no longer a compile-time known value, the way in which
the LLVM vectorizer generates code requires modifications such that certain
values are now runtime evaluated expressions instead of compile-time constants.

Documentation for SVE can be found at
https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a

========
Contents
========

The rest of this RFC covers the following topics:

1. Types -- a proposal to extend VectorType to be able to represent vectors that
   have a length which is a runtime-determined multiple of a known base length.

2. Size Queries - how to reason about the size of types for which the size isn't
   fully known at compile time.

3. Representing the runtime multiple of vector length in IR for use in address
   calculations and induction variable comparisons.

4. Generating 'constant' values in IR for vectors with a runtime-determined
   number of elements.

5. A brief note on code generation of these new operations for AArch64.

6. An example of C code and matching IR using the proposed extensions.

7. A list of patches demonstrating the changes required to emit SVE instructions
   for a loop that has already been vectorized using the extensions described
   in this RFC.

========
1. Types
========

To represent a vector of unknown length a boolean `Scalable` property has been
added to the `VectorType` class, which indicates that the number of elements in
the vector is a runtime-determined integer multiple of the `NumElements` field.
Most code that deals with vectors doesn't need to know the exact length, but
does need to know relative lengths -- e.g. get a vector with the same number of
elements but a different element type, or with half or double the number of
elements.

In order to allow code to transparently support scalable vectors, we introduce
an `ElementCount` class with two members:

- `unsigned Min`: the minimum number of elements.
- `bool Scalable`: is the element count an unknown multiple of `Min`?

For non-scalable vectors (``Scalable=false``) the scale is considered to be
equal to one and thus `Min` represents the exact number of elements in the
vector.

The intent for code working with vectors is to use convenience methods and avoid
directly dealing with the number of elements. If needed, calling
`getElementCount` on a vector type instead of `getVectorNumElements` can be used
to obtain the (potentially scalable) number of elements. Overloaded division and
multiplication operators allow an ElementCount instance to be used in much the
same manner as an integer for most cases.

This mixture of compile-time and runtime quantities allows us to reason about the
relationship between different scalable vector types without knowing their
exact length.

The runtime multiple is not expected to change during program execution for SVE,
but it is possible. The model of scalable vectors presented in this RFC assumes
that the multiple will be constant within a function but not necessarily across
functions. As suggested in the recent RISC-V rfc, a new function attribute to
inherit the multiple across function calls will allow for function calls with
vector arguments/return values and inlining/outlining optimizations.

IR Textual Form
---------------

The textual form for a scalable vector is:

``<scalable <n> x <type>>``

where `type` is the scalar type of each element, `n` is the minimum number of
elements, and the string literal `scalable` indicates that the total number of
elements is an unknown multiple of `n`; `scalable` is just an arbitrary choice
for indicating that the vector is scalable, and could be substituted by another
keyword. For fixed-length vectors, `scalable` is omitted, so there is no change
in the format for existing vectors.

Scalable vectors with the same `Min` value have the same number of elements, and
the same number of bytes if `Min * sizeof(type)` is the same (assuming they are
used within the same function):

``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same number of
  elements.

``<scalable 4 x i32>`` and ``<scalable 8 x i16>`` have the same number of
  bytes.

IR Bitcode Form
---------------

To serialize scalable vectors to bitcode, a new boolean field is added to the
type record. If the field is not present the type will default to a fixed-length
vector type, preserving backwards compatibility.

Alternatives Considered
-----------------------

We did consider one main alternative -- a dedicated target type, like the
x86_mmx type.

A dedicated target type would either need to extend all existing passes that
work with vectors to recognize the new type, or to duplicate all that code
in order to get reasonable code generation and autovectorization.

This hasn't been done for the x86_mmx type, and so it is only capable of
providing support for C-level intrinsics instead of being used and recognized by
passes inside llvm.

Although our current solution will need to change some of the code that creates
new VectorTypes, much of that code doesn't need to care about whether the types
are scalable or not -- they can use preexisting methods like
`getHalfElementsVectorType`. If the code is a little more complex,
`ElementCount` structs can be used instead of an `unsigned` value to represent
the number of elements.

===============
2. Size Queries
===============

This is a proposal for how to deal with querying the size of scalable types.
While it has not been implemented in full, the general approach works well
for calculating offsets into structures with scalable types in a modified
version of ComputeValueVTs in our downstream compiler.

Current IR types that have a known size all return a single integer constant.
For scalable types a second integer is needed to indicate the number of bits
which need to be scaled by the runtime multiple to obtain the actual length.

For primitive types, getPrimitiveSizeInBits will function as it does today,
except that it will no longer return a size for vector types (it will return 0,
as it does for other derived types). The majority of calls to this function are
already for scalar rather than vector types.

For derived types, a function (getSizeExpressionInBits) to return a pair of
integers (one to indicate unscaled bits, the other for bits that need to be
scaled by the runtime multiple) will be added. For backends that do not need to
deal with scalable types, another function (getFixedSizeExpressionInBits) that
only returns unscaled bits will be provided, with a debug assert that the type
isn't scalable.

Similar functionality will be added to DataLayout.

Comparing two of these sizes together is straightforward if only unscaled sizes
are used. Comparisons between scaled sizes are also simple when comparing sizes
within a function (or across functions with the inherit flag mentioned in the
changes to the type), but cannot be compared otherwise. If a mix is present,
then any number of unscaled bits will not be considered to have a greater size
than a smaller number of scaled bits, but a smaller number of unscaled bits
will be considered to have a smaller size than a greater number of scaled bits
(since the runtime multiple is at least one).

Future Work
-----------

Since we cannot determine the exact size of a scalable vector, the
existing logic for alias detection won't work when multiple accesses
share a common base pointer with different offsets.

However, SVE's predication will mean that a dynamic 'safe' vector length
can be determined at runtime, so after initial support has been added we
can work on vectorizing loops using runtime predication to avoid aliasing
problems.

Alternatives Considered
-----------------------

Marking scalable vectors as unsized doesn't work well, as many parts of
llvm dealing with loads and stores assert that 'isSized()' returns true
and make use of the size when calculating offsets.

We have considered introducing multiple helper functions instead of
using direct size queries, but that doesn't cover all cases. It may
still be a good idea to introduce them to make the purpose in a given
case more obvious, e.g. 'isBitCastableTo(Type*,Type*)'.

========================================
3. Representing Vector Length at Runtime
========================================

With a scalable vector type defined, we now need a way to represent the runtime
length in IR in order to generate addresses for consecutive vectors in memory
and determine how many elements have been processed in an iteration of a loop.

We have added an experimental `vscale` intrinsic to represent the runtime
multiple. Multiplying the result of this intrinsic by the minimum number of
elements in a vector gives the total number of elements in a scalable vector.

Fixed-Length Code
-----------------

Assuming a vector type of <4 x <ty>>
``
vector.body:
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  ;; <loop body>
  ;; Increment induction var
  %index.next = add i64 %index, 4
  ;; <check and branch>
``
Scalable Equivalent
-------------------

Assuming a vector type of <scalable 4 x <ty>>
``
vector.body:
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  ;; <loop body>
  ;; Increment induction var
  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
  %vscale64x4 = mul i64 %vscale64, 4
  %index.next = add i64 %index, %vscale64x4
  ;; <check and branch>
``
===========================
4. Generating Vector Values
===========================
For constant vector values, we cannot specify all the elements as we can for
fixed-length vectors; fortunately only a small number of easily synthesized
patterns are required for autovectorization. The `zeroinitializer` constant
can be used in the same manner as fixed-length vectors for a constant zero
splat. This can then be combined with `insertelement` and `shufflevector`
to create arbitrary value splats in the same manner as fixed-length vectors.

For constants consisting of a sequence of values, an experimental `stepvector`
intrinsic has been added to represent a simple constant of the form
`<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new
start can be added, and changing the step requires multiplying by a splat.

Fixed-Length Code
-----------------
``
  ;; Splat a value
  %insert = insertelement <4 x i32> undef, i32 %value, i32 0
  %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x i32> zeroinitializer
  ;; Add a constant sequence
  %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
``
Scalable Equivalent
-------------------
``
  ;; Splat a value
  %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
  %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
  ;; Splat offset + stride (the same in this case)
  %insert2 = insertelement <scalable 4 x i32> undef, i32 2, i32 0
  %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
  ;; Create sequence for scalable vector
  %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
  %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
  %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
  ;; Add the runtime-generated sequence
  %add = add <scalable 4 x i32> %splat, %addoffset
``
Future Work
-----------

Intrinsics cannot currently be used for constant folding. Our downstream
compiler (using Constants instead of intrinsics) relies quite heavily on this
for good code generation, so we will need to find new ways to recognize and
fold these values.

==================
5. Code Generation
==================

IR splats will be converted to an experimental splatvector intrinsic in
SelectionDAGBuilder.

All three intrinsics are custom lowered and legalized in the AArch64 backend.

Two new AArch64ISD nodes have been added to represent the same concepts
at the SelectionDAG level, while splatvector maps onto the existing
AArch64ISD::DUP.

GlobalISel
----------

Since GlobalISel was enabled by default on AArch64, it was necessary to add
scalable vector support to the LowLevelType implementation. A single bit was
added to the raw_data representation for vectors and vectors of pointers.

In addition, types that only exist in destination patterns are planted in
the enumeration of available types for generated code. While this may not be
necessary in future, generating an all-true 'ptrue' value was necessary to
convert a predicated instruction into an unpredicated one.
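The encoding change can be pictured as one extra bit in a packed descriptor. The layout below is hypothetical (the real LowLevelType bit assignment differs), but it illustrates how a single 'scalable' flag can ride alongside the element count and element size in the raw data:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical packed vector-type descriptor with a single 'scalable' bit,
// in the spirit of the LLT change. The real LLT layout is different.
struct ToyLLT {
    uint32_t raw;

    static ToyLLT vector(uint32_t minElts, uint32_t eltBits, bool scalable) {
        // [31:17] min element count, [16:1] element size in bits,
        // [0] scalable flag.
        return { (minElts << 17) | (eltBits << 1) | (scalable ? 1u : 0u) };
    }
    uint32_t minElements() const { return raw >> 17; }
    uint32_t eltSizeBits() const { return (raw >> 1) & 0xFFFFu; }
    bool isScalable() const { return (raw & 1u) != 0; }
};
```

A fixed `<4 x i32>` and a `<scalable 4 x i32>` then differ only in that one bit, which is what lets existing LLT queries keep working unchanged.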

==========
6. Example
==========

The following example shows a simple C loop which assigns the array index to
the array elements matching that index. The IR shows how vscale and stepvector
are used to create the needed values and to advance the index variable in the
loop.

C Code
------

``
void IdentityArrayInit(int *a, int count) {
  for (int i = 0; i < count; ++i)
    a[i] = i;
}
``

Scalable IR Vector Body
-----------------------

``
vector.body.preheader:
  ;; Other setup
  ;; Stepvector used to create initial identity vector
  %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
  br label %vector.body

vector.body:
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]

           ;; stepvector used for index identity on entry to loop body ;;
  %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],
                                     [ %stepvector, %vector.body.preheader ]
  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
  %vscale32 = trunc i64 %vscale64 to i32
  %1 = add i64 %0, mul (i64 %vscale64, i64 4)

           ;; vscale splat used to increment identity vector ;;
  %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32 %vscale32, i32 4), i32 0
  %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
  %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
  %2 = getelementptr inbounds i32, i32* %a, i64 %0
  %3 = bitcast i32* %2 to <scalable 4 x i32>*
  store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align 4

           ;; vscale used to increment loop index
  %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
  %4 = icmp eq i64 %index.next, %n.vec
  br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
``
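To make the semantics of the vector body concrete, here is a scalar C++ model for an assumed vscale value (the constant 4 is the type's minimum element count; the scalar tail loop stands in for the middle.block handling, which the IR above elides):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar model of the vectorized identity loop for an assumed vscale.
// Each "vector iteration" stores 4 * vscale consecutive index values,
// matching the store of %vec.ind7 in the IR above.
void identityArrayInit(int *a, int count, std::size_t vscale) {
    std::size_t vl = 4 * vscale;                       // lanes per iteration
    std::size_t nvec = (std::size_t(count) / vl) * vl; // %n.vec equivalent
    for (std::size_t index = 0; index < nvec; index += vl)
        for (std::size_t lane = 0; lane < vl; ++lane)
            a[index + lane] = static_cast<int>(index + lane);
    for (std::size_t i = nvec; i < std::size_t(count); ++i) // scalar tail
        a[i] = static_cast<int>(i);
}
```

Note that the loop structure is identical for every vscale; only the stride of the induction variable changes, which is exactly what the `mul (%vscale64, 4)` increments express.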

==========
7. Patches
==========

List of patches:

1. Extend VectorType: https://reviews.llvm.org/D32530
2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768
3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
5. SVE Calling Convention: https://reviews.llvm.org/D47771
6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
7. Add VScale intrinsic: https://reviews.llvm.org/D47773
8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
10. Initial store patterns: https://reviews.llvm.org/D47776
11. Initial addition patterns: https://reviews.llvm.org/D47777
12. Initial left-shift patterns: https://reviews.llvm.org/D47778
13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
14. Prevectorized loop unit test: https://reviews.llvm.org/D47780

_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev


Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev
In the RISC-V V extension, there is no upper limit to the size vector
registers can be in a future CPU. (Formally, the upper limit is at
least 2^31 bytes)

Generic code can enquire the size, dynamically allocate space, and
transparently save and restore the contents of a vector register or
registers.
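That generic save/restore pattern might be sketched as follows; `queryVectorRegBytes` is a stand-in for reading the actual register width at runtime (e.g. RISC-V's vlenb CSR, stubbed here with an arbitrary value), and the copy is a plain byte copy rather than real vector load/store instructions:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Stand-in for querying the vector register size at runtime (on RISC-V
// this would read the vlenb CSR); 32 is an arbitrary assumed width.
std::size_t queryVectorRegBytes() { return 32; }

// Save a "register" of runtime-determined size into dynamically
// allocated storage.
std::vector<unsigned char> saveReg(const unsigned char *reg) {
    std::size_t n = queryVectorRegBytes();
    return std::vector<unsigned char>(reg, reg + n);
}

// Restore the previously saved contents.
void restoreReg(unsigned char *reg, const std::vector<unsigned char> &saved) {
    std::memcpy(reg, saved.data(), saved.size());
}
```

The point is that no compile-time upper bound is required: the caller sizes its storage from the runtime query.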

On Fri, May 24, 2019 at 11:28 AM JinGu Kang via llvm-dev
<[hidden email]> wrote:

>
> Hi Graham,
>
> I am working on a custom target that is considering a scalable vector type representation in its programming language. While collecting information about it, I came across your RFC, and I have a question. I think one of the fundamental issues is that we do not know the memory layout of the type at compile time. I am not sure whether the RFC covers this issue. Conservatively, I imagined using the memory layout of the biggest type which the scalable vector type can support. I may have missed some discussions about this; if so, please let me know.
>
> Thanks,
> JinGu Kang
>
> ________________________________
> From: llvm-dev <[hidden email]> on behalf of Graham Hunter via llvm-dev <[hidden email]>
> Sent: 05 June 2018 14:15
> To: Chris Lattner; Hal Finkel; Jones, Joel; [hidden email]; Renato Golin; Kristof Beyls; Amara Emerson; Florian Hahn; Sander De Smalen; Robin Kruppe; [hidden email]; [hidden email]; Sjoerd Meijer; Sam Parker
> Cc: nd
> Subject: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
>
> Hi,
>
> Now that Sander has committed enough MC support for SVE, here's an updated
> RFC for variable length vector support with a set of 14 patches (listed at the end)
> to demonstrate code generation for SVE using the extensions proposed in the RFC.
>
> I have some ideas about how to support RISC-V's upcoming extension alongside
> SVE; I'll send an email with some additional comments on Robin's RFC later.
>
> Feedback and questions welcome.
>
> -Graham
>
> =============================================================
> Supporting SIMD instruction sets with variable vector lengths
> =============================================================
>
> In this RFC we propose extending LLVM IR to support code-generation for variable
> length vector architectures like Arm's SVE or RISC-V's 'V' extension. Our
> approach is backwards compatible and should be as non-intrusive as possible; the
> only change needed in other backends is how size is queried on vector types, and
> it only requires a change in which function is called. We have created a set of
> proof-of-concept patches to represent a simple vectorized loop in IR and
> generate SVE instructions from that IR. These patches (listed in section 7 of
> this rfc) can be found on Phabricator and are intended to illustrate the scope
> of changes required by the general approach described in this RFC.
>
> ==========
> Background
> ==========
>
> *ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension for
> AArch64 which is intended to scale with hardware such that the same binary
> running on a processor with longer vector registers can take advantage of the
> increased compute power without recompilation.
>
> As the vector length is no longer a compile-time known value, the way in which
> the LLVM vectorizer generates code requires modifications such that certain
> values are now runtime evaluated expressions instead of compile-time constants.
>
> Documentation for SVE can be found at
> https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
>
> ========
> Contents
> ========
>
> The rest of this RFC covers the following topics:
>
> 1. Types -- a proposal to extend VectorType to be able to represent vectors that
>    have a length which is a runtime-determined multiple of a known base length.
>
> 2. Size Queries - how to reason about the size of types for which the size isn't
>    fully known at compile time.
>
> 3. Representing the runtime multiple of vector length in IR for use in address
>    calculations and induction variable comparisons.
>
> 4. Generating 'constant' values in IR for vectors with a runtime-determined
>    number of elements.
>
> 5. A brief note on code generation of these new operations for AArch64.
>
> 6. An example of C code and matching IR using the proposed extensions.
>
> 7. A list of patches demonstrating the changes required to emit SVE instructions
>    for a loop that has already been vectorized using the extensions described
>    in this RFC.
>
> ========
> 1. Types
> ========
>
> To represent a vector of unknown length a boolean `Scalable` property has been
> added to the `VectorType` class, which indicates that the number of elements in
> the vector is a runtime-determined integer multiple of the `NumElements` field.
> Most code that deals with vectors doesn't need to know the exact length, but
> does need to know relative lengths -- e.g. get a vector with the same number of
> elements but a different element type, or with half or double the number of
> elements.
>
> In order to allow code to transparently support scalable vectors, we introduce
> an `ElementCount` class with two members:
>
> - `unsigned Min`: the minimum number of elements.
> - `bool Scalable`: is the element count an unknown multiple of `Min`?
>
> For non-scalable vectors (``Scalable=false``) the scale is considered to be
> equal to one and thus `Min` represents the exact number of elements in the
> vector.
>
> The intent for code working with vectors is to use convenience methods and avoid
> directly dealing with the number of elements. If needed, calling
> `getElementCount` on a vector type instead of `getVectorNumElements` can be used
> to obtain the (potentially scalable) number of elements. Overloaded division and
> multiplication operators allow an ElementCount instance to be used in much the
> same manner as an integer for most cases.
>
> This mixture of compile-time and runtime quantities allows us to reason about the
> relationship between different scalable vector types without knowing their
> exact length.
>
> The runtime multiple is not expected to change during program execution for SVE,
> but it is possible. The model of scalable vectors presented in this RFC assumes
> that the multiple will be constant within a function but not necessarily across
> functions. As suggested in the recent RISC-V rfc, a new function attribute to
> inherit the multiple across function calls will allow for function calls with
> vector arguments/return values and inlining/outlining optimizations.
>
> IR Textual Form
> ---------------
>
> The textual form for a scalable vector is:
>
> ``<scalable <n> x <type>>``
>
> where `type` is the scalar type of each element, `n` is the minimum number of
> elements, and the string literal `scalable` indicates that the total number of
> elements is an unknown multiple of `n`; `scalable` is just an arbitrary choice
> for indicating that the vector is scalable, and could be substituted by another.
> For fixed-length vectors, the `scalable` is omitted, so there is no change in
> the format for existing vectors.
>
> Scalable vectors with the same `Min` value have the same number of elements, and
> the same number of bytes if `Min * sizeof(type)` is the same (assuming they are
> used within the same function):
>
> ``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same number of
>   elements.
>
> ``<scalable 4 x i32>`` and ``<scalable 8 x i16>`` have the same number of
>   bytes.
>
> IR Bitcode Form
> ---------------
>
> To serialize scalable vectors to bitcode, a new boolean field is added to the
> type record. If the field is not present the type will default to a fixed-length
> vector type, preserving backwards compatibility.
>
> Alternatives Considered
> -----------------------
>
> We did consider one main alternative -- a dedicated target type, like the
> x86_mmx type.
>
> A dedicated target type would either need to extend all existing passes that
> work with vectors to recognize the new type, or to duplicate all that code
> in order to get reasonable code generation and autovectorization.
>
> This hasn't been done for the x86_mmx type, and so it is only capable of
> providing support for C-level intrinsics instead of being used and recognized by
> passes inside llvm.
>
> Although our current solution will need to change some of the code that creates
> new VectorTypes, much of that code doesn't need to care about whether the types
> are scalable or not -- they can use preexisting methods like
> `getHalfElementsVectorType`. If the code is a little more complex,
> `ElementCount` structs can be used instead of an `unsigned` value to represent
> the number of elements.
>
> ===============
> 2. Size Queries
> ===============
>
> This is a proposal for how to deal with querying the size of scalable types.
> While it has not been implemented in full, the general approach works well
> for calculating offsets into structures with scalable types in a modified
> version of ComputeValueVTs in our downstream compiler.
>
> Current IR types that have a known size all return a single integer constant.
> For scalable types a second integer is needed to indicate the number of bytes
> which need to be scaled by the runtime multiple to obtain the actual length.
>
> For primitive types, getPrimitiveSizeInBits will function as it does today,
> except that it will no longer return a size for vector types (it will return 0,
> as it does for other derived types). The majority of calls to this function are
> already for scalar rather than vector types.
>
> For derived types, a function (getSizeExpressionInBits) to return a pair of
> integers (one to indicate unscaled bits, the other for bits that need to be
> scaled by the runtime multiple) will be added. For backends that do not need to
> deal with scalable types, another function (getFixedSizeExpressionInBits) that
> only returns unscaled bits will be provided, with a debug assert that the type
> isn't scalable.
>
> Similar functionality will be added to DataLayout.
>
> Comparing two of these sizes together is straightforward if only unscaled sizes
> are used. Comparisons between scaled sizes is also simple when comparing sizes
> within a function (or across functions with the inherit flag mentioned in the
> changes to the type), but cannot be compared otherwise. If a mix is present,
> then any number of unscaled bits will not be considered to have a greater size
> than a smaller number of scaled bits, but a smaller number of unscaled bits
> will be considered to have a smaller size than a greater number of scaled bits
> (since the runtime multiple is at least one).
>
> Future Work
> -----------
>
> Since we cannot determine the exact size of a scalable vector, the
> existing logic for alias detection won't work when multiple accesses
> share a common base pointer with different offsets.
>
> However, SVE's predication will mean that a dynamic 'safe' vector length
> can be determined at runtime, so after initial support has been added we
> can work on vectorizing loops using runtime predication to avoid aliasing
> problems.
>
> Alternatives Considered
> -----------------------
>
> Marking scalable vectors as unsized doesn't work well, as many parts of
> llvm dealing with loads and stores assert that 'isSized()' returns true
> and make use of the size when calculating offsets.
>
> We have considered introducing multiple helper functions instead of
> using direct size queries, but that doesn't cover all cases. It may
> still be a good idea to introduce them to make the purpose in a given
> case more obvious, e.g. 'isBitCastableTo(Type*,Type*)'.
>
> ========================================
> 3. Representing Vector Length at Runtime
> ========================================
>
> With a scalable vector type defined, we now need a way to represent the runtime
> length in IR in order to generate addresses for consecutive vectors in memory
> and determine how many elements have been processed in an iteration of a loop.
>
> We have added an experimental `vscale` intrinsic to represent the runtime
> multiple. Multiplying the result of this intrinsic by the minimum number of
> elements in a vector gives the total number of elements in a scalable vector.
>
> Fixed-Length Code
> -----------------
>
> Assuming a vector type of <4 x <ty>>
> ``
> vector.body:
>   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>   ;; <loop body>
>   ;; Increment induction var
>   %index.next = add i64 %index, 4
>   ;; <check and branch>
> ``
> Scalable Equivalent
> -------------------
>
> Assuming a vector type of <scalable 4 x <ty>>
> ``
> vector.body:
>   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>   ;; <loop body>
>   ;; Increment induction var
>   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>   %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
>   ;; <check and branch>
> ``
> ===========================
> 4. Generating Vector Values
> ===========================
> For constant vector values, we cannot specify all the elements as we can for
> fixed-length vectors; fortunately only a small number of easily synthesized
> patterns are required for autovectorization. The `zeroinitializer` constant
> can be used in the same manner as fixed-length vectors for a constant zero
> splat. This can then be combined with `insertelement` and `shufflevector`
> to create arbitrary value splats in the same manner as fixed-length vectors.
>
> For constants consisting of a sequence of values, an experimental `stepvector`
> intrinsic has been added to represent a simple constant of the form
> `<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new
> start can be added, and changing the step requires multiplying by a splat.
>
> Fixed-Length Code
> -----------------
> ``
>   ;; Splat a value
>   %insert = insertelement <4 x i32> undef, i32 %value, i32 0
>   %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x i32> zeroinitializer
>   ;; Add a constant sequence
>   %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
> ``
> Scalable Equivalent
> -------------------
> ``
>   ;; Splat a value
>   %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
>   %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>   ;; Splat offset + stride (the same in this case)
>   %insert2 = insertelement <scalable 4 x i32> undef, i32 2, i32 0
>   %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>   ;; Create sequence for scalable vector
>   %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
>   %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
>   %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
>   ;; Add the runtime-generated sequence
>   %add = add <scalable 4 x i32> %splat, %addoffset
> ``
> Future Work
> -----------
>
> Intrinsics cannot currently be used for constant folding. Our downstream
> compiler (using Constants instead of intrinsics) relies quite heavily on this
> for good code generation, so we will need to find new ways to recognize and
> fold these values.
>
> ==================
> 5. Code Generation
> ==================
>
> IR splats will be converted to an experimental splatvector intrinsic in
> SelectionDAGBuilder.
>
> All three intrinsics are custom lowered and legalized in the AArch64 backend.
>
> Two new AArch64ISD nodes have been added to represent the same concepts
> at the SelectionDAG level, while splatvector maps onto the existing
> AArch64ISD::DUP.
>
> GlobalISel
> ----------
>
> Since GlobalISel was enabled by default on AArch64, it was necessary to add
> scalable vector support to the LowLevelType implementation. A single bit was
> added to the raw_data representation for vectors and vectors of pointers.
>
> In addition, types that only exist in destination patterns are planted in
> the enumeration of available types for generated code. While this may not be
> necessary in future, generating an all-true 'ptrue' value was necessary to
> convert a predicated instruction into an unpredicated one.
>
> ==========
> 6. Example
> ==========
>
> The following example shows a simple C loop which assigns the array index to
> the array elements matching that index. The IR shows how vscale and stepvector
> are used to create the needed values and to advance the index variable in the
> loop.
>
> C Code
> ------
>
> ``
> void IdentityArrayInit(int *a, int count) {
>   for (int i = 0; i < count; ++i)
>     a[i] = i;
> }
> ``
>
> Scalable IR Vector Body
> -----------------------
>
> ``
> vector.body.preheader:
>   ;; Other setup
>   ;; Stepvector used to create initial identity vector
>   %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
>   br label %vector.body
>
> vector.body:
>   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>   %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]
>
>            ;; stepvector used for index identity on entry to loop body ;;
>   %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],
>                                      [ %stepvector, %vector.body.preheader ]
>   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>   %vscale32 = trunc i64 %vscale64 to i32
>   %1 = add i64 %0, mul (i64 %vscale64, i64 4)
>
>            ;; vscale splat used to increment identity vector ;;
>   %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32 %vscale32, i32 4), i32 0
>   %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>   %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
>   %2 = getelementptr inbounds i32, i32* %a, i64 %0
>   %3 = bitcast i32* %2 to <scalable 4 x i32>*
>   store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align 4
>
>            ;; vscale used to increment loop index
>   %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
>   %4 = icmp eq i64 %index.next, %n.vec
>   br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
> ``
>
> ==========
> 7. Patches
> ==========
>
> List of patches:
>
> 1. Extend VectorType: https://reviews.llvm.org/D32530
> 2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768
> 3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
> 4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
> 5. SVE Calling Convention: https://reviews.llvm.org/D47771
> 6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
> 7. Add VScale intrinsic: https://reviews.llvm.org/D47773
> 8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
> 9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
> 10. Initial store patterns: https://reviews.llvm.org/D47776
> 11. Initial addition patterns: https://reviews.llvm.org/D47777
> 12. Initial left-shift patterns: https://reviews.llvm.org/D47778
> 13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
> 14. Prevectorized loop unit test: https://reviews.llvm.org/D47780
>

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev
Hi Bruce,

Thanks for your comment.

> Generic code can enquire the size, dynamically allocate space, and
> transparently save and restore the contents of a vector register or
> registers.

I am not sure what the generic code is. It seems to use a similar approach to the implementation of variable-length arrays. If so, I think that could be one way to support it.

Thanks,
JinGu Kang


From: Bruce Hoult <[hidden email]>
Sent: 24 May 2019 20:12
To: JinGu Kang
Cc: Chris Lattner; Hal Finkel; Jones, Joel; [hidden email]; Renato Golin; Kristof Beyls; Amara Emerson; Florian Hahn; Sander De Smalen; Robin Kruppe; [hidden email]; [hidden email]; Sjoerd Meijer; Sam Parker; Graham Hunter; nd
Subject: Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
 
In the RISC-V V extension, there is no upper limit to the size vector
registers can be in a future CPU. (Formally, the upper limit is at
least 2^31 bytes)

Generic code can enquire the size, dynamically allocate space, and
transparently save and restore the contents of a vector register or
registers.

On Fri, May 24, 2019 at 11:28 AM JinGu Kang via llvm-dev
<[hidden email]> wrote:
>
> Hi Graham,
>
> I am working on a custom target and it is considering scalable vector type representation in programming language. While I am collecting the information about it, I have met your RFC. I have a question. I think the one of fundamental issues is that we do not know the memory layout of the type at compile time. I am not sure whether the RFC covers this issue or not. Conservatively, I imagined the memory layout of biggest type which the scalable vector type can support. I could miss some discussions about it. If I missed something, please let me know.
>
> Thanks,
> JinGu Kang
>
> ________________________________
> From: llvm-dev <[hidden email]> on behalf of Graham Hunter via llvm-dev <[hidden email]>
> Sent: 05 June 2018 14:15
> To: Chris Lattner; Hal Finkel; Jones, Joel; [hidden email]; Renato Golin; Kristof Beyls; Amara Emerson; Florian Hahn; Sander De Smalen; Robin Kruppe; [hidden email]; [hidden email]; Sjoerd Meijer; Sam Parker
> Cc: nd
> Subject: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
>
> Hi,
>
> Now that Sander has committed enough MC support for SVE, here's an updated
> RFC for variable length vector support with a set of 14 patches (listed at the end)
> to demonstrate code generation for SVE using the extensions proposed in the RFC.
>
> I have some ideas about how to support RISC-V's upcoming extension alongside
> SVE; I'll send an email with some additional comments on Robin's RFC later.
>
> Feedback and questions welcome.
>
> -Graham
>
> =============================================================
> Supporting SIMD instruction sets with variable vector lengths
> =============================================================
>
> In this RFC we propose extending LLVM IR to support code-generation for variable
> length vector architectures like Arm's SVE or RISC-V's 'V' extension. Our
> approach is backwards compatible and should be as non-intrusive as possible; the
> only change needed in other backends is how size is queried on vector types, and
> it only requires a change in which function is called. We have created a set of
> proof-of-concept patches to represent a simple vectorized loop in IR and
> generate SVE instructions from that IR. These patches (listed in section 7 of
> this rfc) can be found on Phabricator and are intended to illustrate the scope
> of changes required by the general approach described in this RFC.
>
> ==========
> Background
> ==========
>
> *ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension for
> AArch64 which is intended to scale with hardware such that the same binary
> running on a processor with longer vector registers can take advantage of the
> increased compute power without recompilation.
>
> As the vector length is no longer a compile-time known value, the way in which
> the LLVM vectorizer generates code requires modifications such that certain
> values are now runtime evaluated expressions instead of compile-time constants.
>
> Documentation for SVE can be found at
> https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
>
> ========
> Contents
> ========
>
> The rest of this RFC covers the following topics:
>
> 1. Types -- a proposal to extend VectorType to be able to represent vectors that
>    have a length which is a runtime-determined multiple of a known base length.
>
> 2. Size Queries - how to reason about the size of types for which the size isn't
>    fully known at compile time.
>
> 3. Representing the runtime multiple of vector length in IR for use in address
>    calculations and induction variable comparisons.
>
> 4. Generating 'constant' values in IR for vectors with a runtime-determined
>    number of elements.
>
> 5. A brief note on code generation of these new operations for AArch64.
>
> 6. An example of C code and matching IR using the proposed extensions.
>
> 7. A list of patches demonstrating the changes required to emit SVE instructions
>    for a loop that has already been vectorized using the extensions described
>    in this RFC.
>
> ========
> 1. Types
> ========
>
> To represent a vector of unknown length a boolean `Scalable` property has been
> added to the `VectorType` class, which indicates that the number of elements in
> the vector is a runtime-determined integer multiple of the `NumElements` field.
> Most code that deals with vectors doesn't need to know the exact length, but
> does need to know relative lengths -- e.g. get a vector with the same number of
> elements but a different element type, or with half or double the number of
> elements.
>
> In order to allow code to transparently support scalable vectors, we introduce
> an `ElementCount` class with two members:
>
> - `unsigned Min`: the minimum number of elements.
> - `bool Scalable`: is the element count an unknown multiple of `Min`?
>
> For non-scalable vectors (``Scalable=false``) the scale is considered to be
> equal to one and thus `Min` represents the exact number of elements in the
> vector.
>
> The intent for code working with vectors is to use convenience methods and avoid
> directly dealing with the number of elements. If needed, calling
> `getElementCount` on a vector type instead of `getVectorNumElements` can be used
> to obtain the (potentially scalable) number of elements. Overloaded division and
> multiplication operators allow an ElementCount instance to be used in much the
> same manner as an integer for most cases.
>
> This mixture of compile-time and runtime quantities allow us to reason about the
> relationship between different scalable vector types without knowing their
> exact length.
>
> The runtime multiple is not expected to change during program execution for SVE,
> but it is possible. The model of scalable vectors presented in this RFC assumes
> that the multiple will be constant within a function but not necessarily across
> functions. As suggested in the recent RISC-V rfc, a new function attribute to
> inherit the multiple across function calls will allow for function calls with
> vector arguments/return values and inlining/outlining optimizations.
>
> IR Textual Form
> ---------------
>
> The textual form for a scalable vector is:
>
> ``<scalable <n> x <type>>``
>
> where `type` is the scalar type of each element, `n` is the minimum number of
> elements, and the string literal `scalable` indicates that the total number of
> elements is an unknown multiple of `n`; `scalable` is just an arbitrary choice
> for indicating that the vector is scalable, and could be substituted by another.
> For fixed-length vectors, the `scalable` is omitted, so there is no change in
> the format for existing vectors.
>
> Scalable vectors with the same `Min` value have the same number of elements, and
> the same number of bytes if `Min * sizeof(type)` is the same (assuming they are
> used within the same function):
>
> ``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same number of
>   elements.
>
> ``<scalable x 4 x i32>`` and ``<scalable x 8 x i16>`` have the same number of
>   bytes.
>
> IR Bitcode Form
> ---------------
>
> To serialize scalable vectors to bitcode, a new boolean field is added to the
> type record. If the field is not present the type will default to a fixed-length
> vector type, preserving backwards compatibility.
>
> Alternatives Considered
> -----------------------
>
> We did consider one main alternative -- a dedicated target type, like the
> x86_mmx type.
>
> A dedicated target type would either need to extend all existing passes that
> work with vectors to recognize the new type, or to duplicate all that code
> in order to get reasonable code generation and autovectorization.
>
> This hasn't been done for the x86_mmx type, and so it is only capable of
> providing support for C-level intrinsics instead of being used and recognized by
> passes inside LLVM.
>
> Although our current solution will need to change some of the code that creates
> new VectorTypes, much of that code doesn't need to care about whether the types
> are scalable or not -- they can use preexisting methods like
> `getHalfElementsVectorType`. If the code is a little more complex,
> `ElementCount` structs can be used instead of an `unsigned` value to represent
> the number of elements.
>
> ===============
> 2. Size Queries
> ===============
>
> This is a proposal for how to deal with querying the size of scalable types.
> While it has not been implemented in full, the general approach works well
> for calculating offsets into structures with scalable types in a modified
> version of ComputeValueVTs in our downstream compiler.
>
> Current IR types that have a known size all return a single integer constant.
> For scalable types a second integer is needed to indicate the number of bytes
> which need to be scaled by the runtime multiple to obtain the actual length.
>
> For primitive types, getPrimitiveSizeInBits will function as it does today,
> except that it will no longer return a size for vector types (it will return 0,
> as it does for other derived types). The majority of calls to this function are
> already for scalar rather than vector types.
>
> For derived types, a function (getSizeExpressionInBits) to return a pair of
> integers (one to indicate unscaled bits, the other for bits that need to be
> scaled by the runtime multiple) will be added. For backends that do not need to
> deal with scalable types, another function (getFixedSizeExpressionInBits) that
> only returns unscaled bits will be provided, with a debug assert that the type
> isn't scalable.
>
> Similar functionality will be added to DataLayout.
>
> Comparing two of these sizes together is straightforward if only unscaled sizes
> are used. Comparisons between scaled sizes are also simple when comparing sizes
> within a function (or across functions with the inherit flag mentioned in the
> changes to the type), but they cannot be compared otherwise. If a mix is present,
> then any number of unscaled bits will not be considered to have a greater size
> than a smaller number of scaled bits, but a smaller number of unscaled bits
> will be considered to have a smaller size than a greater number of scaled bits
> (since the runtime multiple is at least one).
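The comparison rules above amount to a partial order on (unscaled, scaled) size pairs. A minimal sketch of that ordering (the names `SizeExpr` and `known_le` are ours for illustration, not LLVM API): a size is definitely no larger than another only if the relation holds for every possible runtime multiple `n >= 1`:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SizeExpr:
    unscaled: int  # bits counted once
    scaled: int    # bits multiplied by the runtime multiple (>= 1)

def known_le(a, b):
    """True only if a <= b for every runtime multiple n >= 1.

    a.unscaled + a.scaled * n <= b.unscaled + b.scaled * n must hold
    for all n >= 1; with a.scaled <= b.scaled the worst case is n = 1.
    """
    return a.scaled <= b.scaled and a.unscaled + a.scaled <= b.unscaled + b.scaled

# 64 unscaled bits never exceed 128 scaled bits (multiple is at least one)...
assert known_le(SizeExpr(64, 0), SizeExpr(0, 128))
# ...but scaled and unscaled sizes are otherwise incomparable:
assert not known_le(SizeExpr(0, 128), SizeExpr(256, 0))
assert not known_le(SizeExpr(256, 0), SizeExpr(0, 128))
```

Passes that need a definite answer (e.g. for bitcast legality) would have to treat the incomparable cases conservatively.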
>
> Future Work
> -----------
>
> Since we cannot determine the exact size of a scalable vector, the
> existing logic for alias detection won't work when multiple accesses
> share a common base pointer with different offsets.
>
> However, SVE's predication will mean that a dynamic 'safe' vector length
> can be determined at runtime, so after initial support has been added we
> can work on vectorizing loops using runtime predication to avoid aliasing
> problems.
>
> Alternatives Considered
> -----------------------
>
> Marking scalable vectors as unsized doesn't work well, as many parts of
> LLVM dealing with loads and stores assert that 'isSized()' returns true
> and make use of the size when calculating offsets.
>
> We have considered introducing multiple helper functions instead of
> using direct size queries, but that doesn't cover all cases. It may
> still be a good idea to introduce them to make the purpose in a given
> case more obvious, e.g. 'isBitCastableTo(Type*,Type*)'.
>
> ========================================
> 3. Representing Vector Length at Runtime
> ========================================
>
> With a scalable vector type defined, we now need a way to represent the runtime
> length in IR in order to generate addresses for consecutive vectors in memory
> and determine how many elements have been processed in an iteration of a loop.
>
> We have added an experimental `vscale` intrinsic to represent the runtime
> multiple. Multiplying the result of this intrinsic by the minimum number of
> elements in a vector gives the total number of elements in a scalable vector.
>
> Fixed-Length Code
> -----------------
>
> Assuming a vector type of <4 x <ty>>
> ``
> vector.body:
>   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>   ;; <loop body>
>   ;; Increment induction var
>   %index.next = add i64 %index, 4
>   ;; <check and branch>
> ``
> Scalable Equivalent
> -------------------
>
> Assuming a vector type of <scalable 4 x <ty>>
> ``
> vector.body:
>   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>   ;; <loop body>
>   ;; Increment induction var
>   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>   %step = mul i64 %vscale64, 4
>   %index.next = add i64 %index, %step
>   ;; <check and branch>
> ``
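To make the induction-variable update concrete, here is a small Python model of a strip-mined loop whose step is `vscale * 4`, matching the IR above. This is our own sketch; the doubling loop body and the name `process` are illustrative assumptions, and the scalar tail loop is omitted:

```python
def process(data, vscale):
    """Process data in chunks of vscale * 4 elements, the element count
    of one <scalable 4 x ty> register; vscale is only known at run time."""
    step = vscale * 4                     # elements per vector iteration
    n = len(data) - len(data) % step      # portion handled by the vector loop
    out = []
    index = 0
    while index != n:                     # <check and branch>
        # <loop body>: double each lane of the current vector
        out.extend(x * 2 for x in data[index:index + step])
        index += step                     # %index.next = %index + vscale * 4
    return out, index

# The same code handles any hardware vector length:
data = list(range(32))
assert process(data, 1)[1] == 32   # e.g. 128-bit registers: 4 x i32 lanes
assert process(data, 4)[1] == 32   # e.g. 512-bit registers: 16 x i32 lanes
```

Note that the binary is identical for both cases; only the value returned for vscale differs between machines.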
> ===========================
> 4. Generating Vector Values
> ===========================
> For constant vector values, we cannot specify all the elements as we can for
> fixed-length vectors; fortunately only a small number of easily synthesized
> patterns are required for autovectorization. The `zeroinitializer` constant
> can be used in the same manner as fixed-length vectors for a constant zero
> splat. This can then be combined with `insertelement` and `shufflevector`
> to create arbitrary value splats in the same manner as fixed-length vectors.
>
> For constants consisting of a sequence of values, an experimental `stepvector`
> intrinsic has been added to represent a simple constant of the form
> `<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new
> start can be added, and changing the step requires multiplying by a splat.
>
> Fixed-Length Code
> -----------------
> ``
>   ;; Splat a value
>   %insert = insertelement <4 x i32> undef, i32 %value, i32 0
>   %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x i32> zeroinitializer
>   ;; Add a constant sequence
>   %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
> ``
> Scalable Equivalent
> -------------------
> ``
>   ;; Splat a value
>   %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
>   %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>   ;; Splat offset + stride (the same in this case)
>   %insert2 = insertelement <scalable 4 x i32> undef, i32 2, i32 0
>   %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>   ;; Create sequence for scalable vector
>   %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
>   %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
>   %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
>   ;; Add the runtime-generated sequence
>   %add = add <scalable 4 x i32> %splat, %addoffset
> ``
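The splat and stepvector combination above computes `start + stride * <0, 1, 2, ...>` element-wise. A Python model of that arithmetic (function names are ours for illustration; the element count stands in for `vscale * 4`, known only at run time):

```python
def stepvector(num_elems):
    # llvm.experimental.vector.stepvector: <0, 1, 2, ..., num_elems-1>
    return list(range(num_elems))

def splat(value, num_elems):
    # insertelement into lane 0 + shufflevector with a zeroinitializer mask
    return [value] * num_elems

def sequence(start, stride, num_elems):
    # scale the stepvector by a stride splat, then add a start splat
    scaled = [s * m for s, m in zip(stepvector(num_elems), splat(stride, num_elems))]
    return [o + a for o, a in zip(splat(start, num_elems), scaled)]

# With vscale = 2, <scalable 4 x i32> holds 8 elements; start = stride = 2
# reproduces the fixed-length constant <2, 4, 6, 8> extended to 8 lanes.
assert sequence(2, 2, 8) == [2, 4, 6, 8, 10, 12, 14, 16]
```

Only these synthesizable patterns (zero, splat, arithmetic sequence) are needed by the vectorizer, which is why arbitrary per-element constants are not required.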
> Future Work
> -----------
>
> Intrinsics cannot currently be used for constant folding. Our downstream
> compiler (using Constants instead of intrinsics) relies quite heavily on this
> for good code generation, so we will need to find new ways to recognize and
> fold these values.
>
> ==================
> 5. Code Generation
> ==================
>
> IR splats will be converted to an experimental splatvector intrinsic in
> SelectionDAGBuilder.
>
> All three intrinsics are custom lowered and legalized in the AArch64 backend.
>
> Two new AArch64ISD nodes have been added to represent the same concepts
> at the SelectionDAG level, while splatvector maps onto the existing
> AArch64ISD::DUP.
>
> GlobalISel
> ----------
>
> Since GlobalISel was enabled by default on AArch64, it was necessary to add
> scalable vector support to the LowLevelType implementation. A single bit was
> added to the raw_data representation for vectors and vectors of pointers.
>
> In addition, types that only exist in destination patterns are added to
> the enumeration of available types for generated code. While this may not be
> necessary in the future, generating an all-true 'ptrue' value was necessary to
> convert a predicated instruction into an unpredicated one.
>
> ==========
> 6. Example
> ==========
>
> The following example shows a simple C loop which assigns the array index to
> the array elements matching that index. The IR shows how vscale and stepvector
> are used to create the needed values and to advance the index variable in the
> loop.
>
> C Code
> ------
>
> ``
> void IdentityArrayInit(int *a, int count) {
>   for (int i = 0; i < count; ++i)
>     a[i] = i;
> }
> ``
>
> Scalable IR Vector Body
> -----------------------
>
> ``
> vector.body.preheader:
>   ;; Other setup
>   ;; Stepvector used to create initial identity vector
>   %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
>   br label %vector.body
>
> vector.body:
>   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>   %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]
>
>            ;; stepvector used for index identity on entry to loop body ;;
>   %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],
>                                      [ %stepvector, %vector.body.preheader ]
>   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>   %vscale32 = trunc i64 %vscale64 to i32
>   %step64 = mul i64 %vscale64, 4
>   %1 = add i64 %0, %step64
>
>            ;; vscale splat used to increment identity vector ;;
>   %step32 = mul i32 %vscale32, 4
>   %insert = insertelement <scalable 4 x i32> undef, i32 %step32, i32 0
>   %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>   %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
>   %2 = getelementptr inbounds i32, i32* %a, i64 %0
>   %3 = bitcast i32* %2 to <scalable 4 x i32>*
>   store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align 4
>
>            ;; vscale used to increment loop index
>   %index.next = add i64 %index, %step64
>   %4 = icmp eq i64 %index.next, %n.vec
>   br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
> ``
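The behaviour of the vector body above can be modelled in a few lines of Python. This sketch is ours, not part of the RFC; it stores the identity vector, then advances both the scalar index and the index vector by `vscale * 4`, and omits the scalar tail loop that would handle the remainder:

```python
def identity_array_init(count, vscale):
    """Model of the vectorized IdentityArrayInit loop body."""
    lanes = vscale * 4                 # elements in one <scalable 4 x i32>
    a = [0] * count
    vec_ind = list(range(lanes))       # %stepvector: <0, 1, ..., lanes-1>
    n_vec = count - count % lanes      # %n.vec: elements the vector loop handles
    index = 0
    while index != n_vec:
        a[index:index + lanes] = vec_ind            # store %vec.ind7
        vec_ind = [x + lanes for x in vec_ind]      # %step.add8: add splat(vscale * 4)
        index += lanes                              # %index.next
    return a  # (scalar tail for the remaining count % lanes elements omitted)

# The result is independent of the hardware vector length:
assert identity_array_init(16, 1) == list(range(16))
assert identity_array_init(16, 2) == list(range(16))
```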
>
> ==========
> 7. Patches
> ==========
>
> List of patches:
>
> 1. Extend VectorType: https://reviews.llvm.org/D32530
> 2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768
> 3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
> 4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
> 5. SVE Calling Convention: https://reviews.llvm.org/D47771
> 6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
> 7. Add VScale intrinsic: https://reviews.llvm.org/D47773
> 8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
> 9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
> 10. Initial store patterns: https://reviews.llvm.org/D47776
> 11. Initial addition patterns: https://reviews.llvm.org/D47777
> 12. Initial left-shift patterns: https://reviews.llvm.org/D47778
> 13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
> 14. Prevectorized loop unit test: https://reviews.llvm.org/D47780
>
> _______________________________________________
> LLVM Developers mailing list
> [hidden email]
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

_______________________________________________
LLVM Developers mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev
By "generic code" I mean program instructions that are guaranteed to
run correctly on any RISC-V CPU with the V extension, now or in the
far future.

On Fri, May 24, 2019 at 1:47 PM JinGu Kang <[hidden email]> wrote:

>
> Hi Bruce,
>
> Thanks for your comment.
>
> > Generic code can enquire the size, dynamically allocate space, and
> transparently save and restore the contents of a vector register or
> registers.
>
> I am not sure what generic code means here. It seems to use a similar approach to the implementation of variable-length arrays. If so, I think it could be one way to support it.
>
> Thanks,
> JinGu Kang
>
> ________________________________
> From: Bruce Hoult <[hidden email]>
> Sent: 24 May 2019 20:12
> To: JinGu Kang
> Cc: Chris Lattner; Hal Finkel; Jones, Joel; [hidden email]; Renato Golin; Kristof Beyls; Amara Emerson; Florian Hahn; Sander De Smalen; Robin Kruppe; [hidden email]; [hidden email]; Sjoerd Meijer; Sam Parker; Graham Hunter; nd
> Subject: Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
>
> In the RISC-V V extension, there is no upper limit on the size vector
> registers can be in a future CPU. (Formally, the upper limit is at
> least 2^31 bytes.)
>
> Generic code can enquire the size, dynamically allocate space, and
> transparently save and restore the contents of a vector register or
> registers.
>
> On Fri, May 24, 2019 at 11:28 AM JinGu Kang via llvm-dev
> <[hidden email]> wrote:
> >
> > Hi Graham,
> >
> > I am working on a custom target and it is considering scalable vector type representation in programming language. While I am collecting the information about it, I have met your RFC. I have a question. I think the one of fundamental issues is that we do not know the memory layout of the type at compile time. I am not sure whether the RFC covers this issue or not. Conservatively, I imagined the memory layout of biggest type which the scalable vector type can support. I could miss some discussions about it. If I missed something, please let me know.
> >
> > Thanks,
> > JinGu Kang
> >
> > ________________________________
> > From: llvm-dev <[hidden email]> on behalf of Graham Hunter via llvm-dev <[hidden email]>
> > Sent: 05 June 2018 14:15
> > To: Chris Lattner; Hal Finkel; Jones, Joel; [hidden email]; Renato Golin; Kristof Beyls; Amara Emerson; Florian Hahn; Sander De Smalen; Robin Kruppe; [hidden email]; [hidden email]; Sjoerd Meijer; Sam Parker
> > Cc: nd
> > Subject: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
> >
> > Hi,
> >
> > Now that Sander has committed enough MC support for SVE, here's an updated
> > RFC for variable length vector support with a set of 14 patches (listed at the end)
> > to demonstrate code generation for SVE using the extensions proposed in the RFC.
> >
> > I have some ideas about how to support RISC-V's upcoming extension alongside
> > SVE; I'll send an email with some additional comments on Robin's RFC later.
> >
> > Feedback and questions welcome.
> >
> > -Graham
> >
> > =============================================================
> > Supporting SIMD instruction sets with variable vector lengths
> > =============================================================
> >
> > In this RFC we propose extending LLVM IR to support code-generation for variable
> > length vector architectures like Arm's SVE or RISC-V's 'V' extension. Our
> > approach is backwards compatible and should be as non-intrusive as possible; the
> > only change needed in other backends is how size is queried on vector types, and
> > it only requires a change in which function is called. We have created a set of
> > proof-of-concept patches to represent a simple vectorized loop in IR and
> > generate SVE instructions from that IR. These patches (listed in section 7 of
> > this rfc) can be found on Phabricator and are intended to illustrate the scope
> > of changes required by the general approach described in this RFC.
> >
> > ==========
> > Background
> > ==========
> >
> > *ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension for
> > AArch64 which is intended to scale with hardware such that the same binary
> > running on a processor with longer vector registers can take advantage of the
> > increased compute power without recompilation.
> >
> > As the vector length is no longer a compile-time known value, the way in which
> > the LLVM vectorizer generates code requires modifications such that certain
> > values are now runtime evaluated expressions instead of compile-time constants.
> >
> > Documentation for SVE can be found at
> > https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
> >
> > ========
> > Contents
> > ========
> >
> > The rest of this RFC covers the following topics:
> >
> > 1. Types -- a proposal to extend VectorType to be able to represent vectors that
> >    have a length which is a runtime-determined multiple of a known base length.
> >
> > 2. Size Queries - how to reason about the size of types for which the size isn't
> >    fully known at compile time.
> >
> > 3. Representing the runtime multiple of vector length in IR for use in address
> >    calculations and induction variable comparisons.
> >
> > 4. Generating 'constant' values in IR for vectors with a runtime-determined
> >    number of elements.
> >
> > 5. A brief note on code generation of these new operations for AArch64.
> >
> > 6. An example of C code and matching IR using the proposed extensions.
> >
> > 7. A list of patches demonstrating the changes required to emit SVE instructions
> >    for a loop that has already been vectorized using the extensions described
> >    in this RFC.
> >
> > ========
> > 1. Types
> > ========
> >
> > To represent a vector of unknown length a boolean `Scalable` property has been
> > added to the `VectorType` class, which indicates that the number of elements in
> > the vector is a runtime-determined integer multiple of the `NumElements` field.
> > Most code that deals with vectors doesn't need to know the exact length, but
> > does need to know relative lengths -- e.g. get a vector with the same number of
> > elements but a different element type, or with half or double the number of
> > elements.
> >
> > In order to allow code to transparently support scalable vectors, we introduce
> > an `ElementCount` class with two members:
> >
> > - `unsigned Min`: the minimum number of elements.
> > - `bool Scalable`: is the element count an unknown multiple of `Min`?
> >
> > For non-scalable vectors (``Scalable=false``) the scale is considered to be
> > equal to one and thus `Min` represents the exact number of elements in the
> > vector.
> >
_______________________________________________
LLVM Developers mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev