[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths


[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Tim Northover via llvm-dev
Hi,

Now that Sander has committed enough MC support for SVE, here's an updated
RFC for variable length vector support with a set of 14 patches (listed at the end)
to demonstrate code generation for SVE using the extensions proposed in the RFC.

I have some ideas about how to support RISC-V's upcoming extension alongside
SVE; I'll send an email with some additional comments on Robin's RFC later.

Feedback and questions welcome.

-Graham

=============================================================
Supporting SIMD instruction sets with variable vector lengths
=============================================================

In this RFC we propose extending LLVM IR to support code-generation for variable
length vector architectures like Arm's SVE or RISC-V's 'V' extension. Our
approach is backwards compatible and should be as non-intrusive as possible; the
only change needed in other backends is how size is queried on vector types, and
it only requires a change in which function is called. We have created a set of
proof-of-concept patches to represent a simple vectorized loop in IR and
generate SVE instructions from that IR. These patches (listed in section 7 of
this RFC) can be found on Phabricator and are intended to illustrate the scope
of changes required by the general approach described in this RFC.

==========
Background
==========

*The ARMv8-A Scalable Vector Extension* (SVE) is a new vector ISA extension for
AArch64 which is intended to scale with hardware, such that the same binary
running on a processor with longer vector registers can take advantage of the
increased compute power without recompilation.

As the vector length is no longer a compile-time known value, the way in which
the LLVM vectorizer generates code requires modifications such that certain
values are now runtime evaluated expressions instead of compile-time constants.

Documentation for SVE can be found at
https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a

========
Contents
========

The rest of this RFC covers the following topics:

1. Types -- a proposal to extend VectorType to be able to represent vectors that
   have a length which is a runtime-determined multiple of a known base length.

2. Size Queries -- how to reason about the size of types for which the size
   isn't fully known at compile time.

3. Representing the runtime multiple of vector length in IR for use in address
   calculations and induction variable comparisons.

4. Generating 'constant' values in IR for vectors with a runtime-determined
   number of elements.

5. A brief note on code generation of these new operations for AArch64.

6. An example of C code and matching IR using the proposed extensions.

7. A list of patches demonstrating the changes required to emit SVE instructions
   for a loop that has already been vectorized using the extensions described
   in this RFC.

========
1. Types
========

To represent a vector of unknown length a boolean `Scalable` property has been
added to the `VectorType` class, which indicates that the number of elements in
the vector is a runtime-determined integer multiple of the `NumElements` field.
Most code that deals with vectors doesn't need to know the exact length, but
does need to know relative lengths -- e.g. get a vector with the same number of
elements but a different element type, or with half or double the number of
elements.

In order to allow code to transparently support scalable vectors, we introduce
an `ElementCount` class with two members:

- `unsigned Min`: the minimum number of elements.
- `bool Scalable`: is the element count an unknown multiple of `Min`?

For non-scalable vectors (``Scalable=false``) the scale is considered to be
equal to one and thus `Min` represents the exact number of elements in the
vector.

The intent for code working with vectors is to use convenience methods and avoid
directly dealing with the number of elements. If needed, `getElementCount` can
be called on a vector type instead of `getVectorNumElements` to obtain the
(potentially scalable) number of elements. Overloaded division and
multiplication operators allow an `ElementCount` instance to be used in much the
same manner as an integer for most cases.

This mixture of compile-time and runtime quantities allows us to reason about the
relationship between different scalable vector types without knowing their
exact length.

The runtime multiple is not expected to change during program execution for SVE,
but it is possible. The model of scalable vectors presented in this RFC assumes
that the multiple will be constant within a function but not necessarily across
functions. As suggested in the recent RISC-V RFC, a new function attribute to
inherit the multiple across function calls will allow for function calls with
vector arguments/return values and inlining/outlining optimizations.

IR Textual Form
---------------

The textual form for a scalable vector is:

``<scalable <n> x <type>>``

where `type` is the scalar type of each element, `n` is the minimum number of
elements, and the string literal `scalable` indicates that the total number of
elements is an unknown multiple of `n`; `scalable` is just an arbitrary choice
for indicating that the vector is scalable, and could be substituted with
another keyword. For fixed-length vectors, `scalable` is omitted, so there is no
change in the format for existing vectors.

Scalable vectors with the same `Min` value have the same number of elements, and
the same number of bytes if `Min * sizeof(type)` is the same (assuming they are
used within the same function):

``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same number of
  elements.

``<scalable x 4 x i32>`` and ``<scalable x 8 x i16>`` have the same number of
  bytes.

IR Bitcode Form
---------------

To serialize scalable vectors to bitcode, a new boolean field is added to the
type record. If the field is not present the type will default to a fixed-length
vector type, preserving backwards compatibility.

Alternatives Considered
-----------------------

We did consider one main alternative -- a dedicated target type, like the
x86_mmx type.

A dedicated target type would either need to extend all existing passes that
work with vectors to recognize the new type, or to duplicate all that code
in order to get reasonable code generation and autovectorization.

This hasn't been done for the x86_mmx type, and so it is only capable of
providing support for C-level intrinsics instead of being used and recognized by
passes inside LLVM.

Although our current solution will need to change some of the code that creates
new VectorTypes, much of that code doesn't need to care about whether the types
are scalable or not -- it can use preexisting methods like
`getHalfElementsVectorType`. Where the code is a little more complex,
`ElementCount` structs can be used instead of an `unsigned` value to represent
the number of elements.

===============
2. Size Queries
===============

This is a proposal for how to deal with querying the size of scalable types.
While it has not been implemented in full, the general approach works well
for calculating offsets into structures with scalable types in a modified
version of ComputeValueVTs in our downstream compiler.

Current IR types that have a known size all return a single integer constant
from a size query. For scalable types, a second integer is needed to indicate
the number of bits which must be scaled by the runtime multiple to obtain the
actual size.

For primitive types, getPrimitiveSizeInBits will function as it does today,
except that it will no longer return a size for vector types (it will return 0,
as it does for other derived types). The majority of calls to this function are
already for scalar rather than vector types.

For derived types, a function (getSizeExpressionInBits) to return a pair of
integers (one to indicate unscaled bits, the other for bits that need to be
scaled by the runtime multiple) will be added. For backends that do not need to
deal with scalable types, another function (getFixedSizeExpressionInBits) that
only returns unscaled bits will be provided, with a debug assert that the type
isn't scalable.

Similar functionality will be added to DataLayout.

Comparing two of these sizes together is straightforward if only unscaled sizes
are used. Comparisons between scaled sizes are also simple when comparing sizes
within a function (or across functions with the inherit flag mentioned in the
changes to the type), but scaled sizes from different functions cannot be
compared otherwise. If a mix is present, then any number of unscaled bits will
not be considered to have a greater size than a smaller number of scaled bits,
but a smaller number of unscaled bits will be considered to have a smaller size
than a greater number of scaled bits (since the runtime multiple is at least
one).

Future Work
-----------

Since we cannot determine the exact size of a scalable vector, the
existing logic for alias detection won't work when multiple accesses
share a common base pointer with different offsets.

However, SVE's predication will mean that a dynamic 'safe' vector length
can be determined at runtime, so after initial support has been added we
can work on vectorizing loops using runtime predication to avoid aliasing
problems.

Alternatives Considered
-----------------------

Marking scalable vectors as unsized doesn't work well, as many parts of
LLVM that deal with loads and stores assert that `isSized()` returns true
and make use of the size when calculating offsets.

We have considered introducing multiple helper functions instead of
using direct size queries, but that doesn't cover all cases. It may
still be a good idea to introduce them to make the purpose in a given
case more obvious, e.g. 'isBitCastableTo(Type*,Type*)'.

========================================
3. Representing Vector Length at Runtime
========================================

With a scalable vector type defined, we now need a way to represent the runtime
length in IR in order to generate addresses for consecutive vectors in memory
and determine how many elements have been processed in an iteration of a loop.

We have added an experimental `vscale` intrinsic to represent the runtime
multiple. Multiplying the result of this intrinsic by the minimum number of
elements in a vector gives the total number of elements in a scalable vector.

Fixed-Length Code
-----------------

Assuming a vector type of <4 x <ty>>
``
vector.body:
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  ;; <loop body>
  ;; Increment induction var
  %index.next = add i64 %index, 4
  ;; <check and branch>
``
Scalable Equivalent
-------------------

Assuming a vector type of <scalable 4 x <ty>>
``
vector.body:
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  ;; <loop body>
  ;; Increment induction var
  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
  %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
  ;; <check and branch>
``
===========================
4. Generating Vector Values
===========================
For constant vector values, we cannot specify all the elements as we can for
fixed-length vectors; fortunately only a small number of easily synthesized
patterns are required for autovectorization. The `zeroinitializer` constant
can be used in the same manner as fixed-length vectors for a constant zero
splat. This can then be combined with `insertelement` and `shufflevector`
to create arbitrary value splats in the same manner as fixed-length vectors.

For constants consisting of a sequence of values, an experimental `stepvector`
intrinsic has been added to represent a simple constant of the form
`<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new
start can be added, and changing the step requires multiplying by a splat.

Fixed-Length Code
-----------------
``
  ;; Splat a value
  %insert = insertelement <4 x i32> undef, i32 %value, i32 0
  %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x i32> zeroinitializer
  ;; Add a constant sequence
  %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
``
Scalable Equivalent
-------------------
``
  ;; Splat a value
  %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
  %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
  ;; Splat offset + stride (the same in this case)
  %insert2 = insertelement <scalable 4 x i32> undef, i32 2, i32 0
  %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
  ;; Create sequence for scalable vector
  %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
  %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
  %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
  ;; Add the runtime-generated sequence
  %add = add <scalable 4 x i32> %splat, %addoffset
``
Future Work
-----------

Intrinsics cannot currently be used for constant folding. Our downstream
compiler (using Constants instead of intrinsics) relies quite heavily on this
for good code generation, so we will need to find new ways to recognize and
fold these values.

==================
5. Code Generation
==================

IR splats will be converted to an experimental splatvector intrinsic in
SelectionDAGBuilder.

All three intrinsics are custom lowered and legalized in the AArch64 backend.

Two new AArch64ISD nodes have been added to represent the same concepts
at the SelectionDAG level, while splatvector maps onto the existing
AArch64ISD::DUP.

GlobalISel
----------

Since GlobalISel was enabled by default on AArch64, it was necessary to add
scalable vector support to the LowLevelType implementation. A single bit was
added to the raw_data representation for vectors and vectors of pointers.

In addition, types that only exist in destination patterns are planted in
the enumeration of available types for generated code. While this may not be
necessary in future, generating an all-true 'ptrue' value was necessary to
convert a predicated instruction into an unpredicated one.

==========
6. Example
==========

The following example shows a simple C loop which assigns the array index to
the array elements matching that index. The IR shows how vscale and stepvector
are used to create the needed values and to advance the index variable in the
loop.

C Code
------

``
void IdentityArrayInit(int *a, int count) {
  for (int i = 0; i < count; ++i)
    a[i] = i;
}
``

Scalable IR Vector Body
-----------------------

``
vector.body.preheader:
  ;; Other setup
  ;; Stepvector used to create initial identity vector
  %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
  br label %vector.body

vector.body:
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]

           ;; stepvector used for index identity on entry to loop body ;;
  %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],
                                     [ %stepvector, %vector.body.preheader ]
  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
  %vscale32 = trunc i64 %vscale64 to i32
  %1 = add i64 %0, mul (i64 %vscale64, i64 4)

           ;; vscale splat used to increment identity vector ;;
  %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32 %vscale32, i32 4), i32 0
  %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
  %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
  %2 = getelementptr inbounds i32, i32* %a, i64 %0
  %3 = bitcast i32* %2 to <scalable 4 x i32>*
  store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align 4

           ;; vscale used to increment loop index
  %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
  %4 = icmp eq i64 %index.next, %n.vec
  br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
``

==========
7. Patches
==========

List of patches:

1. Extend VectorType: https://reviews.llvm.org/D32530
2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768
3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
5. SVE Calling Convention: https://reviews.llvm.org/D47771
6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
7. Add VScale intrinsic: https://reviews.llvm.org/D47773
8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
10. Initial store patterns: https://reviews.llvm.org/D47776
11. Initial addition patterns: https://reviews.llvm.org/D47777
12. Initial left-shift patterns: https://reviews.llvm.org/D47778
13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
14. Prevectorized loop unit test: https://reviews.llvm.org/D47780

_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Tim Northover via llvm-dev
Hi Graham,

Just a few initial comments.

Graham Hunter <[hidden email]> writes:

> ``<scalable x 4 x i32>`` and ``<scalable x 8 x i16>`` have the same number of
>   bytes.

"scalable" instead of "scalable x."

> For derived types, a function (getSizeExpressionInBits) to return a pair of
> integers (one to indicate unscaled bits, the other for bits that need to be
> scaled by the runtime multiple) will be added. For backends that do not need to
> deal with scalable types, another function (getFixedSizeExpressionInBits) that
> only returns unscaled bits will be provided, with a debug assert that the type
> isn't scalable.

Can you explain a bit about what the two integers represent?  What's the
"unscaled" part for?

The name "getSizeExpressionInBits" makes me think that a Value
expression will be returned (something like a ConstantExpr that uses
vscale).  I would be surprised to get a pair of integers back.  Do
clients actually need constant integer values or would a ConstantExpr
sufffice?  We could add a ConstantVScale or something to make it work.

> Comparing two of these sizes together is straightforward if only unscaled sizes
> are used. Comparisons between scaled sizes is also simple when comparing sizes
> within a function (or across functions with the inherit flag mentioned in the
> changes to the type), but cannot be compared otherwise. If a mix is present,
> then any number of unscaled bits will not be considered to have a greater size
> than a smaller number of scaled bits, but a smaller number of unscaled bits
> will be considered to have a smaller size than a greater number of scaled bits
> (since the runtime multiple is at least one).

If we went the ConstantExpr route and added ConstantExpr support to
ScalarEvolution, then SCEVs could be compared to do this size
comparison.  We have code here that adds ConstantExpr support to
ScalarEvolution.  We just didn't know if anyone else would be interested
in it since we added it solely for our Fortran frontend.

> We have added an experimental `vscale` intrinsic to represent the runtime
> multiple. Multiplying the result of this intrinsic by the minimum number of
> elements in a vector gives the total number of elements in a scalable vector.

I think this may be a case where adding a full-fledged Instruction might
be worthwhile.  Because vscale is intimately tied to addressing, it
seems like things such as ScalarEvolution support will be important.  I
don't know what's involved in making intrinsics work with
ScalarEvolution but it seems strangely odd that a key component of IR
computation would live outside the IR proper, in the sense that all
other fundamental addressing operations are Instructions.

> For constants consisting of a sequence of values, an experimental `stepvector`
> intrinsic has been added to represent a simple constant of the form
> `<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new
> start can be added, and changing the step requires multiplying by a splat.

This is another case where an Instruction might be better, for the same
reasons as with vscale.

Also, "iota" is the name Cray has traditionally used for this operation
as it is the mathematical name for the concept.  It's also used by C++
and go and so should be familiar to many people.

> Future Work
> -----------
>
> Intrinsics cannot currently be used for constant folding. Our downstream
> compiler (using Constants instead of intrinsics) relies quite heavily on this
> for good code generation, so we will need to find new ways to recognize and
> fold these values.

As above, we could add ConstantVScale and also ConstantStepVector (or
ConstantIota).  They won't fold to compile-time values but the
expressions could be simplified.  I haven't really thought through the
implications of this, just brainstorming ideas.  What does your
downstream compiler require in terms of constant support?  What kinds of
queries does it need to do?

                         -David

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Tim Northover via llvm-dev
On 5 June 2018 at 16:23,  <[hidden email]> wrote:
> The name "getSizeExpressionInBits" makes me think that a Value
> expression will be returned (something like a ConstantExpr that uses
> vscale).  I would be surprised to get a pair of integers back.

Same here.


> If we went the ConstantExpr route and added ConstantExpr support to
> ScalarEvolution, then SCEVs could be compared to do this size
> comparison.

This sounds like a cleaner solution.


> I think this may be a case where added a full-fledged Instruction might
> be worthwhile.  Because vscale is intimately tied to addressing, it
> seems like things such as ScalarEvolution support will be important.  I
> don't know what's involved in making intrinsics work with
> ScalarEvolution but it seems strangely odd that a key component of IR
> computation would live outside the IR proper, in the sense that all
> other fundamental addressing operations are Instructions.
...
> This is another case where an Instruction might be better, for the same
> reasons as with vscale.

This is a side-effect of the original RFC a few years ago. The general
feeling was that we can start with intrinsics and, if they work, we
change the IR.

We can have a work-in-progress implementation before fully committing
SCEV and other more core ideas in, and then slowly, and with more
certainty, move where it makes more sense.


> Also, "iota" is the name Cray has traditionally used for this operation
> as it is the mathematical name for the concept.  It's also used by C++
> and go and so should be familiar to many people.

That sounds better, but stepvector is more than C++'s iota and it's
just a plain scalar evolution sequence like {start, op, step}. In
C++'s iota (AFAICS), the step is always 1.

Anyway, I don't mind any name, really. Whatever is more mnemonic.


>  ;; Create sequence for scalable vector
>  %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
>  %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
>  %addoffset = add <scalable 4 x i32> %mulbystride, %str_off

Once stepvector (or iota) becomes a proper IR instruction, I'd like to
see this restricted to inlined syntax. The sequence { step*vec +
offset } only makes sense in the scalable context and the intermediate
results should not be used elsewhere.


--
cheers,
--renato

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Tim Northover via llvm-dev
Renato Golin <[hidden email]> writes:

>> This is another case where an Instruction might be better, for the same
>> reasons as with vscale.
>
> This is a side-effect of the original RFC a few years ago. The general
> feeling was that we can start with intrinsics and, if they work, we
> change the IR.
>
> We can have a work-in-progress implementation before fully committing
> SCEV and other more core ideas in, and then slowly, and with more
> certainty, move where it makes more sense.

Ok, that makes sense.  I do think the goal should be making these proper
Instructions.

>> Also, "iota" is the name Cray has traditionally used for this operation
>> as it is the mathematical name for the concept.  It's also used by C++
>> and go and so should be familiar to many people.
>
> That sounds better, but stepvector is more than C++'s iota and it's
> just a plain scalar evolution sequence like {start, op, step}. In
> C++'s iota (AFAICS), the step is always 1.

I thought stepvector was also always step one, as Graham states a
multiply by a constant splat must be used for other step values.

> Anyway, I don't mind any name, really. Whatever is more mnemonic.

Me neither.  I was simply noting some of the history surrounding the
operation and naming in other familiar places.

>>  ;; Create sequence for scalable vector
>>  %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
>>  %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
>>  %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
>
> Once stepvetor (or iota) becomes a proper IR instruction, I'd like to
> see this restricted to inlined syntax. The sequence { step*vec +
> offset } only makes sense in the scalable context and the intermediate
> results should not be used elsewhere.

I'm not so sure.  iota is a generally useful operation and scaling it to
various step values is also useful.  It's used often for strided memory
access, which would be done via gather/scatter in LLVM but generating a
vector GEP via stepvector would be convenient and convey more semantic
information than, say, loading a constant vector of indices to feed the
GEP.

                               -David

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Tim Northover via llvm-dev
In reply to this post by Tim Northover via llvm-dev
Hi David,

Thanks for taking a look.

> On 5 Jun 2018, at 16:23, [hidden email] wrote:
>
> Hi Graham,
>
> Just a few initial comments.
>
> Graham Hunter <[hidden email]> writes:
>
>> ``<scalable x 4 x i32>`` and ``<scalable x 8 x i16>`` have the same number of
>>  bytes.
>
> "scalable" instead of "scalable x."

Yep, missed that in the conversion from the old <n x m x ty> format.

>
>> For derived types, a function (getSizeExpressionInBits) to return a pair of
>> integers (one to indicate unscaled bits, the other for bits that need to be
>> scaled by the runtime multiple) will be added. For backends that do not need to
>> deal with scalable types, another function (getFixedSizeExpressionInBits) that
>> only returns unscaled bits will be provided, with a debug assert that the type
>> isn't scalable.
>
> Can you explain a bit about what the two integers represent?  What's the
> "unscaled" part for?

'Unscaled' just means 'exactly this many bits', whereas 'scaled' is 'this many bits
multiplied by vscale'.

>
> The name "getSizeExpressionInBits" makes me think that a Value
> expression will be returned (something like a ConstantExpr that uses
> vscale).  I would be surprised to get a pair of integers back.  Do
> clients actually need constant integer values or would a ConstantExpr
> sufffice?  We could add a ConstantVScale or something to make it work.

I agree the name is not ideal and I'm open to suggestions -- I was thinking of the two
integers representing the known-at-compile-time terms in an expression:
'(scaled_bits * vscale) + unscaled_bits'.

Assuming the pair is of the form (unscaled, scaled), then for a type with a size known at
compile time like <4 x i32> the size would be (128, 0).

For a scalable type like <scalable 4 x i32> the size would be (0, 128).

For a struct with, say, a <scalable 32 x i8> and an i64, it would be (64, 256).

When calculating the offset for memory addresses, you just need to multiply the scaled
part by vscale and add the unscaled as is.

>
>> Comparing two of these sizes together is straightforward if only unscaled sizes
>> are used. Comparisons between scaled sizes is also simple when comparing sizes
>> within a function (or across functions with the inherit flag mentioned in the
>> changes to the type), but cannot be compared otherwise. If a mix is present,
>> then any number of unscaled bits will not be considered to have a greater size
>> than a smaller number of scaled bits, but a smaller number of unscaled bits
>> will be considered to have a smaller size than a greater number of scaled bits
>> (since the runtime multiple is at least one).
>
> If we went the ConstantExpr route and added ConstantExpr support to
> ScalarEvolution, then SCEVs could be compared to do this size
> comparison.  We have code here that adds ConstantExpr support to
> ScalarEvolution.  We just didn't know if anyone else would be interested
> in it since we added it solely for our Fortran frontend.

We added a dedicated SCEV expression class for vscale instead; I suspect it works
either way.

>
>> We have added an experimental `vscale` intrinsic to represent the runtime
>> multiple. Multiplying the result of this intrinsic by the minimum number of
>> elements in a vector gives the total number of elements in a scalable vector.
>
> I think this may be a case where added a full-fledged Instruction might
> be worthwhile.  Because vscale is intimately tied to addressing, it
> seems like things such as ScalarEvolution support will be important.  I
> don't know what's involved in making intrinsics work with
> ScalarEvolution but it seems strangely odd that a key component of IR
> computation would live outside the IR proper, in the sense that all
> other fundamental addressing operations are Instructions.

We've tried it as both an instruction and as a 'Constant', and both work fine with
ScalarEvolution. I have not yet tried it with the intrinsic.

>
>> For constants consisting of a sequence of values, an experimental `stepvector`
>> intrinsic has been added to represent a simple constant of the form
>> `<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new
>> start can be added, and changing the step requires multiplying by a splat.
>
> This is another case where an Instruction might be better, for the same
> reasons as with vscale.
>
> Also, "iota" is the name Cray has traditionally used for this operation
> as it is the mathematical name for the concept.  It's also used by C++
> and go and so should be familiar to many people.

Iota would be fine with me; I forget the reason we didn't go with that initially. We
also had 'series_vector' in the past, but that was a more generic form with start
and step parameters instead of requiring additional IR instructions to multiply and
add for the result as we do for stepvector.
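To make the semantics concrete, here's a rough C++ model (illustrative only, not LLVM code) of what stepvector produces and how the old series_vector form reduces to it plus one splat-multiply and one splat-add:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Model of the proposed 'stepvector' intrinsic: always <0, 1, ..., N-1>.
std::vector<int64_t> stepvector(unsigned NumElems) {
  std::vector<int64_t> V(NumElems);
  for (unsigned I = 0; I < NumElems; ++I)
    V[I] = I;
  return V;
}

// The old 'series_vector(Start, Step)' form, expressed via stepvector:
// splat(Start) + splat(Step) * stepvector.
std::vector<int64_t> seriesVector(int64_t Start, int64_t Step,
                                  unsigned NumElems) {
  std::vector<int64_t> V = stepvector(NumElems);
  for (int64_t &E : V)
    E = Start + Step * E; // the splat-add and splat-multiply, folded
  return V;
}
```

So seriesVector(10, 3, 4) gives <10, 13, 16, 19> via exactly the extra multiply and add described above.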

>
>> Future Work
>> -----------
>>
>> Intrinsics cannot currently be used for constant folding. Our downstream
>> compiler (using Constants instead of intrinsics) relies quite heavily on this
>> for good code generation, so we will need to find new ways to recognize and
>> fold these values.
>
> As above, we could add ConstantVScale and also ConstantStepVector (or
> ConstantIota).  They won't fold to compile-time values but the
> expressions could be simplified.  I haven't really thought through the
> implications of this, just brainstorming ideas.  What does your
> downstream compiler require in terms of constant support?  What kinds of
> queries does it need to do?

It makes things a little easier to pattern match (just looking for a constant to start
instead of having to match multiple different forms of vscale or stepvector multiplied
and/or added in each place you're looking for them).

The bigger reason we currently depend on them being constant is that code generation
generally looks at a single block at a time, and there are several expressions using
vscale that we don't want to be generated in one block and passed around in a register,
since many of the load/store addressing forms for instructions will already scale properly.

We've done this downstream by having them be Constants, but if there's a good way
of doing them with intrinsics we'd be fine with that too.

-Graham


_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Tim Northover via llvm-dev
In reply to this post by Tim Northover via llvm-dev
On 5 June 2018 at 18:38,  <[hidden email]> wrote:
> I thought stepvector was also always step one, as Graham states a
> multiply by a constant splat must be used for other step values.

You're right! Sorry, it's been a while. Stepvector is a simple iota;
the multiplier and offset come from the operations.


> I'm not so sure.  iota is a generally useful operation and scaling it to
> various step values is also useful.  It's used often for strided memory
> access, which would be done via gather/scatter in LLVM but generating a
> vector GEP via stepvector would be convenient and convey more semantic
> information than, say, loading a constant vector of indices to feed the
> GEP.

My point is that those patterns will be generated by C-level
intrinsics or IR optimisation passes (like vectorisers), so they have
a specific meaning in that context.

What I fear is that some other pass like CSE will find those patterns
and common them up at the top of the function/BB, so the back-end
loses sight of what they were and can't generate the step-increment
instruction in the end.

--
cheers,
--renato

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Tim Northover via llvm-dev
In reply to this post by Tim Northover via llvm-dev


> On Jun 5, 2018, at 11:25 AM, Graham Hunter <[hidden email]> wrote:
>
> Hi David,
>
> Thanks for taking a look.
>
>> On 5 Jun 2018, at 16:23, [hidden email] wrote:
>>
>> Hi Graham,
>>
>> Just a few initial comments.
>>
>> Graham Hunter <[hidden email]> writes:
>>
>>> ``<scalable x 4 x i32>`` and ``<scalable x 8 x i16>`` have the same number of
>>> bytes.
>>
>> "scalable" instead of "scalable x."
>
> Yep, missed that in the conversion from the old <n x m x ty> format.
>
>>
>>> For derived types, a function (getSizeExpressionInBits) to return a pair of
>>> integers (one to indicate unscaled bits, the other for bits that need to be
>>> scaled by the runtime multiple) will be added. For backends that do not need to
>>> deal with scalable types, another function (getFixedSizeExpressionInBits) that
>>> only returns unscaled bits will be provided, with a debug assert that the type
>>> isn't scalable.
>>
>> Can you explain a bit about what the two integers represent?  What's the
>> "unscaled" part for?
>
> 'Unscaled' just means 'exactly this many bits', whereas 'scaled' is 'this many bits
> multiplied by vscale'.
>
>>
>> The name "getSizeExpressionInBits" makes me think that a Value
>> expression will be returned (something like a ConstantExpr that uses
>> vscale).  I would be surprised to get a pair of integers back.  Do
>> clients actually need constant integer values or would a ConstantExpr
>> suffice?  We could add a ConstantVScale or something to make it work.
>
> I agree the name is not ideal and I'm open to suggestions -- I was thinking of the two
> integers representing the known-at-compile-time terms in an expression:
> '(scaled_bits * vscale) + unscaled_bits'.
>
> Assuming the pair is of the form (unscaled, scaled), then for a type with a size known at
> compile time like <4 x i32> the size would be (128, 0).
>
> For a scalable type like <scalable 4 x i32> the size would be (0, 128).
>
> For a struct with, say, a <scalable 32 x i8> and an i64, it would be (64, 256).
>
> When calculating the offset for memory addresses, you just need to multiply the scaled
> part by vscale and add the unscaled as is.
>
>>
>>> Comparing two of these sizes together is straightforward if only unscaled sizes
>>> are used. Comparisons between scaled sizes is also simple when comparing sizes
>>> within a function (or across functions with the inherit flag mentioned in the
>>> changes to the type), but cannot be compared otherwise. If a mix is present,
>>> then any number of unscaled bits will not be considered to have a greater size
>>> than a smaller number of scaled bits, but a smaller number of unscaled bits
>>> will be considered to have a smaller size than a greater number of scaled bits
>>> (since the runtime multiple is at least one).
>>
>> If we went the ConstantExpr route and added ConstantExpr support to
>> ScalarEvolution, then SCEVs could be compared to do this size
>> comparison.  We have code here that adds ConstantExpr support to
>> ScalarEvolution.  We just didn't know if anyone else would be interested
>> in it since we added it solely for our Fortran frontend.
>
> We added a dedicated SCEV expression class for vscale instead; I suspect it works
> either way.
>
>>
>>> We have added an experimental `vscale` intrinsic to represent the runtime
>>> multiple. Multiplying the result of this intrinsic by the minimum number of
>>> elements in a vector gives the total number of elements in a scalable vector.
>>
>> I think this may be a case where adding a full-fledged Instruction might
>> be worthwhile.  Because vscale is intimately tied to addressing, it
>> seems like things such as ScalarEvolution support will be important.  I
>> don't know what's involved in making intrinsics work with
>> ScalarEvolution but it seems strangely odd that a key component of IR
>> computation would live outside the IR proper, in the sense that all
>> other fundamental addressing operations are Instructions.
>
> We've tried it as both an instruction and as a 'Constant', and both work fine with
> ScalarEvolution. I have not yet tried it with the intrinsic.
+CC Sanjoy to confirm: I think intrinsics should be fine to add support for in SCEV.

>
>>
>>> For constants consisting of a sequence of values, an experimental `stepvector`
>>> intrinsic has been added to represent a simple constant of the form
>>> `<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new
>>> start can be added, and changing the step requires multiplying by a splat.
>>
>> This is another case where an Instruction might be better, for the same
>> reasons as with vscale.
>>
>> Also, "iota" is the name Cray has traditionally used for this operation
>> as it is the mathematical name for the concept.  It's also used by C++
>> and go and so should be familiar to many people.
>
> Iota would be fine with me; I forget the reason we didn't go with that initially. We
> also had 'series_vector' in the past, but that was a more generic form with start
> and step parameters instead of requiring additional IR instructions to multiply and
> add for the result as we do for stepvector.
>
>>
>>> Future Work
>>> -----------
>>>
>>> Intrinsics cannot currently be used for constant folding. Our downstream
>>> compiler (using Constants instead of intrinsics) relies quite heavily on this
>>> for good code generation, so we will need to find new ways to recognize and
>>> fold these values.
>>
>> As above, we could add ConstantVScale and also ConstantStepVector (or
>> ConstantIota).  They won't fold to compile-time values but the
>> expressions could be simplified.  I haven't really thought through the
>> implications of this, just brainstorming ideas.  What does your
>> downstream compiler require in terms of constant support?  What kinds of
>> queries does it need to do?
>
> It makes things a little easier to pattern match (just looking for a constant to start
> instead of having to match multiple different forms of vscale or stepvector multiplied
> and/or added in each place you're looking for them).
>
> The bigger reason we currently depend on them being constant is that code generation
> generally looks at a single block at a time, and there are several expressions using
> vscale that we don't want to be generated in one block and passed around in a register,
> since many of the load/store addressing forms for instructions will already scale properly.
>
> We've done this downstream by having them be Constants, but if there's a good way
> of doing them with intrinsics we'd be fine with that too.
>
> -Graham
>
>


Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Tim Northover via llvm-dev
In reply to this post by Tim Northover via llvm-dev
Graham Hunter <[hidden email]> writes:

>> Can you explain a bit about what the two integers represent?  What's the
>> "unscaled" part for?
>
> 'Unscaled' just means 'exactly this many bits', whereas 'scaled' is 'this many bits
> multiplied by vscale'.

Right, but what do they represent?  If I have <scalable 4 x i32> is "32"
"unscaled" and "4" "scaled?"  Or is "128" "scaled?"  Or something else?

I see you answered this below.

>> The name "getSizeExpressionInBits" makes me think that a Value
>> expression will be returned (something like a ConstantExpr that uses
>> vscale).  I would be surprised to get a pair of integers back.  Do
>> clients actually need constant integer values or would a ConstantExpr
>> suffice?  We could add a ConstantVScale or something to make it work.
>
> I agree the name is not ideal and I'm open to suggestions -- I was thinking of the two
> integers representing the known-at-compile-time terms in an expression:
> '(scaled_bits * vscale) + unscaled_bits'.
>
> Assuming the pair is of the form (unscaled, scaled), then for a type with a size known at
> compile time like <4 x i32> the size would be (128, 0).
>
> For a scalable type like <scalable 4 x i32> the size would be (0, 128).
>
> For a struct with, say, a <scalable 32 x i8> and an i64, it would be (64, 256).
>
> When calculating the offset for memory addresses, you just need to multiply the scaled
> part by vscale and add the unscaled as is.

Ok, now I understand what you're getting at.  A ConstantExpr would
encapsulate this computation.  We already have "non-static-constant"
values for ConstantExpr like sizeof and offsetof.  I would see
VScaleConstant in that same tradition.  In your struct example,
getSizeExpressionInBits would return:

add(mul(256, vscale), 64)

Does that satisfy your needs?

Is there anything about vscale or a scalable vector that requires a
minimum bit width?  For example, is this legal?

<scalable 1 x double>

I know it won't map to an SVE type.  I'm simply curious because
traditionally Cray machines defined vectors in terms of
machine-dependent "maxvl" with an element type, so with the above vscale
would == maxvl.  Not that we make any such things anymore.  But maybe
someone else does?

>> If we went the ConstantExpr route and added ConstantExpr support to
>> ScalarEvolution, then SCEVs could be compared to do this size
>> comparison.  We have code here that adds ConstantExpr support to
>> ScalarEvolution.  We just didn't know if anyone else would be interested
>> in it since we added it solely for our Fortran frontend.
>
> We added a dedicated SCEV expression class for vscale instead; I suspect it works
> either way.

Yes, that's probably true.  A vscale SCEV is less invasive.

> We've tried it as both an instruction and as a 'Constant', and both work fine with
> ScalarEvolution. I have not yet tried it with the intrinsic.

vscale as a Constant is interesting.  It's a target-dependent Constant
like sizeof and offsetof.  It doesn't have a statically known value and
maybe isn't "constant" across functions.  So it's a strange kind of
constant.

Ultimately whatever is easier for LLVM to analyze in the long run is
best.  Intrinsics often block optimization.  I don't know whether vscale
would be "easier" as a Constant or an Instruction.

>> As above, we could add ConstantVScale and also ConstantStepVector (or
>> ConstantIota).  They won't fold to compile-time values but the
>> expressions could be simplified.  I haven't really thought through the
>> implications of this, just brainstorming ideas.  What does your
>> downstream compiler require in terms of constant support?  What kinds of
>> queries does it need to do?
>
> It makes things a little easier to pattern match (just looking for a constant to start
> instead of having to match multiple different forms of vscale or stepvector multiplied
> and/or added in each place you're looking for them).

Ok.  Normalization could help with this but I certainly understand the
issue.

> The bigger reason we currently depend on them being constant is that code generation
> generally looks at a single block at a time, and there are several expressions using
> vscale that we don't want to be generated in one block and passed around in a register,
> since many of the load/store addressing forms for instructions will already scale properly.

This is kind of like X86 memop folding.  If a load has multiple uses, it
won't be folded, on the theory that one load is better than many folded
loads.  If a load has exactly one use, it will fold.  There's explicit
predicate code in the X86 backend to enforce this requirement.  I
suspect if the X86 backend tried to fold a single load into multiple
places, Bad Things would happen (needed SDNodes might disappear, etc.).

Codegen probably doesn't understand non-statically-constant
ConstantExprs, since sizeof or offsetof can be resolved by the target
before instruction selection.

> We've done this downstream by having them be Constants, but if there's a good way
> of doing them with intrinsics we'd be fine with that too.

If vscale/stepvector as Constants works, it seems fine to me.

                               -David

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Tim Northover via llvm-dev
In reply to this post by Tim Northover via llvm-dev
>> I'm not so sure.  iota is a generally useful operation and scaling it to
>> various step values is also useful.  It's used often for strided memory
>> access, which would be done via gather/scatter in LLVM but generating a
>> vector GEP via stepvector would be convenient and convey more semantic
>> information than, say, loading a constant vector of indices to feed the
>> GEP.
>
> My point is that those patterns will be generated by C-level
> intrinsics or IR optimisation passes (like vectorisers), so they have
> a specific meaning in that context.
>
> What I fear is if some other pass like CSE finds the patterns out and
> common them up at the top of the function/BB and then the back-end
> loses sight of what that was and can't generate the step increment
> instruction in the end.

Got it.  Graham hit this point as well.  I took your suggestion as
"fusing" the iota/scale/offset together.  I would still want to be able
to generate things like stepvector without scaling and adding offsets (I
suppose a scale of 1 and offset of 0 would be ok, but ugly).  I don't
really care if we prevent CSE of such things.

                            -David

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Tim Northover via llvm-dev
In reply to this post by Tim Northover via llvm-dev
Hi David,

>>> The name "getSizeExpressionInBits" makes me think that a Value
>>> expression will be returned (something like a ConstantExpr that uses
>>> vscale).  I would be surprised to get a pair of integers back.  Do
>>> clients actually need constant integer values or would a ConstantExpr
>>> suffice?  We could add a ConstantVScale or something to make it work.
>>
>> I agree the name is not ideal and I'm open to suggestions -- I was thinking of the two
>> integers representing the known-at-compile-time terms in an expression:
>> '(scaled_bits * vscale) + unscaled_bits'.
>>
>> Assuming the pair is of the form (unscaled, scaled), then for a type with a size known at
>> compile time like <4 x i32> the size would be (128, 0).
>>
>> For a scalable type like <scalable 4 x i32> the size would be (0, 128).
>>
>> For a struct with, say, a <scalable 32 x i8> and an i64, it would be (64, 256).
>>
>> When calculating the offset for memory addresses, you just need to multiply the scaled
>> part by vscale and add the unscaled as is.
>
> Ok, now I understand what you're getting at.  A ConstantExpr would
> encapsulate this computation.  We already have "non-static-constant"
> values for ConstantExpr like sizeof and offsetof.  I would see
> VScaleConstant in that same tradition.  In your struct example,
> getSizeExpressionInBits would return:
>
> add(mul(256, vscale), 64)
>
> Does that satisfy your needs?

Ah, I think the use of 'expression' in the name definitely confuses the issue then. This
isn't for expressing the size in IR, where you would indeed just multiply by vscale and
add any fixed-length size.

This is for the analysis code around the IR -- lots of code asks for the size of a Type in
bits to determine what it can do to a Value with that type. Some of them are specific to
scalar Types, like determining whether a sign/zero extend is needed. Others would
apply to vector types (including scalable vectors), such as checking whether two
Types have the exact same size so that a bitcast can be used instead of a more
expensive operation like copying to memory and back to convert.

See 'getTypeSizeInBits' and 'getTypeStoreSizeInBits' in DataLayout -- they're used
a few hundred times throughout the codebase, and to properly support scalable
types we'd need to return something that isn't just a single integer. Since most
backends won't support scalable vectors I suggested having a 'FixedSize' method
that just returns the single integer, but it may be better to just leave the existing method
as is and create a new method with 'Scalable' or 'VariableLength' or similar in the
name to make it more obvious in common code.

There are a few places where changes in IR may be needed; 'lifetime.start' markers in
IR embed size data, and we would need to either add a scalable term to that or
find some other way of indicating the size. That can be dealt with when we try to
add support for the SVE ACLE though.

>
> Is there anything about vscale or a scalable vector that requires a
> minimum bit width?  For example, is this legal?
>
> <scalable 1 x double>
>
> I know it won't map to an SVE type.  I'm simply curious because
> traditionally Cray machines defined vectors in terms of
> machine-dependent "maxvl" with an element type, so with the above vscale
> would == maxvl.  Not that we make any such things anymore.  But maybe
> someone else does?

That's legal in IR, yes, and we believe it should be usable to represent the vectors for
RISC-V's 'V' extension. The main problem there is that they have a dynamic vector
length within the loop so that they can perform the last iterations of a loop within vector
registers when there's less than a full register worth of data remaining. SVE uses
predication (masking) to achieve the same effect.

For the 'V' extension, vscale would indeed correspond to 'maxvl', and I'm hoping that a
'setvl' intrinsic that provides a predicate will avoid the need for modelling a change in
dynamic vector length -- reducing the vector length is effectively equivalent to an implied
predicate on all operations. This avoids needing to add a token operand to all existing
instructions that work on vector types.
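A small C++ model of that equivalence (hypothetical; the actual intrinsic's shape is still to be decided): a 'setvl'-style operation can be viewed as producing a predicate that enables only the first vl lanes, leaving everything else to ordinary masking.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Model of the 'setvl' idea: instead of changing the dynamic vector
// length, return a predicate that is true for the first
// min(Remaining, MaxVL) lanes and false elsewhere.  Predicated
// operations then leave the disabled lanes untouched, which is
// equivalent to running with a shorter vector register.
std::vector<bool> setvl(uint64_t Remaining, uint64_t MaxVL) {
  uint64_t VL = std::min(Remaining, MaxVL);
  std::vector<bool> Pred(MaxVL, false);
  for (uint64_t I = 0; I < VL; ++I)
    Pred[I] = true;
  return Pred;
}
```

For a loop tail with 3 elements left and MaxVL of 8, this yields a predicate with exactly the first 3 lanes active, the same effect SVE achieves with whilelo-style predication.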

-Graham






>
>>> If we went the ConstantExpr route and added ConstantExpr support to
>>> ScalarEvolution, then SCEVs could be compared to do this size
>>> comparison.  We have code here that adds ConstantExpr support to
>>> ScalarEvolution.  We just didn't know if anyone else would be interested
>>> in it since we added it solely for our Fortran frontend.
>>
>> We added a dedicated SCEV expression class for vscale instead; I suspect it works
>> either way.
>
> Yes, that's probably true.  A vscale SCEV is less invasive.
>
>> We've tried it as both an instruction and as a 'Constant', and both work fine with
>> ScalarEvolution. I have not yet tried it with the intrinsic.
>
> vscale as a Constant is interesting.  It's a target-dependent Constant
> like sizeof and offsetof.  It doesn't have a statically known value and
> maybe isn't "constant" across functions.  So it's a strange kind of
> constant.
>
> Ultimately whatever is easier for LLVM to analyze in the long run is
> best.  Intrinsics often block optimization.  I don't know whether vscale
> would be "easier" as a Constant or an Instruction.
>
>>> As above, we could add ConstantVScale and also ConstantStepVector (or
>>> ConstantIota).  They won't fold to compile-time values but the
>>> expressions could be simplified.  I haven't really thought through the
>>> implications of this, just brainstorming ideas.  What does your
>>> downstream compiler require in terms of constant support?  What kinds of
>>> queries does it need to do?
>>
>> It makes things a little easier to pattern match (just looking for a constant to start
>> instead of having to match multiple different forms of vscale or stepvector multiplied
>> and/or added in each place you're looking for them).
>
> Ok.  Normalization could help with this but I certainly understand the
> issue.
>
>> The bigger reason we currently depend on them being constant is that code generation
>> generally looks at a single block at a time, and there are several expressions using
>> vscale that we don't want to be generated in one block and passed around in a register,
>> since many of the load/store addressing forms for instructions will already scale properly.
>
> This is kind of like X86 memop folding.  If a load has multiple uses, it
> won't be folded, on the theory that one load is better than many folded
> loads.  If a load has exactly one use, it will fold.  There's explicit
> predicate code in the X86 backend to enforce this requirement.  I
> suspect if the X86 backend tried to fold a single load into multiple
> places, Bad Things would happen (needed SDNodes might disappear, etc.).
>
> Codegen probably doesn't understand non-statically-constant
> ConstantExprs, since sizeof or offsetof can be resolved by the target
> before instruction selection.
>
>> We've done this downstream by having them be Constants, but if there's a good way
>> of doing them with intrinsics we'd be fine with that too.
>
> If vscale/stepvector as Constants works, it seems fine to me.
>
>                               -David


Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Tim Northover via llvm-dev
Graham Hunter via llvm-dev <[hidden email]> writes:

>> Ok, now I understand what you're getting at.  A ConstantExpr would
>> encapsulate this computation.  We already have "non-static-constant"
>> values for ConstantExpr like sizeof and offsetof.  I would see
>> VScaleConstant in that same tradition.  In your struct example,
>> getSizeExpressionInBits would return:
>>
>> add(mul(256, vscale), 64)
>>
>> Does that satisfy your needs?
>
> Ah, I think the use of 'expression' in the name definitely confuses the issue then. This
> isn't for expressing the size in IR, where you would indeed just multiply by vscale and
> add any fixed-length size.

Ok, thanks for clarifying.  The use of "expression" is confusing.

> This is for the analysis code around the IR -- lots of code asks for the size of a Type in
> bits to determine what it can do to a Value with that type. Some of them are specific to
> scalar Types, like determining whether a sign/zero extend is needed. Others would
> apply to vector types (including scalable vectors), such as checking whether two
> Types have the exact same size so that a bitcast can be used instead of a more
> expensive operation like copying to memory and back to convert.

If this method returns two integers, how does LLVM interpret the
comparison?  If the return value is { <unscaled>, <scaled> } then how
do, say { 1024, 0 } and { 0, 128 } compare?  Doesn't it depend on the
vscale?  They could be the same size or not, depending on the target
characteristics.

Are bitcasts between scaled types and non-scaled types disallowed?  I
could certainly see an argument for disallowing it.  I could argue that
for bitcasting purposes that the unscaled and scaled parts would have to
exactly match in order to do a legal bitcast.  Is that the intent?

>> Is there anything about vscale or a scalable vector that requires a
>> minimum bit width?  For example, is this legal?
>>
>> <scalable 1 x double>
>>
>> I know it won't map to an SVE type.  I'm simply curious because
>> traditionally Cray machines defined vectors in terms of
>> machine-dependent "maxvl" with an element type, so with the above vscale
>> would == maxvl.  Not that we make any such things anymore.  But maybe
>> someone else does?
>
> That's legal in IR, yes, and we believe it should be usable to represent the vectors for
> RISC-V's 'V' extension. The main problem there is that they have a dynamic vector
> length within the loop so that they can perform the last iterations of a loop within vector
> registers when there's less than a full register worth of data remaining. SVE uses
> predication (masking) to achieve the same effect.
>
> For the 'V' extension, vscale would indeed correspond to 'maxvl', and I'm hoping that a
> 'setvl' intrinsic that provides a predicate will avoid the need for modelling a change in
> dynamic vector length -- reducing the vector length is effectively equivalent to an implied
> predicate on all operations. This avoids needing to add a token operand to all existing
> instructions that work on vector types.

Right.  In that way the RISC-V method is very much like what the old
Cray machines did with the Vector Length register.

So in LLVM IR you would have "setvl" return a predicate and then apply
that predicate to operations using the current select method?  How does
instruction selection map that back onto a simple setvl + unpredicated
vector instructions?

For conditional code both vector length and masking must be taken into
account.  If "setvl" returns a predicate then that predicate would have
to be combined in some way with the conditional predicate (typically via
an AND operation in an IR that directly supports predicates).  Since
LLVM IR doesn't have predicates _per_se_, would it turn into nested
selects or something?  Untangling that in instruction selection seems
difficult but perhaps I'm missing something.

                                 -David

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Tim Northover via llvm-dev
Hi,

> On 6 Jun 2018, at 17:36, David A. Greene <[hidden email]> wrote:
>
> Graham Hunter via llvm-dev <[hidden email]> writes:
>
>>> Ok, now I understand what you're getting at.  A ConstantExpr would
>>> encapsulate this computation.  We alreay have "non-static-constant"
>>> values for ConstantExpr like sizeof and offsetof.  I would see
>>> VScaleConstant in that same tradition.  In your struct example,
>>> getSizeExpressionInBits would return:
>>>
>>> add(mul(256, vscale), 64)
>>>
>>> Does that satisfy your needs?
>>
>> Ah, I think the use of 'expression' in the name definitely confuses the issue then. This
>> isn't for expressing the size in IR, where you would indeed just multiply by vscale and
>> add any fixed-length size.
>
> Ok, thanks for clarifying.  The use of "expression" is confusing.
>
>> This is for the analysis code around the IR -- lots of code asks for the size of a Type in
>> bits to determine what it can do to a Value with that type. Some of them are specific to
>> scalar Types, like determining whether a sign/zero extend is needed. Others would
>> apply to vector types (including scalable vectors), such as checking whether two
>> Types have the exact same size so that a bitcast can be used instead of a more
>> expensive operation like copying to memory and back to convert.
>
> If this method returns two integers, how does LLVM interpret the
> comparison?  If the return value is { <unscaled>, <scaled> } then how
> do, say { 1024, 0 } and { 0, 128 } compare?  Doesn't it depend on the
> vscale?  They could be the same size or not, depending on the target
> characteristics.

I did have a paragraph on that in the RFC, but perhaps a list is a
better format (assuming X, Y, etc. are non-zero):

{ X, 0 } <cmp> { Y, 0 }: Normal unscaled comparison.

{ 0, X } <cmp> { 0, Y }: Normal comparison within a function, or across
                         functions that inherit vector length. Cannot be
                         compared across non-inheriting functions.

{ X, 0 } > { 0, Y }: Cannot return true.

{ X, 0 } = { 0, Y }: Cannot return true.

{ X, 0 } < { 0, Y }: Can return true.

{ Xu, Xs } <cmp> { Yu, Ys }: Gets complicated, need to subtract common
                             terms and try the above comparisons; it
                             may not be possible to get a good answer.

I don't know if we need a 'maybe' result for cases comparing scaled
vs. unscaled; I believe the gcc implementation of SVE allows for such
results, but that supports a generic polynomial length representation.

I think in code, we'd have an inline function to deal with the first case
and a likely-not-taken call to a separate function to handle all the
scalable cases.
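That separate function might look something like this C++ sketch (illustrative only; it assumes vscale >= 1 and that scaled terms are comparable, i.e. same function or the inherit flag applies):

```cpp
#include <algorithm>
#include <cstdint>

// Possible answers when comparing sizes that may depend on the
// runtime multiple vscale: a definite result, or Unknown when the
// answer depends on the actual value of vscale.
enum class SizeCmp { Smaller, Equal, Larger, Unknown };

// Size in bits, as Unscaled + Scaled * vscale.
struct ScalableSize {
  uint64_t Unscaled;
  uint64_t Scaled;
};

SizeCmp compareSizes(ScalableSize A, ScalableSize B) {
  // Subtract the common terms, so at most one side keeps an unscaled
  // residue and at most one side keeps a scaled residue.
  uint64_t CommonU = std::min(A.Unscaled, B.Unscaled);
  uint64_t CommonS = std::min(A.Scaled, B.Scaled);
  A.Unscaled -= CommonU; B.Unscaled -= CommonU;
  A.Scaled -= CommonS;   B.Scaled -= CommonS;

  bool AZero = A.Unscaled == 0 && A.Scaled == 0;
  bool BZero = B.Unscaled == 0 && B.Scaled == 0;
  if (AZero && BZero) return SizeCmp::Equal;
  if (BZero) return SizeCmp::Larger;  // A has leftover bits; vscale >= 1.
  if (AZero) return SizeCmp::Smaller;

  // One side is purely unscaled, the other purely scaled.  X unscaled
  // bits vs Y scaled bits is definitely smaller only when X < Y
  // (since Y * vscale >= Y); otherwise the result depends on vscale.
  if (A.Scaled == 0 && A.Unscaled < B.Scaled) return SizeCmp::Smaller;
  if (B.Scaled == 0 && B.Unscaled < A.Scaled) return SizeCmp::Larger;
  return SizeCmp::Unknown;
}
```

This reproduces the list above: { X, 0 } vs { 0, Y } never returns Larger or Equal, can return Smaller when X < Y, and mixed terms fall back to subtracting common parts first.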

> Are bitcasts between scaled types and non-scaled types disallowed?  I
> could certainly see an argument for disallowing it.  I could argue that
> for bitcasting purposes that the unscaled and scaled parts would have to
> exactly match in order to do a legal bitcast.  Is that the intent?

I would propose disallowing bitcasts, but allowing extracting a subvector
if the minimum number of scaled bits matches the number of unscaled bits.

>
>>> Is there anything about vscale or a scalable vector that requires a
>>> minimum bit width?  For example, is this legal?
>>>
>>> <scalable 1 x double>
>>>
>>> I know it won't map to an SVE type.  I'm simply curious because
>>> traditionally Cray machines defined vectors in terms of
>>> machine-dependent "maxvl" with an element type, so with the above vscale
>>> would == maxvl.  Not that we make any such things anymore.  But maybe
>>> someone else does?
>>
>> That's legal in IR, yes, and we believe it should be usable to represent the vectors for
>> RISC-V's 'V' extension. The main problem there is that they have a dynamic vector
>> length within the loop so that they can perform the last iterations of a loop within vector
>> registers when there's less than a full register worth of data remaining. SVE uses
>> predication (masking) to achieve the same effect.
>>
>> For the 'V' extension, vscale would indeed correspond to 'maxvl', and I'm hoping that a
>> 'setvl' intrinsic that provides a predicate will avoid the need for modelling a change in
>> dynamic vector length -- reducing the vector length is effectively equivalent to an implied
>> predicate on all operations. This avoids needing to add a token operand to all existing
>> instructions that work on vector types.
>
> Right.  In that way the RISC V method is very much like what the old
> Cray machines did with the Vector Length register.
>
> So in LLVM IR you would have "setvl" return a predicate and then apply
> that predicate to operations using the current select method?  How does
> instruction selection map that back onto a simple setvl + unpredicated
> vector instructions?
>
> For conditional code both vector length and masking must be taken into
> account.  If "setvl" returns a predicate then that predicate would have
> to be combined in some way with the conditional predicate (typically via
> an AND operation in an IR that directly supports predicates).  Since
> LLVM IR doesn't have predicates _per_se_, would it turn into nested
> selects or something?  Untangling that in instruction selection seems
> difficult but perhaps I'm missing something.

My idea is for the RISC-V backend to recognize when a setvl intrinsic has
been used, and replace the use of its value in AND operations with an
all-true value (with constant folding to remove unnecessary ANDs) then
replace any masked instructions (generally loads, stores, anything else
that might generate an exception or modify state that it shouldn't) with
target-specific nodes that understand the dynamic vlen.

This could be part of lowering, or maybe a separate IR pass, rather than ISel.
I *think* this will work, but if someone can come up with some IR where it
wouldn't work then please let me know (e.g. global-state-changing instructions
that could move out of blocks where one setvl predicate is used and into one
where another is used).

Unfortunately, I can't find a description of the instructions included in
the 'V' extension in the online manual (other than setvl or configuring
registers), so I can't tell if there's something I'm missing.

-Graham

_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Tim Northover via llvm-dev
On Fri, Jun 8, 2018 at 4:10 AM, Graham Hunter via llvm-dev <[hidden email]> wrote:
[...]
Unfortunately, I can't find a description of the instructions included in
the 'V' extension in the online manual (other than setvl or configuring
registers), so I can't tell if there's something I'm missing.

RVV is a little bit behind SVE in the process :-) On the whole it's following the style of vector processor that has had several implementations at Berkeley, dating back a decade or more. The set of operations is pretty much nailed down now, but things such as the exact instruction encodings are still in flux. There is an intention to get some experience with compilers and FPGA (at least) implementations of the proposal before ratifying it as part of the RISC-V standard. So details could well change during that period.



Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Tim Northover via llvm-dev
In reply to this post by Tim Northover via llvm-dev
Hi Graham,

First of all, thanks a lot for updating the RFC and also for putting up the
patches, they are quite interesting and illuminate some details I was curious
about. I have some minor questions and comments inline below but overall I
believe this is both a reasonably small extension of LLVM IR and powerful
enough to support SVE, RVV, and hopefully future ISAs with variable length
vectors. Details may change as we gather more experience, but it's a very
good starting point.

One thing I am missing is a discussion of how stack frame layout will be
handled. Earlier RFCs mentioned a concept called "Stack Regions" but (IIRC)
gave little details and it was a long time ago anyway. What are your current
plans here?

I'll reply separately to the sub-thread about RISC-V codegen.

Cheers,
Robin

On 5 June 2018 at 15:15, Graham Hunter <[hidden email]> wrote:
Hi,

Now that Sander has committed enough MC support for SVE, here's an updated
RFC for variable length vector support with a set of 14 patches (listed at the end)
to demonstrate code generation for SVE using the extensions proposed in the RFC.

I have some ideas about how to support RISC-V's upcoming extension alongside
SVE; I'll send an email with some additional comments on Robin's RFC later.

Feedback and questions welcome.

-Graham

=============================================================
Supporting SIMD instruction sets with variable vector lengths
=============================================================

In this RFC we propose extending LLVM IR to support code-generation for variable
length vector architectures like Arm's SVE or RISC-V's 'V' extension. Our
approach is backwards compatible and should be as non-intrusive as possible; the
only change needed in other backends is how size is queried on vector types, and
it only requires a change in which function is called. We have created a set of
proof-of-concept patches to represent a simple vectorized loop in IR and
generate SVE instructions from that IR. These patches (listed in section 7 of
this RFC) can be found on Phabricator and are intended to illustrate the scope
of changes required by the general approach described in this RFC.

==========
Background
==========

*ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension for
AArch64 which is intended to scale with hardware such that the same binary
running on a processor with longer vector registers can take advantage of the
increased compute power without recompilation.

As the vector length is no longer a compile-time known value, the way in which
the LLVM vectorizer generates code requires modifications such that certain
values are now runtime evaluated expressions instead of compile-time constants.

Documentation for SVE can be found at
https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a

========
Contents
========

The rest of this RFC covers the following topics:

1. Types -- a proposal to extend VectorType to be able to represent vectors that
   have a length which is a runtime-determined multiple of a known base length.

2. Size Queries - how to reason about the size of types for which the size isn't
   fully known at compile time.

3. Representing the runtime multiple of vector length in IR for use in address
   calculations and induction variable comparisons.

4. Generating 'constant' values in IR for vectors with a runtime-determined
   number of elements.

5. A brief note on code generation of these new operations for AArch64.

6. An example of C code and matching IR using the proposed extensions.

7. A list of patches demonstrating the changes required to emit SVE instructions
   for a loop that has already been vectorized using the extensions described
   in this RFC.

========
1. Types
========

To represent a vector of unknown length a boolean `Scalable` property has been
added to the `VectorType` class, which indicates that the number of elements in
the vector is a runtime-determined integer multiple of the `NumElements` field.
Most code that deals with vectors doesn't need to know the exact length, but
does need to know relative lengths -- e.g. get a vector with the same number of
elements but a different element type, or with half or double the number of
elements.

In order to allow code to transparently support scalable vectors, we introduce
an `ElementCount` class with two members:

- `unsigned Min`: the minimum number of elements.
- `bool Scalable`: is the element count an unknown multiple of `Min`?

For non-scalable vectors (``Scalable=false``) the scale is considered to be
equal to one and thus `Min` represents the exact number of elements in the
vector.

The intent for code working with vectors is to use convenience methods and avoid
directly dealing with the number of elements. If needed, calling
`getElementCount` on a vector type instead of `getVectorNumElements` can be used
to obtain the (potentially scalable) number of elements. Overloaded division and
multiplication operators allow an ElementCount instance to be used in much the
same manner as an integer for most cases.

This mixture of compile-time and runtime quantities allows us to reason about the
relationship between different scalable vector types without knowing their
exact length.
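A rough sketch of what such a class might look like (illustrative only; the actual patch may define it differently):

```cpp
#include <cassert>

// Illustrative sketch of the proposed ElementCount: a minimum element
// count, optionally multiplied by an unknown runtime factor (vscale).
struct ElementCount {
  unsigned Min;  // minimum number of elements
  bool Scalable; // true if the real count is vscale * Min

  // Overloaded operators let callers halve/double element counts
  // without caring whether the vector is scalable.
  ElementCount operator*(unsigned RHS) const { return {Min * RHS, Scalable}; }
  ElementCount operator/(unsigned RHS) const { return {Min / RHS, Scalable}; }
  bool operator==(const ElementCount &RHS) const {
    return Min == RHS.Min && Scalable == RHS.Scalable;
  }
};
```

For instance, halving `<scalable 4 x i32>` this way yields the count for `<scalable 2 x i32>`, exactly as it would for fixed-length vectors.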

The runtime multiple is not expected to change during program execution for SVE,
but it is possible. The model of scalable vectors presented in this RFC assumes
that the multiple will be constant within a function but not necessarily across
functions. As suggested in the recent RISC-V rfc, a new function attribute to
inherit the multiple across function calls will allow for function calls with
vector arguments/return values and inlining/outlining optimizations.

IR Textual Form
---------------

The textual form for a scalable vector is:

``<scalable <n> x <type>>``

where `type` is the scalar type of each element, `n` is the minimum number of
elements, and the string literal `scalable` indicates that the total number of
elements is an unknown multiple of `n`; `scalable` is just an arbitrary choice
for indicating that the vector is scalable, and could be substituted by another keyword.
For fixed-length vectors, the `scalable` is omitted, so there is no change in
the format for existing vectors.

Scalable vectors with the same `Min` value have the same number of elements, and
the same number of bytes if `Min * sizeof(type)` is the same (assuming they are
used within the same function):

``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same number of
  elements.

``<scalable 4 x i32>`` and ``<scalable 8 x i16>`` have the same number of
  bytes.

IR Bitcode Form
---------------

To serialize scalable vectors to bitcode, a new boolean field is added to the
type record. If the field is not present the type will default to a fixed-length
vector type, preserving backwards compatibility.

Alternatives Considered
-----------------------

We did consider one main alternative -- a dedicated target type, like the
x86_mmx type.

A dedicated target type would either need to extend all existing passes that
work with vectors to recognize the new type, or to duplicate all that code
in order to get reasonable code generation and autovectorization.

This hasn't been done for the x86_mmx type, and so it is only capable of
providing support for C-level intrinsics instead of being used and recognized by
passes inside LLVM.

Although our current solution will need to change some of the code that creates
new VectorTypes, much of that code doesn't need to care about whether the types
are scalable or not -- they can use preexisting methods like
`getHalfElementsVectorType`. If the code is a little more complex,
`ElementCount` structs can be used instead of an `unsigned` value to represent
the number of elements.

===============
2. Size Queries
===============

This is a proposal for how to deal with querying the size of scalable types.
While it has not been implemented in full, the general approach works well
for calculating offsets into structures with scalable types in a modified
version of ComputeValueVTs in our downstream compiler.

Current IR types that have a known size all return a single integer constant.
For scalable types a second integer is needed to indicate the number of bytes
which need to be scaled by the runtime multiple to obtain the actual length.

For primitive types, getPrimitiveSizeInBits will function as it does today,
except that it will no longer return a size for vector types (it will return 0,
as it does for other derived types). The majority of calls to this function are
already for scalar rather than vector types.

For derived types, a function (getSizeExpressionInBits) to return a pair of
integers (one to indicate unscaled bits, the other for bits that need to be
scaled by the runtime multiple) will be added. For backends that do not need to
deal with scalable types, another function (getFixedSizeExpressionInBits) that
only returns unscaled bits will be provided, with a debug assert that the type
isn't scalable.

Similar functionality will be added to DataLayout.
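Accumulating the pair over an aggregate's members can be sketched as follows (hypothetical helper, ignoring padding and alignment). The struct example discussed earlier in the thread, with a `<scalable 4 x i64>` member and an `i64` member, would yield 64 unscaled bits and 256 scaled bits, i.e. `add(mul(256, vscale), 64)`:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct SizeExpr {
  uint64_t Unscaled; // bits known at compile time
  uint64_t Scaled;   // bits to be multiplied by the runtime vscale
};

// Sum member sizes to get an aggregate's size pair (toy version; a real
// implementation would also account for alignment padding).
SizeExpr structSizeInBits(const std::vector<SizeExpr> &Fields) {
  SizeExpr Total{0, 0};
  for (const SizeExpr &F : Fields) {
    Total.Unscaled += F.Unscaled;
    Total.Scaled += F.Scaled;
  }
  return Total;
}
```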

Comparing two of these sizes is straightforward if only unscaled sizes
are used. Comparisons between scaled sizes are also simple when comparing
within a function (or across functions with the inheritance attribute
mentioned in the type changes), but cannot be performed otherwise. If a
mix is present, then any number of unscaled bits will not be considered to
have a greater size than a smaller number of scaled bits, but a smaller
number of unscaled bits will be considered to have a smaller size than a
greater number of scaled bits (since the runtime multiple is at least one).

You mention it's not fully implemented yet, but perhaps you have some thoughts
on what the APIs for this should look like?

For size comparisons it's concerning that it could silently give misleading
results when operating across function boundaries. One solution could be a
function-specific API like `compareSizesIn(Type *, Type *, Function *)`, but
that extra parameter may turn out to be very viral. OTOH, maybe that's
unavoidable complexity from having type "sizes" vary between functions.

Alternatively, general size queries could return "incomparable" and "only if
in the same function" results in addition to smaller/larger/equal. This might
nudge code to handling all these possibilities as well as it can.

Future Work
-----------

Since we cannot determine the exact size of a scalable vector, the
existing logic for alias detection won't work when multiple accesses
share a common base pointer with different offsets.

However, SVE's predication will mean that a dynamic 'safe' vector length
can be determined at runtime, so after initial support has been added we
can work on vectorizing loops using runtime predication to avoid aliasing
problems.

Alternatives Considered
-----------------------

Marking scalable vectors as unsized doesn't work well, as many parts of
LLVM dealing with loads and stores assert that 'isSized()' returns true
and make use of the size when calculating offsets.

Seconded, I encountered those as well.

We have considered introducing multiple helper functions instead of
using direct size queries, but that doesn't cover all cases. It may
still be a good idea to introduce them to make the purpose in a given
case more obvious, e.g. 'isBitCastableTo(Type*,Type*)'.
 
+1 for clear helpers, but this sentiment is somewhat independent of the
changes to type sizes. For example, `isBitCastableTo` seems appealing to me
not just because it clarifies the intent of the size comparison but also
because it can encapsulate all the cases where types have the same size but
bitcasts still aren't allowed (ptr<->int, aggregates). With a good API for the
{scaled, unscaled} pairs, the size comparison should not be much more complex
than today.

Aside: I was curious, so I grepped and found that this specific predicate
already exists under the name CastInst::isBitCastable.
 
========================================
3. Representing Vector Length at Runtime
========================================

With a scalable vector type defined, we now need a way to represent the runtime
length in IR in order to generate addresses for consecutive vectors in memory
and determine how many elements have been processed in an iteration of a loop.

We have added an experimental `vscale` intrinsic to represent the runtime
multiple. Multiplying the result of this intrinsic by the minimum number of
elements in a vector gives the total number of elements in a scalable vector.

Fixed-Length Code
-----------------

Assuming a vector type of <4 x <ty>>
``
vector.body:
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  ;; <loop body>
  ;; Increment induction var
  %index.next = add i64 %index, 4
  ;; <check and branch>
``
Scalable Equivalent
-------------------

Assuming a vector type of <scalable 4 x <ty>>
``
vector.body:
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  ;; <loop body>
  ;; Increment induction var
  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()

I didn't see anything about this in the text (apologies if I missed
something), but it appears this intrinsic is overloaded to be able to return
any integer width. It's not a big deal either way, but is there a particular
reason for doing that, rather than picking one sufficiently large integer type
and combining it with trunc/zext as appropriate?
 
  %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
 
Just to check, is the nesting `add i64 %index, mul (i64 %vscale64, i64 4)` a
pseudo-IR shorthand or an artifact from when vscale was proposed as a constant
expression or something? I would have expected:

```
%vscale64 = call i64 @llvm.experimental.vector.vscale.64()
%vscale64.x4 = mul i64 %vscale64, 4
%index.next = add i64 %index, %vscale64.x4
```

  ;; <check and branch>
``
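To make the induction update concrete, here is a toy C++ simulation (not compiler code) of how the index advances for a given vscale; `n.vec` is assumed to be a multiple of the full vector length, as the vectorizer would guarantee:

```cpp
#include <cassert>
#include <cstdint>

// Count the loop iterations executed by the scalable vector body above.
// Each iteration advances the induction variable by vscale * 4 elements
// (for <scalable 4 x ty>); NVec must be a multiple of that step.
uint64_t countVectorIterations(uint64_t NVec, uint64_t VScale) {
  uint64_t Index = 0, Iters = 0;
  while (Index != NVec) {
    Index += VScale * 4; // %index.next = add %index, vscale * 4
    ++Iters;
  }
  return Iters;
}
```

With 128-bit registers (vscale == 1) a trip count of 64 takes 16 iterations; the same binary on 512-bit registers (vscale == 4) takes only 4.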
===========================
4. Generating Vector Values
===========================
For constant vector values, we cannot specify all the elements as we can for
fixed-length vectors; fortunately only a small number of easily synthesized
patterns are required for autovectorization. The `zeroinitializer` constant
can be used in the same manner as fixed-length vectors for a constant zero
splat. This can then be combined with `insertelement` and `shufflevector`
to create arbitrary value splats in the same manner as fixed-length vectors.

For constants consisting of a sequence of values, an experimental `stepvector`
intrinsic has been added to represent a simple constant of the form
`<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new
start can be added, and changing the step requires multiplying by a splat.

+1 for making this intrinsic minimal and having canonical IR instruction
sequences for things like stride and starting offset.
 
Fixed-Length Code
-----------------
``
  ;; Splat a value
  %insert = insertelement <4 x i32> undef, i32 %value, i32 0
  %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x i32> zeroinitializer
  ;; Add a constant sequence
  %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
``
Scalable Equivalent
-------------------
``
  ;; Splat a value
  %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
  %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
  ;; Splat offset + stride (the same in this case)
  %insert2 = insertelement <scalable 4 x i32> undef, i32 2, i32 0
  %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
  ;; Create sequence for scalable vector
  %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
  %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
  %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
  ;; Add the runtime-generated sequence
  %add = add <scalable 4 x i32> %splat, %addoffset
``
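For a concrete vscale, the sequence built above is just `splat + offset + stride * <0, 1, 2, ...>`; a toy simulation (illustrative, not compiler code):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simulate the stepvector/splat arithmetic for vscale * 4 lanes:
// lane[i] = Splat + Offset + Stride * i, matching the IR sequence
// (%stepvector * stride) + offset, added to the value splat.
std::vector<int32_t> buildSequence(unsigned VScale, int32_t Splat,
                                   int32_t Offset, int32_t Stride) {
  std::vector<int32_t> V(VScale * 4);
  for (unsigned I = 0; I < V.size(); ++I)
    V[I] = Splat + Offset + Stride * static_cast<int32_t>(I);
  return V;
}
```

In the scalable example the same `%str_off` splat (value 2) serves as both offset and stride, so with `%value = 10` the first lanes are 12, 14, 16, 18 — matching the fixed-length `<i32 2, i32 4, i32 6, i32 8>` constant added to the splat.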
Future Work
-----------

Intrinsics cannot currently be used for constant folding. Our downstream
compiler (using Constants instead of intrinsics) relies quite heavily on this
for good code generation, so we will need to find new ways to recognize and
fold these values.

==================
5. Code Generation
==================

IR splats will be converted to an experimental splatvector intrinsic in
SelectionDAGBuilder.

All three intrinsics are custom lowered and legalized in the AArch64 backend.

Two new AArch64ISD nodes have been added to represent the same concepts
at the SelectionDAG level, while splatvector maps onto the existing
AArch64ISD::DUP.

GlobalISel
----------

Since GlobalISel was enabled by default on AArch64, it was necessary to add
scalable vector support to the LowLevelType implementation. A single bit was
added to the raw_data representation for vectors and vectors of pointers.

In addition, types that only exist in destination patterns are added to
the enumeration of available types for generated code. While this may not be
necessary in future, generating an all-true 'ptrue' value was necessary to
convert a predicated instruction into an unpredicated one.

==========
6. Example
==========

The following example shows a simple C loop which assigns the array index to
the array elements matching that index. The IR shows how vscale and stepvector
are used to create the needed values and to advance the index variable in the
loop.

C Code
------

``
void IdentityArrayInit(int *a, int count) {
  for (int i = 0; i < count; ++i)
    a[i] = i;
}
``

Scalable IR Vector Body
-----------------------

``
vector.body.preheader:
  ;; Other setup
  ;; Stepvector used to create initial identity vector
  %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
  br label %vector.body

vector.body:
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]

           ;; stepvector used for index identity on entry to loop body ;;
  %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],
                                     [ %stepvector, %vector.body.preheader ]
  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
  %vscale32 = trunc i64 %vscale64 to i32
  %1 = add i64 %0, mul (i64 %vscale64, i64 4)

           ;; vscale splat used to increment identity vector ;;
  %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32 %vscale32, i32 4), i32 0
  %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
  %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
  %2 = getelementptr inbounds i32, i32* %a, i64 %0
  %3 = bitcast i32* %2 to <scalable 4 x i32>*
  store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align 4

           ;; vscale used to increment loop index
  %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
  %4 = icmp eq i64 %index.next, %n.vec
  br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
``

==========
7. Patches
==========

List of patches:

1. Extend VectorType: https://reviews.llvm.org/D32530
2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768
3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
5. SVE Calling Convention: https://reviews.llvm.org/D47771
6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
7. Add VScale intrinsic: https://reviews.llvm.org/D47773
8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
10. Initial store patterns: https://reviews.llvm.org/D47776
11. Initial addition patterns: https://reviews.llvm.org/D47777
12. Initial left-shift patterns: https://reviews.llvm.org/D47778
13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
14. Prevectorized loop unit test: https://reviews.llvm.org/D47780




Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Tim Northover via llvm-dev
Hi Robin,

> One thing I am missing is a discussion of how stack frame layout will be
> handled. Earlier RFCs mentioned a concept called "Stack Regions" but (IIRC)
> gave little details and it was a long time ago anyway. What are your current
> plans here?
Good that you're pointing this out. For now, we can be overly conservative and
assume the in-memory size of a scalable vector to be the architecturally defined
maximum. Naturally this is not very efficient as we allocate much more space
on the stack than needed.

I'll be posting a separate RFC that addresses a more optimal frame-layout
with scalable vectors in the coming weeks.

Cheers,

Sander


From: Robin Kruppe <[hidden email]>
Date: Friday, 8 June 2018 at 16:24
To: Graham Hunter <[hidden email]>
Cc: Chris Lattner <[hidden email]>, Hal Finkel <[hidden email]>, "Jones, Joel" <[hidden email]>, "[hidden email]" <[hidden email]>, Renato Golin <[hidden email]>, Kristof Beyls <[hidden email]>, Amara Emerson <[hidden email]>, Florian Hahn <[hidden email]>, Sander De Smalen <[hidden email]>, "[hidden email]" <[hidden email]>, "[hidden email]" <[hidden email]>, Sjoerd Meijer <[hidden email]>, Sam Parker <[hidden email]>, nd <[hidden email]>
Subject: Re: [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Hi Graham,

First of all, thanks a lot for updating the RFC and also for putting up the
patches, they are quite interesting and illuminate some details I was curious
about. I have some minor questions and comments inline below but overall I
believe this is both a reasonably small extension of LLVM IR and powerful
enough to support SVE, RVV, and hopefully future ISAs with variable length
vectors. Details may change as we gather more experience, but it's a very
good starting point.

One thing I am missing is a discussion of how stack frame layout will be
handled. Earlier RFCs mentioned a concept called "Stack Regions" but (IIRC)
gave little details and it was a long time ago anyway. What are your current
plans here?

I'll reply separately to the sub-thread about RISC-V codegen.

Cheers,
Robin

On 5 June 2018 at 15:15, Graham Hunter <[hidden email]> wrote:
Hi,

Now that Sander has committed enough MC support for SVE, here's an updated
RFC for variable length vector support with a set of 14 patches (listed at the end)
to demonstrate code generation for SVE using the extensions proposed in the RFC.

I have some ideas about how to support RISC-V's upcoming extension alongside
SVE; I'll send an email with some additional comments on Robin's RFC later.

Feedback and questions welcome.

-Graham

=============================================================
Supporting SIMD instruction sets with variable vector lengths
=============================================================

In this RFC we propose extending LLVM IR to support code-generation for variable
length vector architectures like Arm's SVE or RISC-V's 'V' extension. Our
approach is backwards compatible and should be as non-intrusive as possible; the
only change needed in other backends is how size is queried on vector types, and
it only requires a change in which function is called. We have created a set of
proof-of-concept patches to represent a simple vectorized loop in IR and
generate SVE instructions from that IR. These patches (listed in section 7 of
this RFC) can be found on Phabricator and are intended to illustrate the scope
of changes required by the general approach described in this RFC.

==========
Background
==========

*ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension for
AArch64 which is intended to scale with hardware such that the same binary
running on a processor with longer vector registers can take advantage of the
increased compute power without recompilation.

As the vector length is no longer a compile-time known value, the way in which
the LLVM vectorizer generates code requires modifications such that certain
values are now runtime evaluated expressions instead of compile-time constants.

Documentation for SVE can be found at
https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a

========
Contents
========

The rest of this RFC covers the following topics:

1. Types -- a proposal to extend VectorType to be able to represent vectors that
   have a length which is a runtime-determined multiple of a known base length.

2. Size Queries - how to reason about the size of types for which the size isn't
   fully known at compile time.

3. Representing the runtime multiple of vector length in IR for use in address
   calculations and induction variable comparisons.

4. Generating 'constant' values in IR for vectors with a runtime-determined
   number of elements.

5. A brief note on code generation of these new operations for AArch64.

6. An example of C code and matching IR using the proposed extensions.

7. A list of patches demonstrating the changes required to emit SVE instructions
   for a loop that has already been vectorized using the extensions described
   in this RFC.

========
1. Types
========

To represent a vector of unknown length a boolean `Scalable` property has been
added to the `VectorType` class, which indicates that the number of elements in
the vector is a runtime-determined integer multiple of the `NumElements` field.
Most code that deals with vectors doesn't need to know the exact length, but
does need to know relative lengths -- e.g. get a vector with the same number of
elements but a different element type, or with half or double the number of
elements.

In order to allow code to transparently support scalable vectors, we introduce
an `ElementCount` class with two members:

- `unsigned Min`: the minimum number of elements.
- `bool Scalable`: is the element count an unknown multiple of `Min`?

For non-scalable vectors (``Scalable=false``) the scale is considered to be
equal to one and thus `Min` represents the exact number of elements in the
vector.

The intent for code working with vectors is to use convenience methods and avoid
directly dealing with the number of elements. If needed, calling
`getElementCount` on a vector type instead of `getVectorNumElements` can be used
to obtain the (potentially scalable) number of elements. Overloaded division and
multiplication operators allow an ElementCount instance to be used in much the
same manner as an integer for most cases.

This mixture of compile-time and runtime quantities allows us to reason about
the relationship between different scalable vector types without knowing their
exact length.

For SVE, the runtime multiple is not expected to change during program
execution, though the architecture permits it. The model of scalable vectors
presented in this RFC assumes that the multiple will be constant within a
function but not necessarily across functions. As suggested in the recent
RISC-V RFC, a new function attribute to inherit the multiple across function
calls will allow for function calls with vector arguments/return values and
inlining/outlining optimizations.

IR Textual Form
---------------

The textual form for a scalable vector is:

``<scalable <n> x <type>>``

where `type` is the scalar type of each element, `n` is the minimum number of
elements, and the string literal `scalable` indicates that the total number of
elements is an unknown multiple of `n`; `scalable` is just an arbitrary choice
for indicating that the vector is scalable, and could be substituted by another.
For fixed-length vectors, the `scalable` is omitted, so there is no change in
the format for existing vectors.

Scalable vectors with the same `Min` value have the same number of elements, and
the same number of bytes if `Min * sizeof(type)` is the same (assuming they are
used within the same function):

``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same number of
  elements.

``<scalable 4 x i32>`` and ``<scalable 8 x i16>`` have the same number of
  bytes.

IR Bitcode Form
---------------

To serialize scalable vectors to bitcode, a new boolean field is added to the
type record. If the field is not present the type will default to a fixed-length
vector type, preserving backwards compatibility.

Alternatives Considered
-----------------------

We did consider one main alternative -- a dedicated target type, like the
x86_mmx type.

A dedicated target type would either need to extend all existing passes that
work with vectors to recognize the new type, or to duplicate all that code
in order to get reasonable code generation and autovectorization.

This hasn't been done for the x86_mmx type, and so it is only capable of
providing support for C-level intrinsics instead of being used and recognized by
passes inside LLVM.

Although our current solution will need to change some of the code that creates
new VectorTypes, much of that code doesn't need to care about whether the types
are scalable or not -- they can use preexisting methods like
`getHalfElementsVectorType`. If the code is a little more complex,
`ElementCount` structs can be used instead of an `unsigned` value to represent
the number of elements.

===============
2. Size Queries
===============

This is a proposal for how to deal with querying the size of scalable types.
While it has not been implemented in full, the general approach works well
for calculating offsets into structures with scalable types in a modified
version of ComputeValueVTs in our downstream compiler.

Current IR types that have a known size all return a single integer constant.
For scalable types a second integer is needed to indicate the number of bytes
which need to be scaled by the runtime multiple to obtain the actual length.

For primitive types, getPrimitiveSizeInBits will function as it does today,
except that it will no longer return a size for vector types (it will return 0,
as it does for other derived types). The majority of calls to this function are
already for scalar rather than vector types.

For derived types, a function (getSizeExpressionInBits) to return a pair of
integers (one to indicate unscaled bits, the other for bits that need to be
scaled by the runtime multiple) will be added. For backends that do not need to
deal with scalable types, another function (getFixedSizeExpressionInBits) that
only returns unscaled bits will be provided, with a debug assert that the type
isn't scalable.

Similar functionality will be added to DataLayout.

Comparing two of these sizes together is straightforward if only unscaled sizes
are used. Comparisons between scaled sizes are also simple when comparing sizes
within a function (or across functions with the inherit flag mentioned in the
changes to the type), but sizes cannot be compared otherwise. If a mix is present,
then any number of unscaled bits will not be considered to have a greater size
than a smaller number of scaled bits, but a smaller number of unscaled bits
will be considered to have a smaller size than a greater number of scaled bits
(since the runtime multiple is at least one).

You mention it's not fully implemented yet, but perhaps you have some thoughts
on what the APIs for this should look like?

For size comparisons it's concerning that it could silently give misleading
results when operating across function boundaries. One solution could be a
function-specific API like `compareSizesIn(Type *, Type *, Function *)`, but
that extra parameter may turn out to be very viral. OTOH, maybe that's
unavoidable complexity from having type "sizes" vary between functions.

Alternatively, general size queries could return "incomparable" and "only if
in the same function" results in addition to smaller/larger/equal. This might
nudge code to handling all these possibilities as well as it can.

Future Work
-----------

Since we cannot determine the exact size of a scalable vector, the
existing logic for alias detection won't work when multiple accesses
share a common base pointer with different offsets.

However, SVE's predication will mean that a dynamic 'safe' vector length
can be determined at runtime, so after initial support has been added we
can work on vectorizing loops using runtime predication to avoid aliasing
problems.

Alternatives Considered
-----------------------

Marking scalable vectors as unsized doesn't work well, as many parts of
LLVM dealing with loads and stores assert that 'isSized()' returns true
and make use of the size when calculating offsets.

Seconded, I encountered those as well.

We have considered introducing multiple helper functions instead of
using direct size queries, but that doesn't cover all cases. It may
still be a good idea to introduce them to make the purpose in a given
case more obvious, e.g. 'isBitCastableTo(Type*,Type*)'.
 
+1 for clear helpers, but this sentiment is somewhat independent of the
changes to type sizes. For example, `isBitCastableTo` seems appealing to me
not just because it clarifies the intent of the size comparison but also
because it can encapsulate all the cases where types have the same size but
bitcasts still aren't allowed (ptr<->int, aggregates). With a good API for the
{scaled, unscaled} pairs, the size comparison should not be much more complex
than today.

Aside: I was curious, so I grepped and found that this specific predicate
already exists under the name CastInst::isBitCastable.
 
========================================
3. Representing Vector Length at Runtime
========================================

With a scalable vector type defined, we now need a way to represent the runtime
length in IR in order to generate addresses for consecutive vectors in memory
and determine how many elements have been processed in an iteration of a loop.

We have added an experimental `vscale` intrinsic to represent the runtime
multiple. Multiplying the result of this intrinsic by the minimum number of
elements in a vector gives the total number of elements in a scalable vector.

Fixed-Length Code
-----------------

Assuming a vector type of <4 x <ty>>
``
vector.body:
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  ;; <loop body>
  ;; Increment induction var
  %index.next = add i64 %index, 4
  ;; <check and branch>
``
Scalable Equivalent
-------------------

Assuming a vector type of <scalable 4 x <ty>>
``
vector.body:
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  ;; <loop body>
  ;; Increment induction var
  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()

I didn't see anything about this in the text (apologies if I missed
something), but it appears this intrinsic is overloaded to be able to return
any integer width. It's not a big deal either way, but is there a particular
reason for doing that, rather than picking one sufficiently large integer type
and combining it with trunc/zext as appropriate?
 
  %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
 
Just to check, is the nesting `add i64 %index, mul (i64 %vscale64, i64 4)` a
pseudo-IR shorthand or an artifact from when vscale was proposed as a constant
expression or something? I would have expected:

```
%vscale64 = call i64 @llvm.experimental.vector.vscale.64()
%vscale64.x4 = mul i64 %vscale64, 4
%index.next = add i64 %index, %vscale64.x4
```
  ;; <check and branch>
``
===========================
4. Generating Vector Values
===========================
For constant vector values, we cannot specify all the elements as we can for
fixed-length vectors; fortunately only a small number of easily synthesized
patterns are required for autovectorization. The `zeroinitializer` constant
can be used in the same manner as fixed-length vectors for a constant zero
splat. This can then be combined with `insertelement` and `shufflevector`
to create arbitrary value splats in the same manner as fixed-length vectors.

For constants consisting of a sequence of values, an experimental `stepvector`
intrinsic has been added to represent a simple constant of the form
`<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new
start can be added, and changing the step requires multiplying by a splat.

+1 for making this intrinsic minimal and having canonical IR instruction
sequences for things like stride and starting offset.
 
Fixed-Length Code
-----------------
``
  ;; Splat a value
  %insert = insertelement <4 x i32> undef, i32 %value, i32 0
  %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x i32> zeroinitializer
  ;; Add a constant sequence
  %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
``
Scalable Equivalent
-------------------
``
  ;; Splat a value
  %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
  %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
  ;; Splat offset + stride (the same in this case)
  %insert2 = insertelement <scalable 4 x i32> undef, i32 2, i32 0
  %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
  ;; Create sequence for scalable vector
  %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
  %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
  %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
  ;; Add the runtime-generated sequence
  %add = add <scalable 4 x i32> %splat, %addoffset
``
Future Work
-----------

Intrinsics cannot currently be used for constant folding. Our downstream
compiler (using Constants instead of intrinsics) relies quite heavily on this
for good code generation, so we will need to find new ways to recognize and
fold these values.

==================
5. Code Generation
==================

IR splats will be converted to an experimental splatvector intrinsic in
SelectionDAGBuilder.

All three intrinsics are custom lowered and legalized in the AArch64 backend.

Two new AArch64ISD nodes have been added to represent the same concepts
at the SelectionDAG level, while splatvector maps onto the existing
AArch64ISD::DUP.

GlobalISel
----------

Since GlobalISel was enabled by default on AArch64, it was necessary to add
scalable vector support to the LowLevelType implementation. A single bit was
added to the raw_data representation for vectors and vectors of pointers.

In addition, types that only exist in destination patterns are planted in
the enumeration of available types for generated code. While this may not be
necessary in future, generating an all-true 'ptrue' value was necessary to
convert a predicated instruction into an unpredicated one.

==========
6. Example
==========

The following example shows a simple C loop which assigns the array index to
the array elements matching that index. The IR shows how vscale and stepvector
are used to create the needed values and to advance the index variable in the
loop.

C Code
------

``
void IdentityArrayInit(int *a, int count) {
  for (int i = 0; i < count; ++i)
    a[i] = i;
}
``

Scalable IR Vector Body
-----------------------

``
vector.body.preheader:
  ;; Other setup
  ;; Stepvector used to create initial identity vector
  %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
  br label %vector.body

vector.body:
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]

           ;; stepvector used for index identity on entry to loop body ;;
  %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],
                                     [ %stepvector, %vector.body.preheader ]
  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
  %vscale32 = trunc i64 %vscale64 to i32
  %1 = add i64 %0, mul (i64 %vscale64, i64 4)

           ;; vscale splat used to increment identity vector ;;
  %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32 %vscale32, i32 4), i32 0
  %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
  %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
  %2 = getelementptr inbounds i32, i32* %a, i64 %0
  %3 = bitcast i32* %2 to <scalable 4 x i32>*
  store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align 4

           ;; vscale used to increment loop index
  %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
  %4 = icmp eq i64 %index.next, %n.vec
  br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
``

==========
7. Patches
==========

List of patches:

1. Extend VectorType: https://reviews.llvm.org/D32530
2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768
3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
5. SVE Calling Convention: https://reviews.llvm.org/D47771
6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
7. Add VScale intrinsic: https://reviews.llvm.org/D47773
8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
10. Initial store patterns: https://reviews.llvm.org/D47776
11. Initial addition patterns: https://reviews.llvm.org/D47777
12. Initial left-shift patterns: https://reviews.llvm.org/D47778
13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
14. Prevectorized loop unit test: https://reviews.llvm.org/D47780



Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Tim Northover via llvm-dev
In reply to this post by Tim Northover via llvm-dev
Hi Graham,
Hi David,

glad to hear other people are thinking about RVV codegen!

On 7 June 2018 at 18:10, Graham Hunter <[hidden email]> wrote:

>
> Hi,
>
> > On 6 Jun 2018, at 17:36, David A. Greene <[hidden email]> wrote:
> >
> > Graham Hunter via llvm-dev <[hidden email]> writes:
> >>> Is there anything about vscale or a scalable vector that requires a
> >>> minimum bit width?  For example, is this legal?
> >>>
> >>> <scalable 1 x double>
> >>>
> >>> I know it won't map to an SVE type.  I'm simply curious because
> >>> traditionally Cray machines defined vectors in terms of
> >>> machine-dependent "maxvl" with an element type, so with the above vscale
> >>> would == maxvl.  Not that we make any such things anymore.  But maybe
> >>> someone else does?
> >>
> >> That's legal in IR, yes, and we believe it should be usable to represent the vectors for
> >> RISC-V's 'V' extension. The main problem there is that they have a dynamic vector
> >> length within the loop so that they can perform the last iterations of a loop within vector
> >> registers when there's less than a full register worth of data remaining. SVE uses
> >> predication (masking) to achieve the same effect.


Yes, <scalable 1 x T> should be allowed in the IR type system (even <1
x T> is currently allowed, and unlike the scalable variant that's not
even useful), and such types would be the sole legal vector types in the
RISC-V backend.

>
> >> For the 'V' extension, vscale would indeed correspond to 'maxvl', and I'm hoping that a
> >> 'setvl' intrinsic that provides a predicate will avoid the need for modelling a change in
> >> dynamic vector length -- reducing the vector length is effectively equivalent to an implied
> >> predicate on all operations. This avoids needing to add a token operand to all existing
> >> instructions that work on vector types.

Yes, vscale would be the *maximum* vector length (how many elements
fit into each register), not the *active* vector length (how many
elements are operated on in the current loop iteration).

This has nothing to do with tokens, though. The tokens I proposed were
to encode the fact that even 'maxvl' varies on a function by function
basis. This RFC approaches the same issue differently, but it's still
there -- in terms of this RFC, operations on scalable vectors depend
on `vscale`, which is "not necessarily [constant] across functions".
That implies, for example, that an unmasked <scalable 4 x i32> load or
store (which accesses vscale * 16 bytes of memory) can't generally be
moved from one function to another unless it's somehow ensured that
both functions will have the same vscale. For that matter, the same
restriction applies to calls to `vscale` itself.

The evolution of the active vector length is a separate problem and
one that doesn't really impact the IR type system (nor one that can
easily be solved by tokens).

> >
> > Right.  In that way the RISC V method is very much like what the old
> > Cray machines did with the Vector Length register.
> >
> > So in LLVM IR you would have "setvl" return a predicate and then apply
> > that predicate to operations using the current select method?  How does
> > instruction selection map that back onto a simple setvl + unpredicated
> > vector instructions?
> >
> > For conditional code both vector length and masking must be taken into
> > account.  If "setvl" returns a predicate then that predicate would have
> > to be combined in some way with the conditional predicate (typically via
> > an AND operation in an IR that directly supports predicates).  Since
> > LLVM IR doesn't have predicates _per_se_, would it turn into nested
> > selects or something?  Untangling that in instruction selection seems
> > difficult but perhaps I'm missing something.
>
> My idea is for the RISC-V backend to recognize when a setvl intrinsic has
> been used, and replace the use of its value in AND operations with an
> all-true value (with constant folding to remove unnecessary ANDs) then
> replace any masked instructions (generally loads, stores, anything else
> that might generate an exception or modify state that it shouldn't) with
> target-specific nodes that understand the dynamic vlen.

I am not quite so sure about turning the active vector length into
just another mask. It's true that the effects on arithmetic, load,
stores, etc. are the same as if everything executed under a mask like
<1, 1, ..., 1, 0, 0, ..., 0> with the number of ones equal to the
active vector length. However, actually materializing the masks in the
IR means the RISCV backend has to reverse-engineer what it must do
with the vl register for any given (masked or unmasked) vector
operation. The stakes for that are rather high, because (1) it applies
to pretty much every single vector operation ever, and (2) when it
fails, the codegen impact is incredibly bad.

(1) The vl register affects not only loads, stores and other
operations with side effects, but all vector instructions, even pure
computation (and reg-reg moves, but that's not relevant for IR). An
integer vector add, for example, only computes src1[i] + src2[i] for 0
<= i < vl and the remaining elements of the destination register (from
vl upwards) are zeroed. This is very natural for strip-mined loops
(you'll never need those elements), but it means an unmasked IR level
vector add is a really bad fit for the RISC-V 'vadd' instruction.
Unless the backend can prove that only the first vl elements of the
result will ever be observed, it will have to temporarily set vl to
MAXVL so that the RVV instruction will actually compute the "full"
result. Establishing that seems like it will require at least some
custom data flow analysis, and it's unclear how robust it can be made.

(2) Failing to properly use vl for some vector operation is worse than
e.g. materializing a mask you wouldn't otherwise need. It requires
that too (if the operation is masked), but more importantly it needs
to save vl, change it to MAXVL, and finally restore the old value.
That's quite expensive: besides the ca. 3 extra instructions and the
scratch GPR required, this save/restore dance can have other nasty
effects depending on uarch style. I'd have to consult the hardware
people to be sure, but from my understanding risks include pipeline
stalls and expensive roundtrips between decoupled vector and scalar
units.

To be clear: I have not yet experimented with any of this, so I'm not
saying this is a deal breaker. A well-engineered "demanded elements"
analysis may very well be good enough in practice. But since we
broached the subject, I wanted to mention this challenge. (I'm
currently side stepping it by not using built-in vector instructions
but instead intrinsics that treat vl as magic extra state.)

> This could be part of lowering, or maybe a separate IR pass, rather than ISel.
> I *think* this will work, but if someone can come up with some IR where it
> wouldn't work then please let me know (e.g. global-state-changing instructions
> that could move out of blocks where one setvl predicate is used and into one
> where another is used).

There are some operations that use vl for things other than simple
masking. To give one example, "speculative" loads (which silence
some exceptions in order to safely permit vectorization of some loops with
data-dependent exits, such as strlen) can shrink vl as a side effect.
I believe this can be handled by modelling all relevant operations
(including setvl itself) as intrinsics that have side effects or
read/write inaccessible memory. However, if you want to have the
"current" vl (or equivalent mask) around as SSA value, you need to
"reload" it after any operation that updates vl. That seems like it
could get a bit complex if you want to do it efficiently (in the
limit, it seems equivalent to SSA construction).

>
>
> Unfortunately, I can't find a description of the instructions included in
> the 'V' extension in the online manual (other than setvl or configuring
> registers), so I can't tell if there's something I'm missing.

I'm very sorry for that, I know how frustrating it can be. I hope the
above gives a clearer picture of the constraints involved. Exact
instructions, let alone encodings, are still in flux as Bruce said.


Cheers,
Robin

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Tim Northover via llvm-dev
In reply to this post by Tim Northover via llvm-dev
Hi Robin,

Thanks for the comments; replies inline (except for the stack regions question;
Sander is handling that side of things).

-Graham

On 8 Jun 2018, at 16:24, Robin Kruppe <[hidden email]> wrote:

Hi Graham,

First of all, thanks a lot for updating the RFC and also for putting up the
patches, they are quite interesting and illuminate some details I was curious
about. I have some minor questions and comments inline below but overall I
believe this is both a reasonably small extension of LLVM IR and powerful
enough to support SVE, RVV, and hopefully future ISAs with variable length
vectors. Details may change as we gather more experience, but it's a very
good starting point.

One thing I am missing is a discussion of how stack frame layout will be
handled. Earlier RFCs mentioned a concept called "Stack Regions" but (IIRC)
gave little details and it was a long time ago anyway. What are your current
plans here?

I'll reply separately to the sub-thread about RISC-V codegen.

Cheers,
Robin

On 5 June 2018 at 15:15, Graham Hunter <[hidden email]> wrote:
Hi,

Now that Sander has committed enough MC support for SVE, here's an updated
RFC for variable length vector support with a set of 14 patches (listed at the end)
to demonstrate code generation for SVE using the extensions proposed in the RFC.

I have some ideas about how to support RISC-V's upcoming extension alongside
SVE; I'll send an email with some additional comments on Robin's RFC later.

Feedback and questions welcome.

-Graham

=============================================================
Supporting SIMD instruction sets with variable vector lengths
=============================================================

In this RFC we propose extending LLVM IR to support code-generation for variable
length vector architectures like Arm's SVE or RISC-V's 'V' extension. Our
approach is backwards compatible and should be as non-intrusive as possible; the
only change needed in other backends is how size is queried on vector types, and
it only requires a change in which function is called. We have created a set of
proof-of-concept patches to represent a simple vectorized loop in IR and
generate SVE instructions from that IR. These patches (listed in section 7 of
this RFC) can be found on Phabricator and are intended to illustrate the scope
of changes required by the general approach described in this RFC.

==========
Background
==========

*ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension for
AArch64 which is intended to scale with hardware such that the same binary
running on a processor with longer vector registers can take advantage of the
increased compute power without recompilation.

As the vector length is no longer a compile-time known value, the way in which
the LLVM vectorizer generates code requires modifications such that certain
values are now runtime evaluated expressions instead of compile-time constants.

Documentation for SVE can be found at
https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a

========
Contents
========

The rest of this RFC covers the following topics:

1. Types -- a proposal to extend VectorType to be able to represent vectors that
   have a length which is a runtime-determined multiple of a known base length.

2. Size Queries -- how to reason about the size of types for which the size isn't
   fully known at compile time.

3. Representing the runtime multiple of vector length in IR for use in address
   calculations and induction variable comparisons.

4. Generating 'constant' values in IR for vectors with a runtime-determined
   number of elements.

5. A brief note on code generation of these new operations for AArch64.

6. An example of C code and matching IR using the proposed extensions.

7. A list of patches demonstrating the changes required to emit SVE instructions
   for a loop that has already been vectorized using the extensions described
   in this RFC.

========
1. Types
========

To represent a vector of unknown length a boolean `Scalable` property has been
added to the `VectorType` class, which indicates that the number of elements in
the vector is a runtime-determined integer multiple of the `NumElements` field.
Most code that deals with vectors doesn't need to know the exact length, but
does need to know relative lengths -- e.g. get a vector with the same number of
elements but a different element type, or with half or double the number of
elements.

In order to allow code to transparently support scalable vectors, we introduce
an `ElementCount` class with two members:

- `unsigned Min`: the minimum number of elements.
- `bool Scalable`: is the element count an unknown multiple of `Min`?

For non-scalable vectors (``Scalable=false``) the scale is considered to be
equal to one and thus `Min` represents the exact number of elements in the
vector.

The intent for code working with vectors is to use convenience methods and avoid
directly dealing with the number of elements. If needed, calling
`getElementCount` on a vector type instead of `getVectorNumElements` can be used
to obtain the (potentially scalable) number of elements. Overloaded division and
multiplication operators allow an ElementCount instance to be used in much the
same manner as an integer for most cases.
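As a rough sketch in C++ (only the `Min` and `Scalable` members and the existence of overloaded operators come from the text above; constructor and comparison details are my assumptions):

```cpp
#include <cassert>

// Minimal sketch of the proposed ElementCount class. Only Min, Scalable
// and the overloaded operators are described in the RFC; the rest is an
// illustrative guess at the shape of the API.
struct ElementCount {
  unsigned Min;   // minimum number of elements
  bool Scalable;  // true if the real count is a runtime multiple of Min

  ElementCount(unsigned Min, bool Scalable) : Min(Min), Scalable(Scalable) {}

  // Scaling preserves the Scalable flag, so code can halve or double a
  // vector's element count without knowing the runtime multiple.
  ElementCount operator/(unsigned RHS) const {
    return ElementCount(Min / RHS, Scalable);
  }
  ElementCount operator*(unsigned RHS) const {
    return ElementCount(Min * RHS, Scalable);
  }
  bool operator==(const ElementCount &RHS) const {
    return Min == RHS.Min && Scalable == RHS.Scalable;
  }
};
```

For example, halving the count of a <scalable 4 x i32> yields the count of a <scalable 2 x ...> type: Min halves while the Scalable flag is untouched.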

This mixture of compile-time and runtime quantities allows us to reason about the
relationship between different scalable vector types without knowing their
exact length.

The runtime multiple is not expected to change during program execution for SVE,
but it is possible. The model of scalable vectors presented in this RFC assumes
that the multiple will be constant within a function but not necessarily across
functions. As suggested in the recent RISC-V RFC, a new function attribute to
inherit the multiple across function calls will allow for function calls with
vector arguments/return values and inlining/outlining optimizations.

IR Textual Form
---------------

The textual form for a scalable vector is:

``<scalable <n> x <type>>``

where `type` is the scalar type of each element, `n` is the minimum number of
elements, and the string literal `scalable` indicates that the total number of
elements is an unknown multiple of `n`; `scalable` is just an arbitrary choice
for indicating that the vector is scalable, and could be substituted by another.
For fixed-length vectors, the `scalable` is omitted, so there is no change in
the format for existing vectors.

Scalable vectors with the same `Min` value have the same number of elements, and
the same number of bytes if `Min * sizeof(type)` is the same (assuming they are
used within the same function):

``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same number of
  elements.

``<scalable 4 x i32>`` and ``<scalable 8 x i16>`` have the same number of
  bytes.

IR Bitcode Form
---------------

To serialize scalable vectors to bitcode, a new boolean field is added to the
type record. If the field is not present the type will default to a fixed-length
vector type, preserving backwards compatibility.

Alternatives Considered
-----------------------

We did consider one main alternative -- a dedicated target type, like the
x86_mmx type.

A dedicated target type would either need to extend all existing passes that
work with vectors to recognize the new type, or to duplicate all that code
in order to get reasonable code generation and autovectorization.

This hasn't been done for the x86_mmx type, and so it is only capable of
providing support for C-level intrinsics instead of being used and recognized by
passes inside LLVM.

Although our current solution will need to change some of the code that creates
new VectorTypes, much of that code doesn't need to care about whether the types
are scalable or not -- they can use preexisting methods like
`getHalfElementsVectorType`. If the code is a little more complex,
`ElementCount` structs can be used instead of an `unsigned` value to represent
the number of elements.

===============
2. Size Queries
===============

This is a proposal for how to deal with querying the size of scalable types.
While it has not been implemented in full, the general approach works well
for calculating offsets into structures with scalable types in a modified
version of ComputeValueVTs in our downstream compiler.

Current IR types that have a known size all return a single integer constant.
For scalable types a second integer is needed to indicate the number of bytes
which need to be scaled by the runtime multiple to obtain the actual length.

For primitive types, getPrimitiveSizeInBits will function as it does today,
except that it will no longer return a size for vector types (it will return 0,
as it does for other derived types). The majority of calls to this function are
already for scalar rather than vector types.

For derived types, a function (getSizeExpressionInBits) to return a pair of
integers (one to indicate unscaled bits, the other for bits that need to be
scaled by the runtime multiple) will be added. For backends that do not need to
deal with scalable types, another function (getFixedSizeExpressionInBits) that
only returns unscaled bits will be provided, with a debug assert that the type
isn't scalable.

Similar functionality will be added to DataLayout.

Comparing two of these sizes together is straightforward if only unscaled sizes
are used. Comparisons between scaled sizes are also simple when comparing sizes
within a function (or across functions with the inherit flag mentioned in the
changes to the type), but cannot be compared otherwise. If a mix is present,
then any number of unscaled bits will not be considered to have a greater size
than a smaller number of scaled bits, but a smaller number of unscaled bits
will be considered to have a smaller size than a greater number of scaled bits
(since the runtime multiple is at least one).
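The partial ordering described above could be sketched as follows (illustrative only; `SizeExpr` and `knownLE` are my names, not the proposed API):

```cpp
#include <cstdint>
#include <utility>

// Illustrative {unscaled, scaled} size in bits. The total size is
// Unscaled + vscale * Scaled, with the runtime multiple vscale >= 1.
using SizeExpr = std::pair<uint64_t, uint64_t>; // {Unscaled, Scaled}

// True if LHS <= RHS for every possible vscale >= 1. This is a partial
// order: mixed sizes where neither direction holds are incomparable.
bool knownLE(SizeExpr LHS, SizeExpr RHS) {
  // A linear function Unscaled + v * Scaled stays <= another for all
  // v >= 1 iff its scaled coefficient is no larger and the comparison
  // already holds at the minimum, v == 1.
  return LHS.second <= RHS.second &&
         LHS.first + LHS.second <= RHS.first + RHS.second;
}
```

So, for example, 128 unscaled bits are known to fit in 128 scaled bits (the multiple is at least one), while 256 unscaled bits and 128 scaled bits are incomparable: neither direction holds for every possible multiple.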

You mention it's not fully implemented yet, but perhaps you have some thoughts
on what the APIs for this should look like?

For size comparisons it's concerning that it could silently give misleading
results when operating across function boundaries. One solution could be a
function-specific API like `compareSizesIn(Type *, Type *, Function *)`, but
that extra parameter may turn out to be very viral. OTOH, maybe that's
unavoidable complexity from having type "sizes" vary between functions.

Alternatively, general size queries could return "incomparable" and "only if
in the same function" results in addition to smaller/larger/equal. This might
nudge code towards handling all these possibilities as well as it can.
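Concretely, such a query result might look like this (names hypothetical, not a proposed LLVM API):

```cpp
// Hypothetical result type for a general size query along the lines
// suggested above; the extra states force callers to consider sizes
// that cannot be ordered at compile time.
enum class SizeCompare {
  Smaller,
  Equal,
  Larger,
  SameFunctionOnly, // ordered only if both types live in one function
  Incomparable      // e.g. mixed scaled/unscaled with no known ordering
};
```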

I agree that would be nice to catch invalid comparisons; I've considered a
function that would take two 'Value*'s instead of types so that the parent
could be determined, but I don't think that works in all cases.

Using an optional 'Function*' argument in a size comparison function will
work; I think many of the existing size queries are on scalar values in code
that has already explicitly checked that it's not working on aggregate types.

It may be a better idea to catch misuses when trying to clone instructions
into a different function.


Future Work
-----------

Since we cannot determine the exact size of a scalable vector, the
existing logic for alias detection won't work when multiple accesses
share a common base pointer with different offsets.

However, SVE's predication will mean that a dynamic 'safe' vector length
can be determined at runtime, so after initial support has been added we
can work on vectorizing loops using runtime predication to avoid aliasing
problems.

Alternatives Considered
-----------------------

Marking scalable vectors as unsized doesn't work well, as many parts of
LLVM dealing with loads and stores assert that 'isSized()' returns true
and make use of the size when calculating offsets.

Seconded, I encountered those as well.

We have considered introducing multiple helper functions instead of
using direct size queries, but that doesn't cover all cases. It may
still be a good idea to introduce them to make the purpose in a given
case more obvious, e.g. 'isBitCastableTo(Type*,Type*)'.
 
+1 for clear helpers, but this sentiment is somewhat independent of the
changes to type sizes. For example, `isBitCastableTo` seems appealing to me
not just because it clarifies the intent of the size comparison but also
because it can encapsulate all the cases where types have the same size but
bitcasts still aren't allowed (ptr<->int, aggregates). With a good API for the
{scaled, unscaled} pairs, the size comparison should not be much more complex
than today.

Agreed, it's orthogonal to the scalable vector part.


Aside: I was curious, so I grepped and found that this specific predicate
already exists under the name CastInst::isBitCastable.

So it does; there are a few more cases around the codebase that don't use that function
though. I may create a cleanup patch for them.

Other possibilities might be 'requires[Sign|Zero]Extension', 'requiresTrunc', 'getLargestType',
etc.


 
========================================
3. Representing Vector Length at Runtime
========================================

With a scalable vector type defined, we now need a way to represent the runtime
length in IR in order to generate addresses for consecutive vectors in memory
and determine how many elements have been processed in an iteration of a loop.

We have added an experimental `vscale` intrinsic to represent the runtime
multiple. Multiplying the result of this intrinsic by the minimum number of
elements in a vector gives the total number of elements in a scalable vector.

Fixed-Length Code
-----------------

Assuming a vector type of <4 x <ty>>
``
vector.body:
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  ;; <loop body>
  ;; Increment induction var
  %index.next = add i64 %index, 4
  ;; <check and branch>
``
Scalable Equivalent
-------------------

Assuming a vector type of <scalable 4 x <ty>>
``
vector.body:
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  ;; <loop body>
  ;; Increment induction var
  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()

I didn't see anything about this in the text (apologies if I missed
something), but it appears this intrinsic is overloaded to be able to return
any integer width. It's not a big deal either way, but is there a particular
reason for doing that, rather than picking one sufficiently large integer type
and combining it with trunc/zext as appropriate?

It's a leftover from me converting from the constant, which could be of any
integer type.

  
  %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
 
Just to check, is the nesting `add i64 %index, mul (i64 %vscale64, i64 4)` a
pseudo-IR shorthand or an artifact from when vscale was proposed as a constant
expression or something? I would have expected:

```
%vscale64 = call i64 @llvm.experimental.vector.vscale.64()
%vscale64.x4 = mul i64 %vscale64, 4
%index.next = add i64 %index, %vscale64.x4
```

The latter, though in this case I updated the example IR by hand so didn't catch
that case ;)

The IR in the example patch was generated by the compiler, and does split out the
multiply into a separate instruction (which is then strength-reduced to a shift).


  ;; <check and branch>
``
===========================
4. Generating Vector Values
===========================
For constant vector values, we cannot specify all the elements as we can for
fixed-length vectors; fortunately only a small number of easily synthesized
patterns are required for autovectorization. The `zeroinitializer` constant
can be used in the same manner as fixed-length vectors for a constant zero
splat. This can then be combined with `insertelement` and `shufflevector`
to create arbitrary value splats in the same manner as fixed-length vectors.

For constants consisting of a sequence of values, an experimental `stepvector`
intrinsic has been added to represent a simple constant of the form
`<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new
start can be added, and changing the step requires multiplying by a splat.

+1 for making this intrinsic minimal and having canonical IR instruction
sequences for things like stride and starting offset.
 
Fixed-Length Code
-----------------
``
  ;; Splat a value
  %insert = insertelement <4 x i32> undef, i32 %value, i32 0
  %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x i32> zeroinitializer
  ;; Add a constant sequence
  %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
``
Scalable Equivalent
-------------------
``
  ;; Splat a value
  %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
  %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
  ;; Splat offset + stride (the same in this case)
  %insert2 = insertelement <scalable 4 x i32> undef, i32 2, i32 0
  %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
  ;; Create sequence for scalable vector
  %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
  %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
  %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
  ;; Add the runtime-generated sequence
  %add = add <scalable 4 x i32> %splat, %addoffset
``
Future Work
-----------

Intrinsics cannot currently be used for constant folding. Our downstream
compiler (using Constants instead of intrinsics) relies quite heavily on this
for good code generation, so we will need to find new ways to recognize and
fold these values.

==================
5. Code Generation
==================

IR splats will be converted to an experimental splatvector intrinsic in
SelectionDAGBuilder.

All three intrinsics are custom lowered and legalized in the AArch64 backend.

Two new AArch64ISD nodes have been added to represent the same concepts
at the SelectionDAG level, while splatvector maps onto the existing
AArch64ISD::DUP.

GlobalISel
----------

Since GlobalISel was enabled by default on AArch64, it was necessary to add
scalable vector support to the LowLevelType implementation. A single bit was
added to the raw_data representation for vectors and vectors of pointers.

In addition, types that only exist in destination patterns are added to
the enumeration of available types for generated code. While this may not be
necessary in future, generating an all-true 'ptrue' value was necessary to
convert a predicated instruction into an unpredicated one.

==========
6. Example
==========

The following example shows a simple C loop which assigns the array index to
the array elements matching that index. The IR shows how vscale and stepvector
are used to create the needed values and to advance the index variable in the
loop.

C Code
------

``
void IdentityArrayInit(int *a, int count) {
  for (int i = 0; i < count; ++i)
    a[i] = i;
}
``

Scalable IR Vector Body
-----------------------

``
vector.body.preheader:
  ;; Other setup
  ;; Stepvector used to create initial identity vector
  %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
  br label %vector.body

vector.body:
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]

           ;; stepvector used for index identity on entry to loop body ;;
  %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],
                                     [ %stepvector, %vector.body.preheader ]
  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
  %vscale32 = trunc i64 %vscale64 to i32
  %1 = add i64 %0, mul (i64 %vscale64, i64 4)

           ;; vscale splat used to increment identity vector ;;
  %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32 %vscale32, i32 4), i32 0
  %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
  %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
  %2 = getelementptr inbounds i32, i32* %a, i64 %0
  %3 = bitcast i32* %2 to <scalable 4 x i32>*
  store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align 4

           ;; vscale used to increment loop index
  %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
  %4 = icmp eq i64 %index.next, %n.vec
  br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
``

==========
7. Patches
==========

List of patches:

1. Extend VectorType: https://reviews.llvm.org/D32530
2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768
3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
5. SVE Calling Convention: https://reviews.llvm.org/D47771
6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
7. Add VScale intrinsic: https://reviews.llvm.org/D47773
8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
10. Initial store patterns: https://reviews.llvm.org/D47776
11. Initial addition patterns: https://reviews.llvm.org/D47777
12. Initial left-shift patterns: https://reviews.llvm.org/D47778
13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
14. Prevectorized loop unit test: https://reviews.llvm.org/D47780


_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Tim Northover via llvm-dev
In reply to this post by Tim Northover via llvm-dev
Graham Hunter <[hidden email]> writes:

> ========
> 1. Types
> ========
>
> To represent a vector of unknown length a boolean `Scalable` property has been
> added to the `VectorType` class, which indicates that the number of elements in
> the vector is a runtime-determined integer multiple of the `NumElements` field.
> Most code that deals with vectors doesn't need to know the exact length, but
> does need to know relative lengths -- e.g. get a vector with the same number of
> elements but a different element type, or with half or double the number of
> elements.
>
> In order to allow code to transparently support scalable vectors, we introduce
> an `ElementCount` class with two members:
>
> - `unsigned Min`: the minimum number of elements.
> - `bool Scalable`: is the element count an unknown multiple of `Min`?
>
> For non-scalable vectors (``Scalable=false``) the scale is considered to be
> equal to one and thus `Min` represents the exact number of elements in the
> vector.
>
> The intent for code working with vectors is to use convenience methods and avoid
> directly dealing with the number of elements. If needed, calling
> `getElementCount` on a vector type instead of `getVectorNumElements` can be used
> to obtain the (potentially scalable) number of elements. Overloaded division and
> multiplication operators allow an ElementCount instance to be used in much the
> same manner as an integer for most cases.
>
> This mixture of compile-time and runtime quantities allow us to reason about the
> relationship between different scalable vector types without knowing their
> exact length.

How does this work in practice?  Let's say I populate a vector with a
splat.  Presumably, that gives me a "full length" vector.  Let's say the
type is <scalable 2 x double>.  How do I split the vector and get
something half the width?  What is its type?  How do I split it again
and get something a quarter of the width?  What is its type?  How do I
use half- and quarter-width vectors?  Must I resort to predication?

This split question comes into play for backward compatibility.  How
would one take a scalable vector and pass it into a NEON library?  It is
likely that some math functions, for example, will not have SVE versions
available.

Is there a way to represent "double width" vectors?  In mixed-data-size
loops it is sometimes convenient to reason about double-width vectors
rather than having to split them (to legalize for the target
architecture) and keep track of their parts early on.  I guess the more
fundamental question is about how such loops should be handled.

What do insertelement and extractelement mean for scalable vectors?
Your examples showed insertelement at index zero.  How would I, say,
insertelement into the upper half of the vector?  Or any arbitrary
place?  Does insertelement at index 10 of a <scalable 2 x double> work,
assuming vscale is large enough?  It is sometimes useful to constitute a
vector out of various scalar pieces and insertelement is a convenient
way to do it.

Thanks for your patience.  :)

                          -David

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Tim Northover via llvm-dev
In reply to this post by Tim Northover via llvm-dev
Hi Robin,

responses inline.

-Graham

> On 11 Jun 2018, at 16:47, Robin Kruppe <[hidden email]> wrote:
>
> Hi Graham,
> Hi David,
>
> glad to hear other people are thinking about RVV codegen!
>
> On 7 June 2018 at 18:10, Graham Hunter <[hidden email]> wrote:
>>
>> Hi,
>>
>>> On 6 Jun 2018, at 17:36, David A. Greene <[hidden email]> wrote:
>>>
>>> Graham Hunter via llvm-dev <[hidden email]> writes:
>>>>> Is there anything about vscale or a scalable vector that requires a
>>>>> minimum bit width?  For example, is this legal?
>>>>>
>>>>> <scalable 1 x double>
>>>>>
>>>>> I know it won't map to an SVE type.  I'm simply curious because
>>>>> traditionally Cray machines defined vectors in terms of
>>>>> machine-dependent "maxvl" with an element type, so with the above vscale
>>>>> would == maxvl.  Not that we make any such things anymore.  But maybe
>>>>> someone else does?
>>>>
>>>> That's legal in IR, yes, and we believe it should be usable to represent the vectors for
>>>> RISC-V's 'V' extension. The main problem there is that they have a dynamic vector
>>>> length within the loop so that they can perform the last iterations of a loop within vector
>>>> registers when there's less than a full register worth of data remaining. SVE uses
>>>> predication (masking) to achieve the same effect.
>
>
> Yes, <scalable 1 x T> should be allowed in the IR type system (even <1
> x T> is currently allowed and unlike the scalable variant that's not
> even useful) and it would be the sole legal vector type in the RISCV
> backend.
>
>>
>>>> For the 'V' extension, vscale would indeed correspond to 'maxvl', and I'm hoping that a
>>>> 'setvl' intrinsic that provides a predicate will avoid the need for modelling a change in
>>>> dynamic vector length -- reducing the vector length is effectively equivalent to an implied
>>>> predicate on all operations. This avoids needing to add a token operand to all existing
>>>> instructions that work on vector types.
>
> Yes, vscale would be the *maximum* vector length (how many elements
> fit into each register), not the *active* vector length (how many
> elements are operated on in the current loop iteration).
>
> This has nothing to do with tokens, though. The tokens I proposed were
> to encode the fact that even 'maxvl' varies on a function by function
> basis. This RFC approaches the same issue differently, but it's still
> there -- in terms of this RFC, operations on scalable vectors depend
> on `vscale`, which is "not necessarily [constant] across functions".
> That implies, for example, that an unmasked <scalable 4 x i32> load or
> store (which accesses vscale * 16 bytes of memory) can't generally be
> moved from one function to another unless it's somehow ensured that
> both functions will have the same vscale. For that matter, the same
> restriction applies to calls to `vscale` itself.
>
> The evolution of the active vector length is a separate problem and
> one that doesn't really impact the IR type system (nor one that can
> easily be solved by tokens).

Agreed.

>
>>>
>>> Right.  In that way the RISC V method is very much like what the old
>>> Cray machines did with the Vector Length register.
>>>
>>> So in LLVM IR you would have "setvl" return a predicate and then apply
>>> that predicate to operations using the current select method?  How does
>>> instruction selection map that back onto a simple setvl + unpredicated
>>> vector instructions?
>>>
>>> For conditional code both vector length and masking must be taken into
>>> account.  If "setvl" returns a predicate then that predicate would have
>>> to be combined in some way with the conditional predicate (typically via
>>> an AND operation in an IR that directly supports predicates).  Since
>>> LLVM IR doesn't have predicates _per_se_, would it turn into nested
>>> selects or something?  Untangling that in instruction selection seems
>>> difficult but perhaps I'm missing something.
>>
>> My idea is for the RISC-V backend to recognize when a setvl intrinsic has
>> been used, and replace the use of its value in AND operations with an
>> all-true value (with constant folding to remove unnecessary ANDs) then
>> replace any masked instructions (generally loads, stores, anything else
>> that might generate an exception or modify state that it shouldn't) with
>> target-specific nodes that understand the dynamic vlen.
>
> I am not quite so sure about turning the active vector length into
> just another mask. It's true that the effects on arithmetic, load,
> stores, etc. are the same as if everything executed under a mask like
> <1, 1, ..., 1, 0, 0, ..., 0> with the number of ones equal to the
> active vector length. However, actually materializing the masks in the
> IR means the RISCV backend has to reverse-engineer what it must do
> with the vl register for any given (masked or unmasked) vector
> operation. The stakes for that are rather high, because (1) it applies
> to pretty much every single vector operation ever, and (2) when it
> fails, the codegen impact is incredibly bad.

I can see where the concern comes from; we had problems reconstructing
semantics when experimenting with search loop vectorization and often
had to fall back on default (slow) generic cases.

My main reason for proposing this was to try and ensure that the size was
consistent from the point of view of the query functions we were discussing
in the main thread. If you're fine with all size queries assuming maxvl (so
things like stack slots would always use the current configured maximum
length), then I don't think there's a problem with dropping this part of the
proposal and letting you find a better representation of active length.

> (1) The vl register affects not only loads, stores and other
> operations with side effects, but all vector instructions, even pure
> computation (and reg-reg moves, but that's not relevant for IR). An
> integer vector add, for example, only computes src1[i] + src2[i] for 0
> <= i < vl and the remaining elements of the destination register (from
> vl upwards) are zeroed. This is very natural for strip-mined loops
> (you'll never need those elements), but it means an unmasked IR level
> vector add is a really bad fit for the RISC-V 'vadd' instruction.
> Unless the backend can prove that only the first vl elements of the
> result will ever be observed, it will have to temporarily set vl to
> MAXVL so that the RVV instruction will actually compute the "full"
> result. Establishing that seems like it will require at least some
> custom data flow analysis, and it's unclear how robust it can be made.
>
> (2) Failing to properly use vl for some vector operation is worse than
> e.g. materializing a mask you wouldn't otherwise need. It requires
> that too (if the operation is masked), but more importantly it needs
> to save vl, change it to MAXVL, and finally restore the old value.
> That's quite expensive: besides the ca. 3 extra instructions and the
> scratch GPR required, this save/restore dance can have other nasty
> effects depending on uarch style. I'd have to consult the hardware
> people to be sure, but from my understanding risks include pipeline
> stalls and expensive roundtrips between decoupled vector and scalar
> units.

Ah, I hadn't appreciated you might need to save/restore the VL like that.
I'd worked through a couple of small example loops and it seemed fine,
but hadn't looked at more complicated cases.
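For reference, the 'vadd' semantics described in (1) above can be modelled in scalar code roughly as follows (an illustrative sketch of my reading, not spec text; all names are mine):

```cpp
#include <cstddef>
#include <vector>

// Illustrative scalar model of the RVV 'vadd' behaviour described above:
// lanes below the active vector length VL are computed, and the remaining
// lanes up to MAXVL are zeroed.
std::vector<int> modelVAdd(const std::vector<int> &Src1,
                           const std::vector<int> &Src2,
                           std::size_t VL, std::size_t MaxVL) {
  std::vector<int> Dst(MaxVL, 0); // lanes from VL upwards stay zero
  for (std::size_t I = 0; I < VL; ++I)
    Dst[I] = Src1[I] + Src2[I];
  return Dst;
}
```

This also illustrates why an unmasked IR-level add, which must define all vscale * n lanes of its result, doesn't map directly onto 'vadd' unless vl equals MAXVL.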

> To be clear: I have not yet experimented with any of this, so I'm not
> saying this is a deal breaker. A well-engineered "demanded elements"
> analysis may very well be good enough in practice. But since we
> broached the subject, I wanted to mention this challenge. (I'm
> currently side stepping it by not using built-in vector instructions
> but instead intrinsics that treat vl as magic extra state.)
>
>> This could be part of lowering, or maybe a separate IR pass, rather than ISel.
>> I *think* this will work, but if someone can come up with some IR where it
>> wouldn't work then please let me know (e.g. global-state-changing instructions
>> that could move out of blocks where one setvl predicate is used and into one
>> where another is used).
>
> There are some operations that use vl for things other than simple
> masking. To give one example, "speculative" loads (which silence
> some exceptions to safely permit vectorization of some loops with
> data-dependent exits, such as strlen) can shrink vl as a side effect.
> I believe this can be handled by modelling all relevant operations
> (including setvl itself) as intrinsics that have side effects or
> read/write inaccessible memory. However, if you want to have the
> "current" vl (or equivalent mask) around as SSA value, you need to
> "reload" it after any operation that updates vl. That seems like it
> could get a bit complex if you want to do it efficiently (in the
> limit, it seems equivalent to SSA construction).

OK; it's useful to know that there are more instructions that can change vl
and that you might need to reload it.

SVE achieves the same effect through predication, via the first-faulting/non-faulting
load instructions and the first-fault register (FFR).
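As a rough sketch of that idiom (operand choices are illustrative; see the SVE
reference manual for the exact semantics):

```
        setffr                           // set all FFR bits to active
        ldff1b  { z0.b }, p0/z, [x0, x1] // first-faulting load: a fault on any
                                         // active element after the first clears
                                         // FFR bits instead of trapping
        rdffr   p1.b, p0/z               // p1 = which elements actually loaded
```

The loop then continues under the shrunken predicate p1 rather than under a
shrunken vl.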

I think SVE having 16 predicate registers (vs. 8 for RVV and AVX-512) has led
us to use predication quite widely in our own experiments; I'll look at
non-predicated solutions as well when we expand scalable vectorization
capabilities.

>> Unfortunately, I can't find a description of the instructions included in
>> the 'V' extension in the online manual (other than setvl or configuring
>> registers), so I can't tell if there's something I'm missing.
>
> I'm very sorry for that, I know how frustrating it can be. I hope the
> above gives a clearer picture of the constraints involved. Exact
> instructions, let alone encodings, are still in flux as Bruce said.

Yes, the above information is definitely useful, even if I don't have a complete
picture yet. Thanks.
_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Tim Northover via llvm-dev
Hi,

Just to clarify the part below about changing the proposal, this means that a <scalable n x ty> vector:

* Has a size of vscale * sizeof(ty) when reasoning about sizes within LLVM.
* Potentially has a smaller effective size of active_vscale * sizeof(ty),
  depending on the current processor state at runtime.

For SVE, active_vscale == vscale. For RVV (and potentially others), it can be smaller based
on the VL state generated by setvl or similar.
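A minimal IR sketch of the sizing rule above, using the `<scalable n x ty>`
syntax proposed in this RFC (the syntax LLVM eventually adopted is
`<vscale x n x ty>`, but the sizing rule is the same):

```llvm
; sizeof(<scalable 4 x i32>) == vscale * 16 bytes for in-memory reasoning,
; regardless of any runtime active length
define void @f(<scalable 4 x i32> %v) {
  %slot = alloca <scalable 4 x i32>          ; stack slot sized for the full
                                             ; register: vscale * 16 bytes
  store <scalable 4 x i32> %v, <scalable 4 x i32>* %slot
  ret void
}
```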

-Graham


> I am not quite so sure about turning the active vector length into
> just another mask. It's true that the effects on arithmetic, loads,
> stores, etc. are the same as if everything executed under a mask like
> <1, 1, ..., 1, 0, 0, ..., 0> with the number of ones equal to the
> active vector length. However, actually materializing the masks in the
> IR means the RISCV backend has to reverse-engineer what it must do
> with the vl register for any given (masked or unmasked) vector
> operation. The stakes for that are rather high, because (1) it applies
> to pretty much every single vector operation ever, and (2) when it
> fails, the codegen impact is incredibly bad.

I can see where the concern comes from; we had problems reconstructing
semantics when experimenting with search loop vectorization and often
had to fall back on default (slow) generic cases.

My main reason for proposing this was to try to ensure that the size was
consistent from the point of view of the query functions we were discussing
in the main thread. If you're fine with all size queries assuming maxvl (so
things like stack slots would always use the current configured maximum
length), then I don't think there's a problem with dropping this part of the
proposal and letting you find a better representation of the active length.
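For reference, the "vl as just another mask" encoding being pushed back on
would look something like this hypothetical sketch (using the stepvector
constant proposed in this RFC); the backend would have to pattern-match it
back into a single vl update:

```llvm
; build the mask <1, 1, ..., 1, 0, ..., 0> with %avl leading ones
%ins   = insertelement <scalable 4 x i32> undef, i32 %avl, i32 0
%splat = shufflevector <scalable 4 x i32> %ins, <scalable 4 x i32> undef,
                       <scalable 4 x i32> zeroinitializer
%mask  = icmp ult <scalable 4 x i32> stepvector, %splat
; every vector op in the loop then carries %mask, even "unmasked" ones
```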

