[llvm-dev] [RFC][SVE] Supporting Scalable Vector Architectures in LLVM IR (take 2)


[llvm-dev] [RFC][SVE] Supporting Scalable Vector Architectures in LLVM IR (take 2)

Gerolf Hoflehner via llvm-dev
Hi,

Here's the updated RFC for representing scalable vector types and associated constants in IR. I added a section to address questions that came up on the recent patch review.

-Graham

===================================================
Supporting Scalable Vector Architectures in LLVM IR
===================================================

==========
Background
==========

*ARMv8-A Scalable Vector Extensions* (SVE) is a vector ISA extension for
AArch64, featuring lane predication, scatter-gather, and speculative loads. It
is intended to scale with hardware such that the same binary running on a
processor with longer vector registers can take advantage of the increased
compute power without recompilation.

As the vector length is no longer a statically known value, the way in which the
LLVM vectorizer generates code required modifications such that certain values
are now runtime evaluated expressions. This document describes changes proposed
to LLVM that allow its vectorizer to better target SVE.

Documentation for SVE can be found at
https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a

=====
Types
=====

To represent a vector of unknown length a boolean `Scalable` property has been
added to the `VectorType` class. Most code that deals with vectors doesn't need
to know the exact length, but does need to know relative lengths -- e.g. get
a vector with the same number of elements but a different element type, or with
half or double the number of elements.

In order to allow code to transparently support scalable vectors, we introduce
an `ElementCount` class with two members:

- `unsigned Min`: the minimum number of elements.
- `bool Scalable`: is the element count an unknown multiple of `Min`?

For non-scalable vectors (``Scalable=false``) the scale is considered to be
equal to one and thus `Min` represents the exact number of elements in the
vector.

The intent for code working with vectors is to use convenience methods and avoid
directly dealing with the number of elements. If needed, calling
`getElementCount` on a vector type instead of `getVectorNumElements` can be used
to obtain the (potentially scalable) number of elements. Overloaded division and
multiplication operators allow an ElementCount instance to be used in much the
same manner as an integer for most cases.

This mixture of static and runtime quantities allows us to reason about the
relationship between different scalable vector types without knowing their
exact length.

IR Textual Form
---------------

The textual form for a scalable vector is:

``<[n x ]<m> x <type>>``

where `type` is the scalar type of each element, `m` is the minimum number of
elements, and the string literal `n x` indicates that the total number of
elements is an unknown multiple of `m`; `n` is just an arbitrary choice for
indicating that the vector is scalable, and could be substituted by another.
For fixed-length vectors, the `n x` is omitted, so there is no change in the
format for existing vectors.

Scalable vectors with the same `Min` value have the same number of elements, and
the same number of bytes if `Min * sizeof(type)` is the same:

``<n x 4 x i32>`` and ``<n x 4 x i8>`` have the same number of elements.

``<n x 4 x i32>`` and ``<n x 8 x i16>`` have the same number of bytes.

IR Bitcode Form
---------------

To serialize scalable vectors to bitcode, a new boolean field is added to the
type record. If the field is not present the type will default to a fixed-length
vector type, preserving backwards compatibility.

Alternatives Considered
-----------------------

We had two alternatives in mind -- a dedicated target specific type, and a
subclass inheriting from VectorType.

A dedicated target type would either need to extend all existing passes that
work with vectors to recognize the new type, or to duplicate all that code
in order to get reasonable code generation.

A subclass would take care of part of this, but would still need checks for
whether a VectorType was scalable in each place a new VectorType was required.

Although our current solution will need to change some of the code that creates
new VectorTypes, much of that code doesn't need to care about whether the types
are scalable or not -- they can use preexisting methods like
`getHalfElementsVectorType`. If the code is a little more complex,
`ElementCount` structs can be used instead of an `unsigned` value to represent
the number of elements.

=================
Runtime Constants
=================

With a scalable vector type defined, we now need a way to generate addresses for
consecutive vector values in memory and to be able to create basic constant
vector values.

For address generation, the `vscale` constant is added to represent the runtime
value of `n` in `<n x m x type>`. Multiplying `vscale` by `m` and the number of
bytes in `type` gives the total length of a scalable vector, and the backend
can pattern match to the various load and store instructions in SVE that
automatically scale with vector length.

For constant vector values, we cannot specify all the elements as we can for
fixed-length vectors; fortunately only a small number of easily synthesized
patterns are required for autovectorization. The `zeroinitializer` constant
can be used as with fixed-length vectors for a constant zero splat. This can
then be combined with `insertelement` and `shufflevector` to create arbitrary
value splats in the same manner as fixed-length vectors.

For constants consisting of a sequence of values, the `stepvector` constant is
added to represent a simple constant of the form `<0, 1, 2... num_elems-1>`. To
change the starting value a splat of the new start can be added, and changing
the step requires multiplying by a splat.

Alternatives Considered
-----------------------

We have chosen to represent these as constants to make pattern matching and
constant folding easy, particularly for the mask argument of the
`shufflevector` instruction.

Another possibility is to use intrinsics similar to the old instructions, but
currently these cannot be treated as constants. We wouldn't mind using
intrinsics here, but would want to discuss how best to support constant folding
and pattern matching without needing to add a lot more code.

=======
Example
=======

The following example shows a simple C loop which assigns the array index to
the array elements matching that index. The IR shows how vscale and stepvector
are used to create the needed values and to advance the index variable in the
loop.

C Code
------

``
void IdentityArrayInit(int *a, int count) {
  for (int i = 0; i < count; ++i)
    a[i] = i;
}
``

Scalable IR Vector Body
-----------------------

``
vector.body:                                      ; preds = %vector.body.preheader, %vector.body
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]

           ;; stepvector used for index identity on entry to loop body ;;
  %vec.ind7 = phi <n x 4 x i32> [ %step.add8, %vector.body ],
                                [ stepvector, %vector.body.preheader ]
  %1 = add i64 %0, mul (i64 vscale, i64 4)

           ;; vscale splat used to increment identity vector ;;
  %step.add8 = add <n x 4 x i32> %vec.ind7, shufflevector (<n x 4 x i32> insertelement (<n x 4 x i32> undef,
                                                                                        i32 mul (i32 vscale, i32 4), i32 0),
                                                                                        <n x 4 x i32> undef,
                                                                                        <n x 4 x i32> zeroinitializer)
  %2 = getelementptr inbounds i32, i32* %a, i64 %0
  %3 = bitcast i32* %2 to <n x 4 x i32>*
  store <n x 4 x i32> %vec.ind7, <n x 4 x i32>* %3, align 4, !tbaa !1

           ;; vscale used to increment loop index ;;
  %index.next = add i64 %index, mul (i64 vscale, i64 4)
  %4 = icmp eq i64 %index.next, %n.vec
  br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
``

=====================
Questions and Answers
=====================

Can the vector length change at runtime?
----------------------------------------

It is possible to change vector length at runtime, but there is no model defined
for how to resolve all the issues that arise when doing so. From the compiler's
point of view, it is expected that vector length can be treated as constant but
unknown.

Since there's currently no way to indicate to the compiler where a change might
occur, a programmer deliberately requesting changes to the vector length in the
middle of a function containing autovectorized code would be considered a bug.

If changing the VL at runtime becomes desirable at some point in the future,
the types and constants presented in this design would not need to change; we
would need some way of creating a barrier between sections of IR which might
have a different vscale, but that would be an addition to the current design
instead of a change.

How do we spill/fill scalable registers on the stack?
-----------------------------------------------------

SVE registers have a (partially) unknown size at build time and their associated
fill/spill instructions require an offset that is implicitly scaled by the
vector length instead of bytes or element size. To accommodate this we
created the concept of Stack Regions that are areas on the stack associated
with specific data types or register classes.

MachineFrameInfo has been extended with a list of Stack Regions that each
contain a list of associated objects. Each Stack Region controls its own
allocation, deallocation and internal offset calculations. For SVE we create
separate regions for data and predicate registers, so the exact layout does
not need to be known ahead of time, just relative offsets within regions.

Objects not belonging to a Stack Region use the default handling, so other
existing targets are not affected.

A more complete design for this will be detailed in a dedicated RFC later.
_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] [RFC][SVE] Supporting Scalable Vector Architectures in LLVM IR (take 2)

Gerolf Hoflehner via llvm-dev
Hi Graham,

Just making sure some people who voiced concerns are copied + cfe-dev.

On 1 June 2017 at 15:22, Graham Hunter via llvm-dev
<[hidden email]> wrote:

> Hi,
>
> Here's the updated RFC for representing scalable vector types and associated constants in IR. I added a section to address questions that came up on the recent patch review.
>
> -Graham
>
> ===================================================
> Supporting Scalable Vector Architectures in LLVM IR
> ===================================================
>
> ==========
> Background
> ==========
>
> *ARMv8-A Scalable Vector Extensions* (SVE) is a vector ISA extension for
> AArch64, featuring lane predication, scatter-gather, and speculative loads. It
> is intended to scale with hardware such that the same binary running on a
> processor with longer vector registers can take advantage of the increased
> compute power without recompilation.
>
> As the vector length is no longer a statically known value, the way in which the
> LLVM vectorizer generates code required modifications such that certain values
> are now runtime evaluated expressions. This document describes changes proposed
> to LLVM that allow its vectorizer to better target SVE.
>
> Documentation for SVE can be found at
> https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
>
> =====
> Types
> =====
>
> To represent a vector of unknown length a boolean `Scalable` property has been
> added to the `VectorType` class. Most code that deals with vectors doesn't need
> to know the exact length, but does need to know relative lengths -- e.g. get
> a vector with the same number of elements but a different element type, or with
> half or double the number of elements.
>
> In order to allow code to transparently support scalable vectors, we introduce
> an `ElementCount` class with two members:
>
> - `unsigned Min`: the minimum number of elements.
> - `bool Scalable`: is the element count an unknown multiple of `Min`?
>
> For non-scalable vectors (``Scalable=false``) the scale is considered to be
> equal to one and thus `Min` represents the exact number of elements in the
> vector.
>
> The intent for code working with vectors is to use convenience methods and avoid
> directly dealing with the number of elements. If needed, calling
> `getElementCount` on a vector type instead of `getVectorNumElements` can be used
> to obtain the (potentially scalable) number of elements. Overloaded division and
> multiplication operators allow an ElementCount instance to be used in much the
> same manner as an integer for most cases.

Does that mean that it is guaranteed that a division / multiplication
will only touch the minimum number of elements?

The cases I imagine could happen:

1. <n x 4 x i32> / 2 = <n x 2 x i32>: This is half of the actual vector length.
  - does the predicate vector limit the end tail (half-scale)?
  - does it just pack more into the same vector (and update the
predicate computation)?
  - because it would have half the bytes, I assume the former, but in
theory, we could probably merge and get twice the performance, no?
  - or would this insert undefs in between? Say, a 2-way reduction:
    - <v1, v2, v3, v4, v5, v6, v7, v8> -> <v1+v2, v3+v4, v5+v6, v7+v8>
    - In SIMD, this would use smaller/fewer registers, but SVE is <n
x 128> right?
    - Do you pack the next vector into the previous <1+2, 3+4, 5+6,
7+8, 9+10, ...>?
    - Or do you insert undefs?
      - <1+2, 3+4, 5+6, 7+8, undef, undef, undef, undef>
      - <1+2, undef, 3+4, undef, 5+6, undef, 7+8, undef>
  - does it matter? AFAIK, SVE only has reductions to scalar, but this is IR.
  - can it be target-specific? If not, should it be made illegal?

2. <n x 4 x i32> * 2 = <n x 8 x i32>: This is twice the actual vector length.
  - Would this be resolved by the target? If so, how?
  - Non-scalable vectors are just done one after the other
  - But scalable vectors have no known end-tail:
    - Creating two <n x 4 x i32> may interleave the original data, either:
      - <v1, v2, v3, v4> <v5, v6, v7, v8>, ...
      - <vn+1, vn+2, vn+3, vn+4> <vn+5, vn+6, vn+7, vn+8>...
   - or:
      - <v1, v2, v3, v4> <vn+1, vn+2, vn+3, vn+4> ...
      - <v5, v6, v7, v8> <vn+5, vn+6, vn+7, vn+8>...
  - the first would make more sense, but I'm not sure the scalar
evolution would work that way out of the box

3. Widening / Narrowing should use the existing syntax, I think.


>
> This mixture of static and runtime quantities allows us to reason about the
> relationship between different scalable vector types without knowing their
> exact length.
>
> IR Textual Form
> ---------------
>
> The textual form for a scalable vector is:
>
> ``<[n x ]<m> x <type>>``
>
> where `type` is the scalar type of each element, `m` is the minimum number of
> elements, and the string literal `n x` indicates that the total number of
> elements is an unknown multiple of `m`; `n` is just an arbitrary choice for
> indicating that the vector is scalable, and could be substituted by another.
> For fixed-length vectors, the `n x` is omitted, so there is no change in the
> format for existing vectors.
>
> Scalable vectors with the same `Min` value have the same number of elements, and
> the same number of bytes if `Min * sizeof(type)` is the same:
>
> ``<n x 4 x i32>`` and ``<n x 4 x i8>`` have the same number of elements.
>
> ``<n x 4 x i32>`` and ``<n x 8 x i16>`` have the same number of bytes.
>
> IR Bitcode Form
> ---------------
>
> To serialize scalable vectors to bitcode, a new boolean field is added to the
> type record. If the field is not present the type will default to a fixed-length
> vector type, preserving backwards compatibility.
>
> Alternatives Considered
> -----------------------
>
> We had two alternatives in mind -- a dedicated target specific type, and a
> subclass inheriting from VectorType.
>
> A dedicated target type would either need to extend all existing passes that
> work with vectors to recognize the new type, or to duplicate all that code
> in order to get reasonable code generation.
>
> A subclass would take care of part of this, but would still need checks for
> whether a VectorType was scalable in each place a new VectorType was required.
>
> Although our current solution will need to change some of the code that creates
> new VectorTypes, much of that code doesn't need to care about whether the types
> are scalable or not -- they can use preexisting methods like
> `getHalfElementsVectorType`. If the code is a little more complex,
> `ElementCount` structs can be used instead of an `unsigned` value to represent
> the number of elements.
>
> =================
> Runtime Constants
> =================
>
> With a scalable vector type defined, we now need a way to generate addresses for
> consecutive vector values in memory and to be able to create basic constant
> vector values.
>
> For address generation, the `vscale` constant is added to represent the runtime
> value of `n` in `<n x m x type>`. Multiplying `vscale` by `m` and the number of
> bytes in `type` gives the total length of a scalable vector, and the backend
> can pattern match to the various load and store instructions in SVE that
> automatically scale with vector length.

How would this work in scalar evolution?

In a <4 x i32> loop, I know the block is 128-bits wide, so I can
predict loop carried dependencies, static ranges, etc.

IIUC, dependencies can still be avoided by predicating access to N-1
elements, but short static ranges would have to be kept as a
loop, because unrolling is not beneficial if the run-time length is
larger than the unroll factor.

Actually, I can't think how you could possibly unroll anything with
this. I imagine two run-time checks + two SVE operations (+ predicate
fiddling) would be worse than a short sequence of 4 independent NEON
instructions, if the unroll factor is slightly larger than one
run-time vector.

This example is particular to SVE and pathological cases, but I worry
that there may be lots of cases where the new notation will make it
difficult to decide. Still, it's only a problem for targets that do ask
for SVE, so people are aware of the trade-offs.



> For constant vector values, we cannot specify all the elements as we can for
> fixed-length vectors; fortunately only a small number of easily synthesized
> patterns are required for autovectorization. The `zeroinitializer` constant
> can be used as with fixed-length vectors for a constant zero splat. This can
> then be combined with `insertelement` and `shufflevector` to create arbitrary
> value splats in the same manner as fixed-length vectors.
>
> For constants consisting of a sequence of values, the `stepvector` constant is
> added to represent a simple constant of the form `<0, 1, 2... num_elems-1>`. To
> change the starting value a splat of the new start can be added, and changing
> the step requires multiplying by a splat.
>
> Alternatives Considered
> -----------------------
>
> We have chosen to represent these as constants to make pattern matching and
> constant folding easy, particularly for the mask argument of the
> `shufflevector` instruction.
>
> Another possibility is to use intrinsics similar to the old instructions, but
> currently these cannot be treated as constants. We wouldn't mind using
> intrinsics here, but would want to discuss how best to support constant folding
> and pattern matching without needing to add a lot more code.
>
> =======
> Example
> =======
>
> The following example shows a simple C loop which assigns the array index to
> the array elements matching that index. The IR shows how vscale and stepvector
> are used to create the needed values and to advance the index variable in the
> loop.
>
> C Code
> ------
>
> ``
> void IdentityArrayInit(int *a, int count) {
>   for (int i = 0; i < count; ++i)
>     a[i] = i;
> }
> ``
>
> Scalable IR Vector Body
> -----------------------
>
> ``
> vector.body:                                      ; preds = %vector.body.preheader, %vector.body
>   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>   %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]
>
>            ;; stepvector used for index identity on entry to loop body ;;
>   %vec.ind7 = phi <n x 4 x i32> [ %step.add8, %vector.body ],
>                                 [ stepvector, %vector.body.preheader ]
>   %1 = add i64 %0, mul (i64 vscale, i64 4)
>
>            ;; vscale splat used to increment identity vector ;;
>   %step.add8 = add <n x 4 x i32> %vec.ind7, shufflevector (<n x 4 x i32> insertelement (<n x 4 x i32> undef,
>                                                                                         i32 mul (i32 vscale, i32 4), i32 0),
>                                                                                         <n x 4 x i32> undef,
>                                                                                         <n x 4 x i32> zeroinitializer)

This syntax really gets me. Compare this with:

  %a = add <4 x i32> %b, <i32 4, i32 4, i32 4, i32 4>

It's going to be hard to infer anything from that syntax, which means
the resulting loop will be, for most purposes, as opaque as having an
intrinsic in there.


>   %2 = getelementptr inbounds i32, i32* %a, i64 %0
>   %3 = bitcast i32* %2 to <n x 4 x i32>*
>   store <n x 4 x i32> %vec.ind7, <n x 4 x i32>* %3, align 4, !tbaa !1
>
>            ;; vscale used to increment loop index
>   %index.next = add i64 %index, mul (i64 vscale, i64 4)
>   %4 = icmp eq i64 %index.next, %n.vec
>   br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
> ``
>
> =====================
> Questions and Answers
> =====================
>
> Can the vector length change at runtime?
> ----------------------------------------
>
> It is possible to change vector length at runtime, but there is no model defined
> for how to resolve all the issues that arise when doing so. From the compiler's
> point of view, it is expected that vector length can be treated as constant but
> unknown.
>
> Since there's currently no way to indicate to the compiler where a change might
> occur, a programmer deliberately requesting changes to the vector length in the
> middle of a function containing autovectorized code would be considered a bug.

User error. Yes.


> If changing the VL at runtime becomes desirable at some point in the future,
> the types and constants presented in this design would not need to change; we
> would need some way of creating a barrier between sections of IR which might
> have a different vscale, but that would be an addition to the current design
> instead of a change.

No changes to IR, AFAIK. Just on how we use it.


> How do we spill/fill scalable registers on the stack?
> -----------------------------------------------------
>
> SVE registers have a (partially) unknown size at build time and their associated
> fill/spill instructions require an offset that is implicitly scaled by the
> vector length instead of bytes or element size. To accommodate this we
> created the concept of Stack Regions that are areas on the stack associated
> with specific data types or register classes.
>
> MachineFrameInfo has been extended with a list of Stack Regions that each
> contain a list of associated objects. Each Stack Region controls its own
> allocation, deallocation and internal offset calculations. For SVE we create
> separate regions for data and predicate registers, so the exact layout does
> not need to be known ahead of time, just relative offsets within regions.

Can this be mapped into IR address spaces?

If some target makes use of changing VL as you describe above, this
will break catastrophically, I imagine.


> Objects not belonging to a Stack Region use the default handling, so other
> existing targets are not affected.
>
> A more complete design for this will be detailed in a dedicated RFC later.

I think this was a contentious enough point that we might need a peek
into that RFC.

Targets that don't have SVE will not suffer from changes in
MachineFrameInfo, but AArch64 may, and I'm curious how that's
solved.

cheers,
--renato

Re: [llvm-dev] [RFC][SVE] Supporting Scalable Vector Architectures in LLVM IR (take 2)

Gerolf Hoflehner via llvm-dev
Hi Renato,

Thanks for taking a look. Answers inline below, let me know if I've missed something out.

-Graham


> On 5 Jun 2017, at 17:55, Renato Golin <[hidden email]> wrote:
>
> Hi Graham,
>
> Just making sure some people who voiced concerns are copied + cfe-dev.
>
> On 1 June 2017 at 15:22, Graham Hunter via llvm-dev
> <[hidden email]> wrote:
>> Hi,
>>
>> Here's the updated RFC for representing scalable vector types and associated constants in IR. I added a section to address questions that came up on the recent patch review.
>>
>> -Graham
>>
>> ===================================================
>> Supporting Scalable Vector Architectures in LLVM IR
>> ===================================================
>>
>> ==========
>> Background
>> ==========
>>
>> *ARMv8-A Scalable Vector Extensions* (SVE) is a vector ISA extension for
>> AArch64, featuring lane predication, scatter-gather, and speculative loads. It
>> is intended to scale with hardware such that the same binary running on a
>> processor with longer vector registers can take advantage of the increased
>> compute power without recompilation.
>>
>> As the vector length is no longer a statically known value, the way in which the
>> LLVM vectorizer generates code required modifications such that certain values
>> are now runtime evaluated expressions. This document describes changes proposed
>> to LLVM that allow its vectorizer to better target SVE.
>>
>> Documentation for SVE can be found at
>> https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
>>
>> =====
>> Types
>> =====
>>
>> To represent a vector of unknown length a boolean `Scalable` property has been
>> added to the `VectorType` class. Most code that deals with vectors doesn't need
>> to know the exact length, but does need to know relative lengths -- e.g. get
>> a vector with the same number of elements but a different element type, or with
>> half or double the number of elements.
>>
>> In order to allow code to transparently support scalable vectors, we introduce
>> an `ElementCount` class with two members:
>>
>> - `unsigned Min`: the minimum number of elements.
>> - `bool Scalable`: is the element count an unknown multiple of `Min`?
>>
>> For non-scalable vectors (``Scalable=false``) the scale is considered to be
>> equal to one and thus `Min` represents the exact number of elements in the
>> vector.
>>
>> The intent for code working with vectors is to use convenience methods and avoid
>> directly dealing with the number of elements. If needed, calling
>> `getElementCount` on a vector type instead of `getVectorNumElements` can be used
>> to obtain the (potentially scalable) number of elements. Overloaded division and
>> multiplication operators allow an ElementCount instance to be used in much the
>> same manner as an integer for most cases.
>
> Does that mean that it is guaranteed that a division / multiplication
> will only touch the minimum number of elements?
>
> The cases I imagine could happen:
>
> 1. <n x 4 x i32> / 2 = <n x 2 x i32>: This is half of the actual vector length.
>  - does the predicate vector limit the end tail (half-scale)?
>  - does it just pack more into the same vector (and update the
> predicate computation)?
>  - because it would have half the bytes, I assume the former, but in
> theory, we could probably merge and get twice the performance, no?
>  - or would this insert undefs in between? Say, a 2-way reduction:
>    - <v1, v2, v3, v4, v5, v6, v7, v8> -> <v1+v2, v3+v4, v5+v6, v7+v8>
>    - In SIMD, this would use a smaller/less registers, but SVE is <n
> x 128> right?
>    - Do you pack the next vector into the previous <1+2, 3+4, 5+6,
> 7+8, 9+10, ...>?
>    - Or do you insert undefs?
>      - <1+2, 3+4, 5+6, 7+8, undef, undef, undef, undef>
>      - <1+2, undef, 3+4, undef, 5+6, undef, 7+8, undef>
>  - does it matter? AFAIK, SVE only has reductions to scalar, but this is IR.
>  - can it be target-specific? If not, should it be made illegal?
>
> 2. <n x 4 x i32> * 2 = <n x 8 x i32>: This is twice of the actual vector length.
>  - Would this be resolved by the target? If so, how?
>  - Non-scalable vectors are just done one after the other
>  - But scalable vectors have no known end-tail:
>    - Creating two <n x 4 x i32> may interpolate the original data, either:
>      - <v1, v2, v3, v4> <v5, v6, v7, v8>, ...
>      - <vn+1, vn+2, vn+3, vn+4> <vn+5, vn+6, vn+7, vn+8>...
>   - or:
>      - <v1, v2, v3, v4> <vn+1, vn+2, vn+3, vn+4> ...
>      - <v5, v6, v7, v8> <vn+5, vn+6, vn+7, vn+8>...
>  - the first would make more sense, but I'm not sure the scalar
> evolution would work that way out of the box
>
> 3. Widening / Narrowing should use the existing syntax, I think.

Yes, it will only operate on the correct number of elements.

So SelectionDAG legalization of target-unsupported IR types happens in
exactly the same way it does for fixed-length types, e.g.
 * The <n x 8 x i32> above would be split into two <n x 4 x i32> values.
 * The <n x 2 x i32> above would be extended to an <n x 2 x i64> value.

When splitting, the linear order of elements would be preserved, so given a
vector like <1, 2, 3, ... 2n> you would end up with
 <1, 2, ... n> and <n+1, n+2, ... 2n>.
If you need a different arrangement, the shufflevector instruction can be used
with appropriate use of stepvector and vscale to generate masks.

The IR predicates for a given scalable vector (for select) just match the
number of elements, e.g. <n x 8 x i1> for the <n x 8 x i32>, and would be
split along with it during legalization.

As you say, the main legal types for SVE are based around multiples of
128b base types, but for some types we've had to use predication to help.
An <n x 2 x f32> can't be extended to use f64s, but when generating code
we can create a predicate for 64b element types and use that with 32b float
instructions, so that the vector interleaves f32 values with 32b undefs
and gives you an effective <n x 2 x f32> vector.

For horizontal pair operations, you would use shuffle instructions to align
the operands in matching lanes for two vectors and then perform the operation.

>
>
>>
>> This mixture of static and runtime quantities allows us to reason about the
>> relationship between different scalable vector types without knowing their
>> exact length.
>>
>> IR Textual Form
>> ---------------
>>
>> The textual form for a scalable vector is:
>>
>> ``<[n x ]<m> x <type>>``
>>
>> where `type` is the scalar type of each element, `m` is the minimum number of
>> elements, and the string literal `n x` indicates that the total number of
>> elements is an unknown multiple of `m`; `n` is just an arbitrary choice for
>> indicating that the vector is scalable, and could be substituted by another.
>> For fixed-length vectors, the `n x` is omitted, so there is no change in the
>> format for existing vectors.
>>
>> Scalable vectors with the same `Min` value have the same number of elements, and
>> the same number of bytes if `Min * sizeof(type)` is the same:
>>
>> ``<n x 4 x i32>`` and ``<n x 4 x i8>`` have the same number of elements.
>>
>> ``<n x 4 x i32>`` and ``<n x 8 x i16>`` have the same number of bytes.
>>
>> IR Bitcode Form
>> ---------------
>>
>> To serialize scalable vectors to bitcode, a new boolean field is added to the
>> type record. If the field is not present the type will default to a fixed-length
>> vector type, preserving backwards compatibility.
>>
>> Alternatives Considered
>> -----------------------
>>
>> We had two alternatives in mind -- a dedicated target specific type, and a
>> subclass inheriting from VectorType.
>>
>> A dedicated target type would either need to extend all existing passes that
>> work with vectors to recognize the new type, or to duplicate all that code
>> in order to get reasonable code generation.
>>
>> A subclass would take care of part of this, but would need still checks for
>> whether a VectorType was scalable in each place a new VectorType was required.
>>
>> Although our current solution will need to change some of the code that creates
>> new VectorTypes, much of that code doesn't need to care about whether the types
>> are scalable or not -- they can use preexisting methods like
>> `getHalfElementsVectorType`. If the code is a little more complex,
>> `ElementCount` structs can be used instead of an `unsigned` value to represent
>> the number of elements.
>>
>> =================
>> Runtime Constants
>> =================
>>
>> With a scalable vector type defined, we now need a way to generate addresses for
>> consecutive vector values in memory and to be able to create basic constant
>> vector values.
>>
>> For address generation, the `vscale` constant is added to represent the runtime
>> value of `n` in `<n x m x type>`. Multiplying `vscale` by `m` and the number of
>> bytes in `type` gives the total length of a scalable vector, and the backend
>> can pattern match to the various load and store instructions in SVE that
>> automatically scale with vector length.
>
> How would this work in scalar evolution?
>
> In a <4 x i32> loop, I know the block is 128-bits wide, so I can
> predict loop carried dependencies, static ranges, etc.
>
> IIUC, dependencies can still be avoided by predicating access to N-1
> elements, but would short static ranges will have to be kept as a
> loop, because unrolling is not beneficial if the run-time length is
> larger than the unroll factor.
>
> Actually, I can't think how you could possibly unroll anything with
> this. I imagine two run-time checks + two SVE operations (+ predicate
> fiddling) would be worse than a short sequence of 4 independent NEON
> instructions, if the unroll factor is slightly larger than one
> run-time vector.
>
> This example is particular to SVE and pathological cases, but I worry
> that there may be lots of cases where the new notation will make it
> difficult to decide. Still, only a problem to targets that do ask for
> SVE, so people are aware of the trade-offs.

There's many more cases for SVE where vectors 'may' overlap, since we don't
have an exact size; for now, comparing size expressions featuring vscale
will be overly cautious and generate very large ranges. This will cause
extra checks to be planted for vectorized code that may not be needed,
but will get us running vectorized code safely.

Eventually, we'd like to upstream our work on controlling loops via
predication, but that is a separate discussion (and one that will affect
more people, since there's at least AVX-512 with a similar feature). For
now we'd just like to get basic SVE support in.

Unrolling and vectorizing does indeed have higher overhead when using VLA
programming, so the cost model may need to be adjusted to account for that.
We made sure it works for our downstream compiler but haven't used it for
performance modelling.


>
>
>
>> For constant vector values, we cannot specify all the elements as we can for
>> fixed-length vectors; fortunately only a small number of easily synthesized
>> patterns are required for autovectorization. The `zeroinitializer` constant
>> can be used as with fixed-length vectors for a constant zero splat. This can
>> then be combined with `insertelement` and `shufflevector` to create arbitrary
>> value splats in the same manner as fixed-length vectors.
>>
>> For constants consisting of a sequence of values, the `stepvector` constant is
>> added to represent a simple constant of the form `<0, 1, 2... num_elems-1>`. To
>> change the starting value a splat of the new start can be added, and changing
>> the step requires multiplying by a splat.
>>
>> Alternatives Considered
>> -----------------------
>>
>> We have chosen to represent these as constants to make pattern matching and
>> constant folding easy, particularly for the mask argument of the
>> `shufflevector` instruction.
>>
>> Another possibility is to use intrinsics similar to the old instructions, but
>> currently these cannot be treated as constants. We wouldn't mind using
>> intrinsics here, but would want to discuss how best to support constant folding
>> and pattern matching without needing to add a lot more code.
>>
>> =======
>> Example
>> =======
>>
>> The following example shows a simple C loop which assigns the array index to
>> the array elements matching that index. The IR shows how vscale and stepvector
>> are used to create the needed values and to advance the index variable in the
>> loop.
>>
>> C Code
>> ------
>>
>> ``
>> void IdentityArrayInit(int *a, int count) {
>>  for (int i = 0; i < count; ++i)
>>    a[i] = i;
>> }
>> ``
>>
>> Scalable IR Vector Body
>> -----------------------
>>
>> ``
>> vector.body:                                      ; preds = %vector.body.preheader, %vector.body
>>  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>>  %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]
>>
>>           ;; stepvector used for index identity on entry to loop body ;;
>>  %vec.ind7 = phi <n x 4 x i32> [ %step.add8, %vector.body ],
>>                                [ stepvector, %vector.body.preheader ]
>>  %1 = add i64 %0, mul (i64 vscale, i64 4)
>>
>>           ;; vscale splat used to increment identity vector ;;
>>  %step.add8 = add <n x 4 x i32> %vec.ind7, shufflevector (<n x 4 x i32> insertelement (<n x 4 x i32> undef,
>>                                                                                        i32 mul (i32 vscale, i32 4), i32 0),
>>                                                                                        <n x 4 x i32> undef,
>>                                                                                        <n x 4 x i32> zeroinitializer)
>
> This syntax really gets me. Compare this with:
>
>  %a = add <4 x i32> %b, <i32 4, i32 4, i32 4, i32 4>
>
> It's going to be hard to infer anything from that syntax, which means
> the resulting loop will be, for most purposes, as opaque as having an
> intrinsic in there.

It looks nice for fixed-length when the element values are immediate constants.
If instead of '4' it was a value from an argument or loaded from memory, then it
would look similar -- that's how splats are represented in IR, and using an
intrinsic wouldn't change that. Existing optimizations will recognize the
splat in this form.

>
>
>>  %2 = getelementptr inbounds i32, i32* %a, i64 %0
>>  %3 = bitcast i32* %2 to <n x 4 x i32>*
>>  store <n x 4 x i32> %vec.ind7, <n x 4 x i32>* %3, align 4, !tbaa !1
>>
>>           ;; vscale used to increment loop index
>>  %index.next = add i64 %index, mul (i64 vscale, i64 4)
>>  %4 = icmp eq i64 %index.next, %n.vec
>>  br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
>> ``
>>
>> =====================
>> Questions and Answers
>> =====================
>>
>> Can the vector length change at runtime?
>> ----------------------------------------
>>
>> It is possible to change vector length at runtime, but there is no model defined
>> for how to resolve all the issues that arise when doing so. From the compiler's
>> point of view, it is expected that vector length can be treated as constant but
>> unknown.
>>
>> Since there's currently no way to indicate to the compiler where a change might
>> occur, a programmer deliberately requesting changes to the vector length in the
>> middle of a function containing autovectorized code would be considered a bug.
>
> User error. Yes.
>
>
>> If changing the VL at runtime becomes desireable at some point in the future,
>> the types and constants presented in this design would not need to change; we
>> would need some way of creating a barrier between sections of IR which might
>> have a different vscale, but that would be an addition to the current design
>> instead of a change.
>
> No changes to IR, AFAIK. Just on how we use it.
>
>
>> How do we spill/fill scalable registers on the stack?
>> -----------------------------------------------------
>>
>> SVE registers have a (partially) unknown size at build time and their associated
>> fill/spill instructions require an offset that is implicitly scaled by the
>> vector length instead of bytes or element size. To accommodate this we
>> created the concept of Stack Regions that are areas on the stack associated
>> with specific data types or register classes.
>>
>> MachineFrameInfo has been extended with a list of Stack Regions that each
>> contain a list of associated objects. Each Stack Region controls its own
>> allocation, deallocation and internal offset calculations. For SVE we create
>> separate regions for data and predicate registers, so the exact layout does
>> not need to be known ahead of time, just relative offsets within regions.
>
> Can this be mapped into IR address spaces?

I don't think these need to map into separate address spaces; we haven't done
any investigation along those lines as the default single address space works
fine with this.

>
> If some target make use of changing VL as you describe above, this
> will break catastrophically, I imagine.

Yes; there's a great many problems with changing VL at runtime, which is
why we consider it a bug right now. I'm not aware of anyone actually asking
for the functionality to be supported.

>
>
>> Objects not belonging to a Stack Region use the default handling, so other
>> existing targets are not affected.
>>
>> A more complete design for this will be detailed in a dedicated RFC later.
>
> I think this was a contentious enough point that might need a peek
> into that RFC.
>
> Targets that don't have SVE will not suffer from changes in
> MachineFrameInfo, but AArch64 may, and I'm curious on how that's
> solved.

Noted. We'll try and schedule some time to write it up soonish.

_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Reply | Threaded
Open this post in threaded view
|

Re: [llvm-dev] [RFC][SVE] Supporting Scalable Vector Architectures in LLVM IR (take 2)

Gerolf Hoflehner via llvm-dev
Chris, Chandler, ping. Any comments?

On 7 June 2017 at 13:52, Graham Hunter <[hidden email]> wrote:

> Hi Renato,
>
> Thanks for taking a look. Answers inline below, let me know if I've missed something out.
>
> -Graham
>
>
>> On 5 Jun 2017, at 17:55, Renato Golin <[hidden email]> wrote:
>>
>> Hi Graham,
>>
>> Just making sure some people who voiced concerns are copied + cfe-dev.
>>
>> On 1 June 2017 at 15:22, Graham Hunter via llvm-dev
>> <[hidden email]> wrote:
>>> Hi,
>>>
>>> Here's the updated RFC for representing scalable vector types and associated constants in IR. I added a section to address questions that came up on the recent patch review.
>>>
>>> -Graham
>>>
>>> ===================================================
>>> Supporting Scalable Vector Architectures in LLVM IR
>>> ===================================================
>>>
>>> ==========
>>> Background
>>> ==========
>>>
>>> *ARMv8-A Scalable Vector Extensions* (SVE) is a vector ISA extension for
>>> AArch64, featuring lane predication, scatter-gather, and speculative loads. It
>>> is intended to scale with hardware such that the same binary running on a
>>> processor with longer vector registers can take advantage of the increased
>>> compute power without recompilation.
>>>
>>> As the vector length is no longer a statically known value, the way in which the
>>> LLVM vectorizer generates code required modifications such that certain values
>>> are now runtime evaluated expressions. This document describes changes proposed
>>> to LLVM that allow its vectorizer to better target SVE.
>>>
>>> Documentation for SVE can be found at
>>> https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
>>>
>>> =====
>>> Types
>>> =====
>>>
>>> To represent a vector of unknown length a boolean `Scalable` property has been
>>> added to the `VectorType` class. Most code that deals with vectors doesn't need
>>> to know the exact length, but does need to know relative lengths -- e.g. get
>>> a vector with the same number of elements but a different element type, or with
>>> half or double the number of elements.
>>>
>>> In order to allow code to transparently support scalable vectors, we introduce
>>> an `ElementCount` class with two members:
>>>
>>> - `unsigned Min`: the minimum number of elements.
>>> - `bool Scalable`: is the element count an unknown multiple of `Min`?
>>>
>>> For non-scalable vectors (``Scalable=false``) the scale is considered to be
>>> equal to one and thus `Min` represents the exact number of elements in the
>>> vector.
>>>
>>> The intent for code working with vectors is to use convenience methods and avoid
>>> directly dealing with the number of elements. If needed, calling
>>> `getElementCount` on a vector type instead of `getVectorNumElements` can be used
>>> to obtain the (potentially scalable) number of elements. Overloaded division and
>>> multiplication operators allow an ElementCount instance to be used in much the
>>> same manner as an integer for most cases.
>>
>> Does that mean that it is guaranteed that a division / multiplication
>> will only touch the minimum number of elements?
>>
>> The cases I imagine could happen:
>>
>> 1. <n x 4 x i32> / 2 = <n x 2 x i32>: This is half of the actual vector length.
>>  - does the predicate vector limit the end tail (half-scale)?
>>  - does it just pack more into the same vector (and update the
>> predicate computation)?
>>  - because it would have half the bytes, I assume the former, but in
>> theory, we could probably merge and get twice the performance, no?
>>  - or would this insert undefs in between? Say, a 2-way reduction:
>>    - <v1, v2, v3, v4, v5, v6, v7, v8> -> <v1+v2, v3+v4, v5+v6, v7+v8>
>>    - In SIMD, this would use a smaller/less registers, but SVE is <n
>> x 128> right?
>>    - Do you pack the next vector into the previous <1+2, 3+4, 5+6,
>> 7+8, 9+10, ...>?
>>    - Or do you insert undefs?
>>      - <1+2, 3+4, 5+6, 7+8, undef, undef, undef, undef>
>>      - <1+2, undef, 3+4, undef, 5+6, undef, 7+8, undef>
>>  - does it matter? AFAIK, SVE only has reductions to scalar, but this is IR.
>>  - can it be target-specific? If not, should it be made illegal?
>>
>> 2. <n x 4 x i32> * 2 = <n x 8 x i32>: This is twice of the actual vector length.
>>  - Would this be resolved by the target? If so, how?
>>  - Non-scalable vectors are just done one after the other
>>  - But scalable vectors have no known end-tail:
>>    - Creating two <n x 4 x i32> may interpolate the original data, either:
>>      - <v1, v2, v3, v4> <v5, v6, v7, v8>, ...
>>      - <vn+1, vn+2, vn+3, vn+4> <vn+5, vn+6, vn+7, vn+8>...
>>   - or:
>>      - <v1, v2, v3, v4> <vn+1, vn+2, vn+3, vn+4> ...
>>      - <v5, v6, v7, v8> <vn+5, vn+6, vn+7, vn+8>...
>>  - the first would makes more sense, but I'm not sure the scalar
>> evolution would work that way out of the box
>>
>> 3. Widening / Narrowing should use the existing syntax, I think.
>
> Yes, it will only operate on the correct number of elements.
>
> So SelectionDAG legalization of target-unsupported IR types happens in
> exactly the same way it does for fixed-length types, e.g.
>  * The <n x 8 x i32> above would be split into two <n x 4 x i32> values.
>  * The <n x 2 x i32> above would be extended to an <n x 2 x i64> value.
>
> When splitting, the linear order of elements would be preserved, so given a
> vector like <1, 2, 3, ... 2n> you would end up with
>  <1, 2, ... n> and <n+1, n+2, ... 2n>.
> If you need a different arrangement, the shufflevector instruction can be used
> with appropriate use of stepvector and vscale to generate masks.
>
> The IR predicates for a given scalable vector (for select) just match the
> number of elements, e.g. <n x 8 x i1> for the <n x 8 x i32>, and would be
> split along with it during legalization.
>
> As you say, the main legal types for SVE are based around multiples of
> 128b base types, but for some types we've had to use predication to help.
> An <n x 2 x f32> can't be extended to use f64s, but when generating code
> we can create a predicate for 64b element types and use that with 32b float
> instructions, so that the vector interleaves f32 values with 32bit undefs
> and gives you an effective <n x 2 x f32> vector.
>
> For horizontal pair operations, you would use shuffle instructions to align
> the operands in matching lanes for two vectors and then perform the operation.
>>
>>
>>>
>>> This mixture of static and runtime quantities allow us to reason about the
>>> relationship between different scalable vector types without knowing their
>>> exact length.
>>>
>>> IR Textual Form
>>> ---------------
>>>
>>> The textual form for a scalable vector is:
>>>
>>> ``<[n x ]<m> x <type>>``
>>>
>>> where `type` is the scalar type of each element, `m` is the minimum number of
>>> elements, and the string literal `n x` indicates that the total number of
>>> elements is an unknown multiple of `m`; `n` is just an arbitrary choice for
>>> indicating that the vector is scalable, and could be substituted by another.
>>> For fixed-length vectors, the `n x` is omitted, so there is no change in the
>>> format for existing vectors.
>>>
>>> Scalable vectors with the same `Min` value have the same number of elements, and
>>> the same number of bytes if `Min * sizeof(type)` is the same:
>>>
>>> ``<n x 4 x i32>`` and ``<n x 4 x i8>`` have the same number of elements.
>>>
>>> ``<n x 4 x i32>`` and ``<n x 8 x i16>`` have the same number of bytes.
>>>
>>> IR Bitcode Form
>>> ---------------
>>>
>>> To serialize scalable vectors to bitcode, a new boolean field is added to the
>>> type record. If the field is not present the type will default to a fixed-length
>>> vector type, preserving backwards compatibility.
>>>
>>> Alternatives Considered
>>> -----------------------
>>>
>>> We had two alternatives in mind -- a dedicated target specific type, and a
>>> subclass inheriting from VectorType.
>>>
>>> A dedicated target type would either need to extend all existing passes that
>>> work with vectors to recognize the new type, or to duplicate all that code
>>> in order to get reasonable code generation.
>>>
>>> A subclass would take care of part of this, but would need still checks for
>>> whether a VectorType was scalable in each place a new VectorType was required.
>>>
>>> Although our current solution will need to change some of the code that creates
>>> new VectorTypes, much of that code doesn't need to care about whether the types
>>> are scalable or not -- they can use preexisting methods like
>>> `getHalfElementsVectorType`. If the code is a little more complex,
>>> `ElementCount` structs can be used instead of an `unsigned` value to represent
>>> the number of elements.
>>>
>>> =================
>>> Runtime Constants
>>> =================
>>>
>>> With a scalable vector type defined, we now need a way to generate addresses for
>>> consecutive vector values in memory and to be able to create basic constant
>>> vector values.
>>>
>>> For address generation, the `vscale` constant is added to represent the runtime
>>> value of `n` in `<n x m x type>`. Multiplying `vscale` by `m` and the number of
>>> bytes in `type` gives the total length of a scalable vector, and the backend
>>> can pattern match to the various load and store instructions in SVE that
>>> automatically scale with vector length.
>>
>> How would this work in scalar evolution?
>>
>> In a <4 x i32> loop, I know the block is 128-bits wide, so I can
>> predict loop carried dependencies, static ranges, etc.
>>
>> IIUC, dependencies can still be avoided by predicating access to N-1
>> elements, but would short static ranges will have to be kept as a
>> loop, because unrolling is not beneficial if the run-time length is
>> larger than the unroll factor.
>>
>> Actually, I can't think how you could possibly unroll anything with
>> this. I imagine two run-time checks + two SVE operations (+ predicate
>> fiddling) would be worse than a short sequence of 4 independent NEON
>> instructions, if the unroll factor is slightly larger than one
>> run-time vector.
>>
>> This example is particular to SVE and pathological cases, but I worry
>> that there may be lots of cases where the new notation will make it
>> difficult to decide. Still, only a problem to targets that do ask for
>> SVE, so people are aware of the trade-offs.
>
> There's many more cases for SVE where vectors 'may' overlap, since we don't
> have an exact size; for now, comparing size expressions featuring vscale
> will be overly cautious and generate very large ranges. This will cause
> extra checks to be planted for vectorized code that may not be needed,
> but will get us running vectorized code safely.
>
> Eventually, we'd like to upstream our work on controlling loops via
> predication, but that is a separate discussion (and one that will affect
> more people, since there's at least AVX-512 with a similar feature). For
> now we'd just like to get basic SVE support in.
>
> Unrolling and vectorizing does indeed have higher overhead when using VLA
> programming, so the cost model may need to be adjusted to account for that.
> We made sure it works for our downstream compiler but haven't used it for
> performance modelling.
>
>
>>
>>
>>
>>> For constant vector values, we cannot specify all the elements as we can for
>>> fixed-length vectors; fortunately only a small number of easily synthesized
>>> patterns are required for autovectorization. The `zeroinitializer` constant
>>> can be used as with fixed-length vectors for a constant zero splat. This can
>>> then be combined with `insertelement` and `shufflevector` to create arbitrary
>>> value splats in the same manner as fixed-length vectors.
>>>
>>> For constants consisting of a sequence of values, the `stepvector` constant is
>>> added to represent a simple constant of the form `<0, 1, 2... num_elems-1>`. To
>>> change the starting value a splat of the new start can be added, and changing
>>> the step requires multiplying by a splat.
>>>
>>> Alternatives Considered
>>> -----------------------
>>>
>>> We have chosen to represent these as constants to make pattern matching and
>>> constant folding easy, particularly for the mask argument of the
>>> `shufflevector` instruction.
>>>
>>> Another possibility is to use intrinsics similar to the old instructions, but
>>> currently these cannot be treated as constants. We wouldn't mind using
>>> intrinsics here, but would want to discuss how best to support constant folding
>>> and pattern matching without needing to add a lot more code.
>>>
>>> =======
>>> Example
>>> =======
>>>
>>> The following example shows a simple C loop which assigns the array index to
>>> the array elements matching that index. The IR shows how vscale and stepvector
>>> are used to create the needed values and to advance the index variable in the
>>> loop.
>>>
>>> C Code
>>> ------
>>>
>>> ``
>>> void IdentityArrayInit(int *a, int count) {
>>>  for (int i = 0; i < count; ++i)
>>>    a[i] = i;
>>> }
>>> ``
>>>
>>> Scalable IR Vector Body
>>> -----------------------
>>>
>>> ``
>>> vector.body:                                      ; preds = %vector.body.preheader, %vector.body
>>>  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>>>  %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]
>>>
>>>           ;; stepvector used for index identity on entry to loop body ;;
>>>  %vec.ind7 = phi <n x 4 x i32> [ %step.add8, %vector.body ],
>>>                                [ stepvector, %vector.body.preheader ]
>>>  %1 = add i64 %0, mul (i64 vscale, i64 4)
>>>
>>>           ;; vscale splat used to increment identity vector ;;
>>>  %step.add8 = add <n x 4 x i32> %vec.ind7, shufflevector (<n x 4 x i32> insertelement (<n x 4 x i32> undef,
>>>                                                                                        i32 mul (i32 vscale, i32 4), i32 0),
>>>                                                                                        <n x 4 x i32> undef,
>>>                                                                                        <n x 4 x i32> zeroinitializer)
>>
>> This syntax really gets me. Compare this with:
>>
>>  %a = add <4 x i32> %b, <i32 4, i32 4, i32 4, i32 4>
>>
>> It's going to be hard to infer anything from that syntax, which means
>> the resulting loop will be, for most purposes, as opaque as having an
>> intrinsic in there.
>
> It looks nice for fixed-length when the element values are immediate constants.
> If instead of '4' it was a value from an argument or loaded from memory, then it
> would look similar -- that's how splats are represented in IR, and using an
> intrinsic wouldn't change that. Existing optimizations will recognize the
> splat in this form.
>
>>
>>
>>>  %2 = getelementptr inbounds i32, i32* %a, i64 %0
>>>  %3 = bitcast i32* %2 to <n x 4 x i32>*
>>>  store <n x 4 x i32> %vec.ind7, <n x 4 x i32>* %3, align 4, !tbaa !1
>>>
>>>           ;; vscale used to increment loop index
>>>  %index.next = add i64 %index, mul (i64 vscale, i64 4)
>>>  %4 = icmp eq i64 %index.next, %n.vec
>>>  br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
>>> ``
>>>
>>> =====================
>>> Questions and Answers
>>> =====================
>>>
>>> Can the vector length change at runtime?
>>> ----------------------------------------
>>>
>>> It is possible to change vector length at runtime, but there is no model defined
>>> for how to resolve all the issues that arise when doing so. From the compiler's
>>> point of view, it is expected that vector length can be treated as constant but
>>> unknown.
>>>
>>> Since there's currently no way to indicate to the compiler where a change might
>>> occur, a programmer deliberately requesting changes to the vector length in the
>>> middle of a function containing autovectorized code would be considered a bug.
>>
>> User error. Yes.
>>
>>
>>> If changing the VL at runtime becomes desireable at some point in the future,
>>> the types and constants presented in this design would not need to change; we
>>> would need some way of creating a barrier between sections of IR which might
>>> have a different vscale, but that would be an addition to the current design
>>> instead of a change.
>>
>> No changes to IR, AFAIK. Just on how we use it.
>>
>>
>>> How do we spill/fill scalable registers on the stack?
>>> -----------------------------------------------------
>>>
>>> SVE registers have a (partially) unknown size at build time and their associated
>>> fill/spill instructions require an offset that is implicitly scaled by the
>>> vector length instead of bytes or element size. To accommodate this we
>>> created the concept of Stack Regions that are areas on the stack associated
>>> with specific data types or register classes.
>>>
>>> MachineFrameInfo has been extended with a list of Stack Regions that each
>>> contain a list of associated objects. Each Stack Region controls its own
>>> allocation, deallocation and internal offset calculations. For SVE we create
>>> separate regions for data and predicate registers, so the exact layout does
>>> not need to be known ahead of time, just relative offsets within regions.
>>
>> Can this be mapped into IR address spaces?
>
> I don't think these need to map into separate address spaces; we haven't done
> any investigation along those lines as the default single address space works
> fine with this.
>
>>
>> If some target make use of changing VL as you describe above, this
>> will break catastrophically, I imagine.
>
> Yes; there's a great many problems with changing VL at runtime, which is
> why we consider it a bug right now. I'm not aware of anyone actually asking
> for the functionality to be supported.
>
>>
>>
>>> Objects not belonging to a Stack Region use the default handling, so other
>>> existing targets are not affected.
>>>
>>> A more complete design for this will be detailed in a dedicated RFC later.
>>
>> I think this was a contentious enough point that might need a peek
>> into that RFC.
>>
>> Targets that don't have SVE will not suffer from changes in
>> MachineFrameInfo, but AArch64 may, and I'm curious on how that's
>> solved.
>
> Noted. We'll try and schedule some time to write it up soonish.
>
_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Reply | Threaded
Open this post in threaded view
|

Re: [llvm-dev] [RFC][SVE] Supporting Scalable Vector Architectures in LLVM IR (take 2)

Gerolf Hoflehner via llvm-dev
I’m sorry for the delay Renato.  I’ll take a look and get back to you in the next couple of days

-Chris

> On Jun 26, 2017, at 7:01 AM, Renato Golin via llvm-dev <[hidden email]> wrote:
>
> Chris, Chandler, ping. Any comments?
>
> On 7 June 2017 at 13:52, Graham Hunter <[hidden email]> wrote:
>> Hi Renato,
>>
>> Thanks for taking a look. Answers inline below, let me know if I've missed something out.
>>
>> -Graham
>>
>>
>>> On 5 Jun 2017, at 17:55, Renato Golin <[hidden email]> wrote:
>>>
>>> Hi Graham,
>>>
>>> Just making sure some people who voiced concerns are copied + cfe-dev.
>>>
>>> On 1 June 2017 at 15:22, Graham Hunter via llvm-dev
>>> <[hidden email]> wrote:
>>>> Hi,
>>>>
>>>> Here's the updated RFC for representing scalable vector types and associated constants in IR. I added a section to address questions that came up on the recent patch review.
>>>>
>>>> -Graham
>>>>
>>>> ===================================================
>>>> Supporting Scalable Vector Architectures in LLVM IR
>>>> ===================================================
>>>>
>>>> ==========
>>>> Background
>>>> ==========
>>>>
>>>> *ARMv8-A Scalable Vector Extensions* (SVE) is a vector ISA extension for
>>>> AArch64, featuring lane predication, scatter-gather, and speculative loads. It
>>>> is intended to scale with hardware such that the same binary running on a
>>>> processor with longer vector registers can take advantage of the increased
>>>> compute power without recompilation.
>>>>
>>>> As the vector length is no longer a statically known value, the way in which the
>>>> LLVM vectorizer generates code required modifications such that certain values
>>>> are now runtime evaluated expressions. This document describes changes proposed
>>>> to LLVM that allow its vectorizer to better target SVE.
>>>>
>>>> Documentation for SVE can be found at
>>>> https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
>>>>
>>>> =====
>>>> Types
>>>> =====
>>>>
>>>> To represent a vector of unknown length a boolean `Scalable` property has been
>>>> added to the `VectorType` class. Most code that deals with vectors doesn't need
>>>> to know the exact length, but does need to know relative lengths -- e.g. get
>>>> a vector with the same number of elements but a different element type, or with
>>>> half or double the number of elements.
>>>>
>>>> In order to allow code to transparently support scalable vectors, we introduce
>>>> an `ElementCount` class with two members:
>>>>
>>>> - `unsigned Min`: the minimum number of elements.
>>>> - `bool Scalable`: is the element count an unknown multiple of `Min`?
>>>>
>>>> For non-scalable vectors (``Scalable=false``) the scale is considered to be
>>>> equal to one and thus `Min` represents the exact number of elements in the
>>>> vector.
>>>>
>>>> The intent for code working with vectors is to use convenience methods and avoid
>>>> directly dealing with the number of elements. If needed, calling
>>>> `getElementCount` on a vector type instead of `getVectorNumElements` can be used
>>>> to obtain the (potentially scalable) number of elements. Overloaded division and
>>>> multiplication operators allow an ElementCount instance to be used in much the
>>>> same manner as an integer for most cases.
>>>
>>> Does that mean that it is guaranteed that a division / multiplication
>>> will only touch the minimum number of elements?
>>>
>>> The cases I imagine could happen:
>>>
>>> 1. <n x 4 x i32> / 2 = <n x 2 x i32>: This is half of the actual vector length.
>>> - does the predicate vector limit the end tail (half-scale)?
>>> - does it just pack more into the same vector (and update the
>>> predicate computation)?
>>> - because it would have half the bytes, I assume the former, but in
>>> theory, we could probably merge and get twice the performance, no?
>>> - or would this insert undefs in between? Say, a 2-way reduction:
>>>   - <v1, v2, v3, v4, v5, v6, v7, v8> -> <v1+v2, v3+v4, v5+v6, v7+v8>
>>>   - In SIMD, this would use fewer/smaller registers, but SVE is <n
>>> x 128> right?
>>>   - Do you pack the next vector into the previous <1+2, 3+4, 5+6,
>>> 7+8, 9+10, ...>?
>>>   - Or do you insert undefs?
>>>     - <1+2, 3+4, 5+6, 7+8, undef, undef, undef, undef>
>>>     - <1+2, undef, 3+4, undef, 5+6, undef, 7+8, undef>
>>> - does it matter? AFAIK, SVE only has reductions to scalar, but this is IR.
>>> - can it be target-specific? If not, should it be made illegal?
>>>
>>> 2. <n x 4 x i32> * 2 = <n x 8 x i32>: This is twice the actual vector length.
>>> - Would this be resolved by the target? If so, how?
>>> - Non-scalable vectors are just done one after the other
>>> - But scalable vectors have no known end-tail:
>>>   - Creating two <n x 4 x i32> may interleave the original data, either:
>>>     - <v1, v2, v3, v4> <v5, v6, v7, v8>, ...
>>>     - <vn+1, vn+2, vn+3, vn+4> <vn+5, vn+6, vn+7, vn+8>...
>>>  - or:
>>>     - <v1, v2, v3, v4> <vn+1, vn+2, vn+3, vn+4> ...
>>>     - <v5, v6, v7, v8> <vn+5, vn+6, vn+7, vn+8>...
>>> - the first would make more sense, but I'm not sure the scalar
>>> evolution would work that way out of the box
>>>
>>> 3. Widening / Narrowing should use the existing syntax, I think.
>>
>> Yes, it will only operate on the correct number of elements.
>>
>> So SelectionDAG legalization of target-unsupported IR types happens in
>> exactly the same way it does for fixed-length types, e.g.
>> * The <n x 8 x i32> above would be split into two <n x 4 x i32> values.
>> * The <n x 2 x i32> above would be extended to an <n x 2 x i64> value.
>>
>> When splitting, the linear order of elements would be preserved, so given a
>> vector like <1, 2, 3, ... 2n> you would end up with
>> <1, 2, ... n> and <n+1, n+2, ... 2n>.
>> If you need a different arrangement, the shufflevector instruction can be used
>> with appropriate use of stepvector and vscale to generate masks.
>>
>> The IR predicates for a given scalable vector (for select) just match the
>> number of elements, e.g. <n x 8 x i1> for the <n x 8 x i32>, and would be
>> split along with it during legalization.
>>
>> As you say, the main legal types for SVE are based around multiples of
>> 128b base types, but for some types we've had to use predication to help.
>> An <n x 2 x f32> can't be extended to use f64s, but when generating code
>> we can create a predicate for 64b element types and use that with 32b float
>> instructions, so that the vector interleaves f32 values with 32bit undefs
>> and gives you an effective <n x 2 x f32> vector.
>>
>> For horizontal pair operations, you would use shuffle instructions to align
>> the operands in matching lanes for two vectors and then perform the operation.
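[Editor's note: a sketch of the mask construction mentioned above, in the proposed syntax; `%v` and the chosen arrangement are hypothetical. Here a constant mask selecting the even-indexed elements is built from `stepvector` and a splat of 2:]

``
; %v is an <n x 8 x i32> value; select elements 0, 2, 4, ... into an
; <n x 4 x i32> using a constant mask of the form <0, 2, 4, ...>.
%even = shufflevector <n x 8 x i32> %v, <n x 8 x i32> undef,
          <n x 4 x i32> mul (<n x 4 x i32> stepvector,
                             <n x 4 x i32> shufflevector (<n x 4 x i32> insertelement (<n x 4 x i32> undef, i32 2, i32 0),
                                                          <n x 4 x i32> undef,
                                                          <n x 4 x i32> zeroinitializer))
``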
>>>
>>>
>>>>
>>>> This mixture of static and runtime quantities allows us to reason about the
>>>> relationship between different scalable vector types without knowing their
>>>> exact length.
>>>>
>>>> IR Textual Form
>>>> ---------------
>>>>
>>>> The textual form for a scalable vector is:
>>>>
>>>> ``<[n x ]<m> x <type>>``
>>>>
>>>> where `type` is the scalar type of each element, `m` is the minimum number of
>>>> elements, and the string literal `n x` indicates that the total number of
>>>> elements is an unknown multiple of `m`; `n` is just an arbitrary choice for
>>>> indicating that the vector is scalable, and could be substituted by another.
>>>> For fixed-length vectors, the `n x` is omitted, so there is no change in the
>>>> format for existing vectors.
>>>>
>>>> Scalable vectors with the same `Min` value have the same number of elements, and
>>>> the same number of bytes if `Min * sizeof(type)` is the same:
>>>>
>>>> ``<n x 4 x i32>`` and ``<n x 4 x i8>`` have the same number of elements.
>>>>
>>>> ``<n x 4 x i32>`` and ``<n x 8 x i16>`` have the same number of bytes.
>>>>
>>>> IR Bitcode Form
>>>> ---------------
>>>>
>>>> To serialize scalable vectors to bitcode, a new boolean field is added to the
>>>> type record. If the field is not present the type will default to a fixed-length
>>>> vector type, preserving backwards compatibility.
>>>>
>>>> Alternatives Considered
>>>> -----------------------
>>>>
>>>> We had two alternatives in mind -- a dedicated target specific type, and a
>>>> subclass inheriting from VectorType.
>>>>
>>>> A dedicated target type would either need to extend all existing passes that
>>>> work with vectors to recognize the new type, or to duplicate all that code
>>>> in order to get reasonable code generation.
>>>>
>> A subclass would take care of part of this, but would still need checks for
>> whether a VectorType was scalable in each place a new VectorType was required.
>>>>
>>>> Although our current solution will need to change some of the code that creates
>>>> new VectorTypes, much of that code doesn't need to care about whether the types
>>>> are scalable or not -- they can use preexisting methods like
>>>> `getHalfElementsVectorType`. If the code is a little more complex,
>>>> `ElementCount` structs can be used instead of an `unsigned` value to represent
>>>> the number of elements.
>>>>
>>>> =================
>>>> Runtime Constants
>>>> =================
>>>>
>>>> With a scalable vector type defined, we now need a way to generate addresses for
>>>> consecutive vector values in memory and to be able to create basic constant
>>>> vector values.
>>>>
>>>> For address generation, the `vscale` constant is added to represent the runtime
>>>> value of `n` in `<n x m x type>`. Multiplying `vscale` by `m` and the number of
>>>> bytes in `type` gives the total length of a scalable vector, and the backend
>>>> can pattern match to the various load and store instructions in SVE that
>>>> automatically scale with vector length.
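[Editor's note: a short sketch of the address generation described above, in the proposed syntax; `%ptr` is a hypothetical base pointer.]

``
; One <n x 4 x i32> vector occupies vscale * 4 i32 elements in memory,
; so the next consecutive vector starts vscale * 4 elements later.
%next = getelementptr i32, i32* %ptr, i64 mul (i64 vscale, i64 4)
``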
>>>
>>> How would this work in scalar evolution?
>>>
>>> In a <4 x i32> loop, I know the block is 128-bits wide, so I can
>>> predict loop carried dependencies, static ranges, etc.
>>>
>>> IIUC, dependencies can still be avoided by predicating access to N-1
>>> elements, but short static ranges would have to be kept as a
>>> loop, because unrolling is not beneficial if the run-time length is
>>> larger than the unroll factor.
>>>
>>> Actually, I can't think how you could possibly unroll anything with
>>> this. I imagine two run-time checks + two SVE operations (+ predicate
>>> fiddling) would be worse than a short sequence of 4 independent NEON
>>> instructions, if the unroll factor is slightly larger than one
>>> run-time vector.
>>>
>>> This example is particular to SVE and pathological cases, but I worry
>>> that there may be lots of cases where the new notation will make it
>>> difficult to decide. Still, only a problem to targets that do ask for
>>> SVE, so people are aware of the trade-offs.
>>
>> There are many more cases for SVE where vectors 'may' overlap, since we don't
>> have an exact size; for now, comparing size expressions featuring vscale
>> will be overly cautious and generate very large ranges. This will cause
>> extra checks to be planted for vectorized code that may not be needed,
>> but will get us running vectorized code safely.
>>
>> Eventually, we'd like to upstream our work on controlling loops via
>> predication, but that is a separate discussion (and one that will affect
>> more people, since there's at least AVX-512 with a similar feature). For
>> now we'd just like to get basic SVE support in.
>>
>> Unrolling and vectorizing do indeed have higher overhead when using VLA
>> programming, so the cost model may need to be adjusted to account for that.
>> We made sure it works for our downstream compiler but haven't used it for
>> performance modelling.
>>
>>
>>>
>>>
>>>
>>>> For constant vector values, we cannot specify all the elements as we can for
>>>> fixed-length vectors; fortunately only a small number of easily synthesized
>>>> patterns are required for autovectorization. The `zeroinitializer` constant
>>>> can be used as with fixed-length vectors for a constant zero splat. This can
>>>> then be combined with `insertelement` and `shufflevector` to create arbitrary
>>>> value splats in the same manner as fixed-length vectors.
>>>>
>>>> For constants consisting of a sequence of values, the `stepvector` constant is
>>>> added to represent a simple constant of the form `<0, 1, 2... num_elems-1>`. To
>>>> change the starting value a splat of the new start can be added, and changing
>>>> the step requires multiplying by a splat.
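[Editor's note: for example, in the proposed syntax (with hypothetical scalar values `%start` and `%step`), the sequence <start, start+step, start+2*step, ...> could be built as:]

``
; Splat %step across all lanes, scale the lane indices 0, 1, 2, ... by
; it, then add a splat of %start.
%step.ins    = insertelement <n x 4 x i32> undef, i32 %step, i32 0
%step.splat  = shufflevector <n x 4 x i32> %step.ins, <n x 4 x i32> undef,
                             <n x 4 x i32> zeroinitializer
%scaled      = mul <n x 4 x i32> stepvector, %step.splat
%start.ins   = insertelement <n x 4 x i32> undef, i32 %start, i32 0
%start.splat = shufflevector <n x 4 x i32> %start.ins, <n x 4 x i32> undef,
                             <n x 4 x i32> zeroinitializer
%seq         = add <n x 4 x i32> %scaled, %start.splat
``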
>>>>
>>>> Alternatives Considered
>>>> -----------------------
>>>>
>>>> We have chosen to represent these as constants to make pattern matching and
>>>> constant folding easy, particularly for the mask argument of the
>>>> `shufflevector` instruction.
>>>>
>>>> Another possibility is to use intrinsics similar to the old instructions, but
>>>> currently these cannot be treated as constants. We wouldn't mind using
>>>> intrinsics here, but would want to discuss how best to support constant folding
>>>> and pattern matching without needing to add a lot more code.
>>>>
>>>> =======
>>>> Example
>>>> =======
>>>>
>>>> The following example shows a simple C loop which assigns the array index to
>>>> the array elements matching that index. The IR shows how vscale and stepvector
>>>> are used to create the needed values and to advance the index variable in the
>>>> loop.
>>>>
>>>> C Code
>>>> ------
>>>>
>>>> ``
>>>> void IdentityArrayInit(int *a, int count) {
>>>> for (int i = 0; i < count; ++i)
>>>>   a[i] = i;
>>>> }
>>>> ``
>>>>
>>>> Scalable IR Vector Body
>>>> -----------------------
>>>>
>>>> ``
>>>> vector.body:                                      ; preds = %vector.body.preheader, %vector.body
>>>> %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>>>> %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]
>>>>
>>>>          ;; stepvector used for index identity on entry to loop body ;;
>>>> %vec.ind7 = phi <n x 4 x i32> [ %step.add8, %vector.body ],
>>>>                               [ stepvector, %vector.body.preheader ]
>>>> %1 = add i64 %0, mul (i64 vscale, i64 4)
>>>>
>>>>          ;; vscale splat used to increment identity vector ;;
>>>> %step.add8 = add <n x 4 x i32> %vec.ind7, shufflevector (<n x 4 x i32> insertelement (<n x 4 x i32> undef,
>>>>                                                                                       i32 mul (i32 vscale, i32 4), i32 0),
>>>>                                                                                       <n x 4 x i32> undef,
>>>>                                                                                       <n x 4 x i32> zeroinitializer)
>>>
>>> This syntax really gets me. Compare this with:
>>>
>>> %a = add <4 x i32> %b, <i32 4, i32 4, i32 4, i32 4>
>>>
>>> It's going to be hard to infer anything from that syntax, which means
>>> the resulting loop will be, for most purposes, as opaque as having an
>>> intrinsic in there.
>>
>> It looks nice for fixed-length when the element values are immediate constants.
>> If instead of '4' it was a value from an argument or loaded from memory, then it
>> would look similar -- that's how splats are represented in IR, and using an
>> intrinsic wouldn't change that. Existing optimizations will recognize the
>> splat in this form.
>>
>>>
>>>
>>>> %2 = getelementptr inbounds i32, i32* %a, i64 %0
>>>> %3 = bitcast i32* %2 to <n x 4 x i32>*
>>>> store <n x 4 x i32> %vec.ind7, <n x 4 x i32>* %3, align 4, !tbaa !1
>>>>
>>>>          ;; vscale used to increment loop index
>>>> %index.next = add i64 %index, mul (i64 vscale, i64 4)
>>>> %4 = icmp eq i64 %index.next, %n.vec
>>>> br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
>>>> ``
>>>>
>>>> =====================
>>>> Questions and Answers
>>>> =====================
>>>>
>>>> Can the vector length change at runtime?
>>>> ----------------------------------------
>>>>
>>>> It is possible to change vector length at runtime, but there is no model defined
>>>> for how to resolve all the issues that arise when doing so. From the compiler's
>>>> point of view, it is expected that vector length can be treated as constant but
>>>> unknown.
>>>>
>>>> Since there's currently no way to indicate to the compiler where a change might
>>>> occur, a programmer deliberately requesting changes to the vector length in the
>>>> middle of a function containing autovectorized code would be considered a bug.
>>>
>>> User error. Yes.
>>>
>>>
>>>> If changing the VL at runtime becomes desirable at some point in the future,
>>>> the types and constants presented in this design would not need to change; we
>>>> would need some way of creating a barrier between sections of IR which might
>>>> have a different vscale, but that would be an addition to the current design
>>>> instead of a change.
>>>
>>> No changes to IR, AFAIK. Just on how we use it.
>>>
>>>
>>>> How do we spill/fill scalable registers on the stack?
>>>> -----------------------------------------------------
>>>>
>>>> SVE registers have a (partially) unknown size at build time and their associated
>>>> fill/spill instructions require an offset that is implicitly scaled by the
>>>> vector length instead of bytes or element size. To accommodate this we
>>>> created the concept of Stack Regions that are areas on the stack associated
>>>> with specific data types or register classes.
>>>>
>>>> MachineFrameInfo has been extended with a list of Stack Regions that each
>>>> contain a list of associated objects. Each Stack Region controls its own
>>>> allocation, deallocation and internal offset calculations. For SVE we create
>>>> separate regions for data and predicate registers, so the exact layout does
>>>> not need to be known ahead of time, just relative offsets within regions.
>>>
>>> Can this be mapped into IR address spaces?
>>
>> I don't think these need to map into separate address spaces; we haven't done
>> any investigation along those lines as the default single address space works
>> fine with this.
>>
>>>
>>> If some target make use of changing VL as you describe above, this
>>> will break catastrophically, I imagine.
>>
>> Yes; there's a great many problems with changing VL at runtime, which is
>> why we consider it a bug right now. I'm not aware of anyone actually asking
>> for the functionality to be supported.
>>
>>>
>>>
>>>> Objects not belonging to a Stack Region use the default handling, so other
>>>> existing targets are not affected.
>>>>
>>>> A more complete design for this will be detailed in a dedicated RFC later.
>>>
>>> I think this was a contentious enough point that might need a peek
>>> into that RFC.
>>>
>>> Targets that don't have SVE will not suffer from changes in
>>> MachineFrameInfo, but AArch64 may, and I'm curious on how that's
>>> solved.
>>
>> Noted. We'll try and schedule some time to write it up soonish.
>>
> _______________________________________________
> LLVM Developers mailing list
> [hidden email]
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev


Re: [llvm-dev] [RFC][SVE] Supporting Scalable Vector Architectures in LLVM IR (take 2)

Gerolf Hoflehner via llvm-dev
In reply to this post by Gerolf Hoflehner via llvm-dev
On Jun 1, 2017, at 7:22 AM, Graham Hunter via llvm-dev <[hidden email]> wrote:
> Hi,
>
> Here's the updated RFC for representing scalable vector types and associated constants in IR. I added a section to address questions that came up on the recent patch review.

Thanks for sending this out Graham.  Here are some comments:

> =====
> Types
> =====
>
> To represent a vector of unknown length a boolean `Scalable` property has been
> added to the `VectorType` class. Most code that deals with vectors doesn't need
> to know the exact length, but does need to know relative lengths -- e.g. get
> a vector with the same number of elements but a different element type, or with
> half or double the number of elements.
>
> In order to allow code to transparently support scalable vectors, we introduce
> an `ElementCount` class with two members:
>
> - `unsigned Min`: the minimum number of elements.
> - `bool Scalable`: is the element count an unknown multiple of `Min`?
> For non-scalable vectors (``Scalable=false``) the scale is considered to be
> equal to one and thus `Min` represents the exact number of elements in the
> vector.
>
> The intent for code working with vectors is to use convenience methods and avoid
> directly dealing with the number of elements. If needed, calling
> `getElementCount` on a vector type instead of `getVectorNumElements` can be used
> to obtain the (potentially scalable) number of elements. Overloaded division and
> multiplication operators allow an ElementCount instance to be used in much the
> same manner as an integer for most cases.
>
> This mixture of static and runtime quantities allows us to reason about the
> relationship between different scalable vector types without knowing their
> exact length.

This is a clever approach to unifying the two concepts, and I think that the approach is basically reasonable.  The primary problems that this will introduce are:

1) Almost anything touching (e.g. transforming) vector operations will have to be aware of this concept.  Given a first class implementation of SVE, I don’t see how that’s avoidable though, and your extension of VectorType is sensible.

2) This means that VectorType is sometimes fixed size, and sometimes unknowable.  I don’t think we have an existing analog for that in the type system.

Is this type a first class type?  Can you PHI them, can you load/store them, can you pass them as function arguments without limitations?  If not, that is a serious problem.  How does struct layout with a scalable vector in it work?  What does an alloca of one of them look like?  What does a spill look like in codegen?

> IR Textual Form
> ---------------
>
> The textual form for a scalable vector is:
>
> ``<[n x ]<m> x <type>>``
>
> where `type` is the scalar type of each element, `m` is the minimum number of
> elements, and the string literal `n x` indicates that the total number of
> elements is an unknown multiple of `m`; `n` is just an arbitrary choice for
> indicating that the vector is scalable, and could be substituted by another.
> For fixed-length vectors, the `n x` is omitted, so there is no change in the
> format for existing vectors.
>
> Scalable vectors with the same `Min` value have the same number of elements, and
> the same number of bytes if `Min * sizeof(type)` is the same:
>
> ``<n x 4 x i32>`` and ``<n x 4 x i8>`` have the same number of elements.
>
> ``<n x 4 x i32>`` and ``<n x 8 x i16>`` have the same number of bytes.

It’s a trivial syntactic issue, but I’d suggest something more along the lines of:

<scalable 4 x i32>

or something like that, just to make it easier to read.

> Alternatives Considered
> -----------------------
>
> We had two alternatives in mind -- a dedicated target specific type, and a
> subclass inheriting from VectorType.

I think that a target-specific type (e.g. like we have X86_mmx) is the only reasonable alternative.  A subclass of VectorType is just another implementation approach of your design above.  This is assuming that scalable vectors are really first class types.

The pros and cons of a separate type is that it avoids you having to touch everything that touches VectorTypes, and if it turns out that the code that needs to handle normal SIMD and scalable SIMD vectors is different, then it is a win to split them into two types.  If, on the other hand, most code would treat the two types similarly, then it is better to just have one type.

The major concern I have here is that I’m not sure how scalable vectors can be considered to be first class types, given that we don’t know their size.  If they can’t be put in an LLVM struct (for example), then this would pose a significant problem with your current approach.  It would be a huge problem if VectorType could be in structs in some cases, but not others.

> Although our current solution will need to change some of the code that creates
> new VectorTypes, much of that code doesn't need to care about whether the types
> are scalable or not -- they can use preexisting methods like
> `getHalfElementsVectorType`. If the code is a little more complex,
> `ElementCount` structs can be used instead of an `unsigned` value to represent
> the number of elements.

Agreed, that seems fine to me if true.

> =================
> Runtime Constants
> =================
>
> With a scalable vector type defined, we now need a way to generate addresses for
> consecutive vector values in memory and to be able to create basic constant
> vector values.
>
> For address generation, the `vscale` constant is added to represent the runtime
> value of `n` in `<n x m x type>`.

This should probably be an intrinsic, not an llvm::Constant.  The design of llvm::Constant is already wrong: it shouldn’t have operations like divide, and it would be better to not contribute to the problem.

> Multiplying `vscale` by `m` and the number of
> bytes in `type` gives the total length of a scalable vector, and the backend
> can pattern match to the various load and store instructions in SVE that
> automatically scale with vector length.

It is fine for the intrinsic to turn into a target specific ISD node in selection dag to allow your pattern matching.

> =====================
> Questions and Answers
> =====================
>
> Can the vector length change at runtime?
> ----------------------------------------
>
> It is possible to change vector length at runtime, but there is no model defined
> for how to resolve all the issues that arise when doing so. From the compiler's
> point of view, it is expected that vector length can be treated as constant but
> unknown.

The way I would explain it is that it is a (load-time) constant.  There is no practical way for software to handle this case, so even if hardware can do it, it is a non-goal to support it.

> How do we spill/fill scalable registers on the stack?
> -----------------------------------------------------
>
> SVE registers have a (partially) unknown size at build time and their associated
> fill/spill instructions require an offset that is implicitly scaled by the
> vector length instead of bytes or element size. To accommodate this we
> created the concept of Stack Regions that are areas on the stack associated
> with specific data types or register classes.

Ok, that sounds complicated, but can surely be made to work.  The bigger problem is that there are various LLVM IR transformations that want to put registers into memory.  All of these will be broken with this sort of type.

-Chris


Re: [llvm-dev] [RFC][SVE] Supporting Scalable Vector Architectures in LLVM IR (take 2)

Gerolf Hoflehner via llvm-dev
[Sending again to list]

Hi Chris,

Responses inline...

On 6 July 2017 at 21:02, Chris Lattner via llvm-dev
<[hidden email]> wrote:
>
> Thanks for sending this out Graham.  Here are some comments:
>
> This is a clever approach to unifying the two concepts, and I think that the approach is basically reasonable.  The primary problems that this will introduce are:
>
> 1) Almost anything touching (e.g. transforming) vector operations will have to be aware of this concept.  Given a first class implementation of SVE, I don’t see how that’s avoidable though, and your extension of VectorType is sensible.

Yes, however we have found that the vast majority of vector transforms
don't need any modification to deal with scalable types. There are
obviously exceptions, things like analysing shuffle vector masks for
specific patterns etc.

>
> 2) This means that VectorType is sometimes fixed size, and sometimes unknowable.
>
> Is this type a first class type?  Can you PHI them, can you load/store them, can you pass them as function arguments without limitations?  If not, that is a serious problem.  How does struct layout with a scalable vector in it work?  What does an alloca of one of them look like?  What does a spill look like in codegen?
Yes, as an extension to VectorType they can be manipulated and passed
around like normal vectors: loaded/stored directly, used in phis, put in
LLVM structs, etc. Address computation generates expressions in terms of
vscale and it seems to work well.
>
> I think that a target-specific type (e.g. like we have X86_mmx) is the only reasonable alternative.  A subclass of VectorType is just another implementation approach of your design above.  This is assuming that scalable vectors are really first class types.
>
> The pros and cons of a separate type is that it avoids you having to touch everything that touches VectorTypes, and if it turns out that the code that needs to handle normal SIMD and scalable SIMD vectors is different, then it is a win to split them into two types.  If, on the other hand, most code would treat the two types similarly, then it is better to just have one type.

Fortunately the latter case is exactly what we've found. Most
operations on vectors are not actually concerned with their absolute
size, and are more usually concerned with relative sizes, if anything.
>
> The major concern I have here is that I’m not sure how scalable vectors can be considered to be first class types, given that we don’t know their size.  If they can’t be put in an LLVM struct (for example), then this would pose a significant problem with your current approach.  It would be a huge problem if VectorType could be in structs in some cases, but not others.
We can have them as first class types but as you say it does require
us to be careful with reasoning about their sizes. In practice there
are architectural limits on the sizes of vectors, so it's possible to
have an upper bound on the size. However to be completely accurate,
type sizes in LLVM probably need to have some symbolic representation
such that we can reason about their sizes in terms of, essentially,
the vscale constant. The other potential avenue is to make all type
size queries in LLVM return optional values. We haven't implemented
either of these and we haven't yet hit an issue, not to say there
isn't one. I think most of the uses of querying type sizes are to
compare against other type sizes, so relative comparisons still work
even with scalable types. This area is something we want some
community input to build consensus on though.


>> With a scalable vector type defined, we now need a way to generate addresses for
>> consecutive vector values in memory and to be able to create basic constant
>> vector values.
>>
>> For address generation, the `vscale` constant is added to represent the runtime
>> value of `n` in `<n x m x type>`.
>
> This should probably be an intrinsic, not an llvm::Constant.  The design of llvm::Constant is already wrong: it shouldn’t have operations like divide, and it would be better to not contribute to the problem.
Could you explain your position more on this? The Constant
architecture has been a very natural fit for this concept from our
perspective.
>
>> Multiplying `vscale` by `m` and the number of
>> bytes in `type` gives the total length of a scalable vector, and the backend
>> can pattern match to the various load and store instructions in SVE that
>> automatically scale with vector length.
>
> It is fine for the intrinsic to turn into a target specific ISD node in selection dag to allow your pattern matching.
>

>
>> How do we spill/fill scalable registers on the stack?
>> -----------------------------------------------------
>>
>> SVE registers have a (partially) unknown size at build time and their associated
>> fill/spill instructions require an offset that is implicitly scaled by the
>> vector length instead of bytes or element size. To accommodate this we
>> created the concept of Stack Regions that are areas on the stack associated
>> with specific data types or register classes.
>
> Ok, that sounds complicated, but can surely be made to work.  The bigger problem is that there are various LLVM IR transformations that want to put registers into memory.  All of these will be broken with this sort of type.
Could you give an example?

Thanks for taking the time to review this,
Amara

Re: [llvm-dev] [RFC][SVE] Supporting Scalable Vector Architectures in LLVM IR (take 2)

Gerolf Hoflehner via llvm-dev
On Jul 6, 2017, at 3:03 PM, Amara Emerson <[hidden email]> wrote:
>> 1) Almost anything touching (e.g. transforming) vector operations will have to be aware of this concept.  Given a first class implementation of SVE, I don’t see how that’s avoidable though, and your extension of VectorType is sensible.
>
> Yes, however we have found that the vast majority of vector transforms
> don't need any modification to deal with scalable types. There are
> obviously exceptions, things like analysing shuffle vector masks for
> specific patterns etc.

Ok great.

>> 2) This means that VectorType is sometimes fixed size, and sometimes unknowable.
>>
>> Is this type a first class type?  Can you PHI them, can you load/store them, can you pass them as function arguments without limitations?  If not, that is a serious problem.  How does struct layout with a scalable vector in it work?  What does an alloca of one of them look like?  What does a spill look like in codegen?
> Yes, as an extension to VectorType they can be manipulated and passed
> around like normal vectors: loaded/stored directly, used in phis, put in
> LLVM structs, etc. Address computation generates expressions in terms of
> vscale and it seems to work well.

Right, that works out through composition, but what does it mean?  I can't have a global variable of a scalable vector type, nor does it make sense for a scalable vector to be embeddable in an LLVM IR struct: nothing that measures the size of a struct is prepared to deal with a non-constant answer.

>>> With a scalable vector type defined, we now need a way to generate addresses for
>>> consecutive vector values in memory and to be able to create basic constant
>>> vector values.
>>>
>>> For address generation, the `vscale` constant is added to represent the runtime
>>> value of `n` in `<n x m x type>`.
>>
>> This should probably be an intrinsic, not an llvm::Constant.  The design of llvm::Constant is already wrong: it shouldn’t have operations like divide, and it would be better to not contribute to the problem.
> Could you explain your position more on this? The Constant
> architecture has been a very natural fit for this concept from our
> perspective.

It is appealing, but it is wrong.  Constant should really only model primitive constants (ConstantInt/FP, etc) and we should have one more form for “relocatable” constants.  Instead, we have intertwined constant folding and ConstantExpr logic that doesn’t make sense.

A better pattern to follow are intrinsics like (e.g.) llvm.coro.size.i32(), which always returns a constant value.
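For illustration, a minimal sketch of what the intrinsic form might look like, by analogy with llvm.coro.size. The name llvm.vscale.i64 is hypothetical here, not something the RFC has settled on:

```llvm
; Hypothetical intrinsic form of vscale, following the llvm.coro.size
; pattern of an intrinsic that always returns a (target-)constant value.
declare i64 @llvm.vscale.i64()

define i64 @bytes_per_vector() {
  ; Bytes in one <n x 4 x i32> register: vscale * 4 elements * 4 bytes.
  %vs = call i64 @llvm.vscale.i64()
  %bytes = mul i64 %vs, 16
  ret i64 %bytes
}
```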

>> Ok, that sounds complicated, but can surely be made to work.  The bigger problem is that there are various LLVM IR transformations that want to put registers into memory.  All of these will be broken with this sort of type.
> Could you give an example?

The concept of “reg2mem” is to put SSA values into allocas for passes that can’t (or don’t want to) update SSA.  Similarly, function body extraction can turn SSA values into parameters, and depending on the implementation can pack them into structs.  The coroutine logic similarly needs to store registers if they cross suspend points, there are multiple other examples.
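For concreteness, a reg2mem-style demotion of a scalable-vector SSA value would look something like the sketch below, written with the RFC's proposed <n x m x ty> syntax. Whether such allocas are fully supported is exactly the question under discussion:

```llvm
; Sketch: demoting a scalable-vector SSA value to a stack slot, as
; reg2mem would. The slot's size is only known as a multiple of vscale.
define void @demote(<n x 4 x i32> %v) {
  %slot = alloca <n x 4 x i32>
  store <n x 4 x i32> %v, <n x 4 x i32>* %slot
  ; ... code that cannot (or does not want to) update SSA uses %slot ...
  %v.reload = load <n x 4 x i32>, <n x 4 x i32>* %slot
  ret void
}
```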

-Chris


Re: [llvm-dev] [RFC][SVE] Supporting Scalable Vector Architectures in LLVM IR (take 2)

Gerolf Hoflehner via llvm-dev
On 6 July 2017 at 23:13, Chris Lattner <[hidden email]> wrote:
>> Yes, as an extension to VectorType they can be manipulated and passed
>> around like normal vectors: loaded/stored directly, used in phis, put in
>> LLVM structs, etc. Address computation generates expressions in terms of
>> vscale and it seems to work well.
>
> Right, that works out through composition, but what does it mean?  I can't have a global variable of a scalable vector type, nor does it make sense for a scalable vector to be embeddable in an LLVM IR struct: nothing that measures the size of a struct is prepared to deal with a non-constant answer.
Although the absolute sizes of the types aren't known at compile time,
there are upper bounds which the compiler can assume, and use to allow
allocation of storage for global variables and the like. The issue
with composite type sizes again reduces to type sizes being either
symbolic expressions or simply unknown in some cases.

>>> This should probably be an intrinsic, not an llvm::Constant.  The design of llvm::Constant is already wrong: it shouldn’t have operations like divide, and it would be better to not contribute to the problem.
>> Could you explain your position more on this? The Constant
>> architecture has been a very natural fit for this concept from our
>> perspective.
>
> It is appealing, but it is wrong.  Constant should really only model primitive constants (ConstantInt/FP, etc) and we should have one more form for “relocatable” constants.  Instead, we have intertwined constant folding and ConstantExpr logic that doesn’t make sense.
>
> A better pattern to follow are intrinsics like (e.g.) llvm.coro.size.i32(), which always returns a constant value.
Ok, we'll investigate this issue further.
>
>>> Ok, that sounds complicated, but can surely be made to work.  The bigger problem is that there are various LLVM IR transformations that want to put registers into memory.  All of these will be broken with this sort of type.
>> Could you give an example?
>
> The concept of “reg2mem” is to put SSA values into allocas for passes that can’t (or don’t want to) update SSA.  Similarly, function body extraction can turn SSA values into parameters, and depending on the implementation can pack them into structs.  The coroutine logic similarly needs to store registers if they cross suspend points, there are multiple other examples.
I think this should still work. Allocas of scalable vectors are supported,
and it's only later, at codegen, that the unknown sizes result in more
work being needed to compute stack offsets correctly. The caveat is
that a direct call to something like getTypeStoreSize() will need to
be aware of expressions/sizeless types. If, however, these passes are
exclusively using allocas to put registers into memory, or using
structs with extractvalue etc., then they shouldn't need to care, and
codegen deals with the low-level details.
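As an illustration of the address computation this implies, here is a sketch of indexing the second of two scalable vectors stored contiguously. The vscale value is passed in as a plain parameter purely for illustration; how it is actually materialized (constant vs. intrinsic) is the open question above:

```llvm
; Sketch: addressing the second of two contiguous <n x 4 x i32> values.
; The byte offset of the second vector is vscale * 16, not a constant.
define <n x 4 x i32> @second(i8* %base, i64 %vscale) {
  %off = mul i64 %vscale, 16          ; sizeof(<n x 4 x i32>) in bytes
  %p8 = getelementptr i8, i8* %base, i64 %off
  %p = bitcast i8* %p8 to <n x 4 x i32>*
  %v = load <n x 4 x i32>, <n x 4 x i32>* %p
  ret <n x 4 x i32> %v
}
```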

Thanks,
Amara

Re: [llvm-dev] [RFC][SVE] Supporting Scalable Vector Architectures in LLVM IR (take 2)

Gerolf Hoflehner via llvm-dev
Hi Amara,

I was wondering if you have a link to a suggested programming model in mind for SVE and how it'll interact with other vector operations?

-eric

On Thu, Jul 6, 2017 at 3:53 PM Amara Emerson via llvm-dev <[hidden email]> wrote:
> [snip]


Re: [llvm-dev] [RFC][SVE] Supporting Scalable Vector Architectures in LLVM IR (take 2)

Gerolf Hoflehner via llvm-dev
Hello,

Shouldn't OpenMP SIMD be a natural fit for SVE? If so, special attention should be paid to simd functions (#pragma omp declare simd), which allow passing and returning vector values.

Regards,
Serge.



07.07.2017, 10:02, "Eric Christopher via llvm-dev" <[hidden email]>:

> [snip]

Re: [llvm-dev] [RFC][SVE] Supporting Scalable Vector Architectures in LLVM IR (take 2)

Gerolf Hoflehner via llvm-dev
Hi,

Yes, OpenMP simd pragmas work quite well with SVE, and there's a work-in-progress vector ABI for 'declare simd' pragmas to vectorize whole functions.

We also have a tentative ACLE specification, to allow the use of sve via C-level intrinsics. The specification is here: http://infocenter.arm.com/help/topic/com.arm.doc.ecm0665619/acle_sve_100987_0000_00_en.pdf

There are a couple of examples here: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.100891_0607_00_en/wap1490203634804.html

-Graham


> On 7 Jul 2017, at 06:56, Сергей Прейс via llvm-dev <[hidden email]> wrote:
> [snip]


Re: [llvm-dev] [RFC][SVE] Supporting Scalable Vector Architectures in LLVM IR (take 2)

Gerolf Hoflehner via llvm-dev

> On 6 Jul 2017, at 23:53, Amara Emerson <[hidden email]> wrote:
>
> On 6 July 2017 at 23:13, Chris Lattner <[hidden email]> wrote:
>>> Yes, as an extension to VectorType they can be manipulated and passed
>>> around like normal vectors: loaded/stored directly, used in phis, put in
>>> LLVM structs, etc. Address computation generates expressions in terms of
>>> vscale and it seems to work well.
>>
>> Right, that works out through composition, but what does it mean?  I can't have a global variable of a scalable vector type, nor does it make sense for a scalable vector to be embeddable in an LLVM IR struct: nothing that measures the size of a struct is prepared to deal with a non-constant answer.
> Although the absolute sizes of the types aren't known at compile time,
> there are upper bounds which the compiler can assume, and use to allow
> allocation of storage for global variables and the like. The issue
> with composite type sizes again reduces to type sizes being either
> symbolic expressions or simply unknown in some cases.

To elaborate a bit more on this: our current compiler assumes a fixed upper bound (256B/2048b) on the size of scalable vector types in IR. That gets us working code, but isn't truly scalable. It also doesn't work when calculating offsets within structs while building SelectionDAG nodes, so we changed the optional "Offsets" argument to ComputeValueVTs to be a SmallVectorImpl of pairs representing scaled and unscaled byte offsets; scaled offsets must be multiplied by vscale to get the correct addresses. It's possible this mechanism could be used in IR as well; I'll investigate. Does anyone have a strong objection to that idea up front?
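To illustrate the (unscaled, scaled) decomposition on a concrete struct, here is a sketch; the exact alignment and padding rules are an assumption for the example, and the address of each field is base + unscaled + scaled * vscale:

```llvm
; Sketch: offsets within %struct.S = type { i64, <n x 4 x i32>, i64 },
; expressed as (unscaled, scaled) byte-offset pairs:
;   field 0 (i64):            (0, 0)    ; at the start of the struct
;   field 1 (<n x 4 x i32>):  (8, 0)    ; after the leading i64
;   field 2 (i64):            (8, 16)   ; skips vscale*16 bytes of vector
```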

As for having scalable vectors inside IR structs, we do support that, for two reasons. First, it makes it easier to write unit tests where we return multiple scalable vectors from a function, which (afaik) requires structs. Second, our C-level intrinsics support 'sizeless structs' containing scalable vector types, which is why we needed to support lowering and codegen for them.

We don't support global/static scalable vector variables in C, though; I think that would need some ELF changes, and afaik we're not looking into that.

>
>>>> This should probably be an intrinsic, not an llvm::Constant.  The design of llvm::Constant is already wrong: it shouldn’t have operations like divide, and it would be better to not contribute to the problem.
>>> Could you explain your position more on this? The Constant
>>> architecture has been a very natural fit for this concept from our
>>> perspective.
>>
>> It is appealing, but it is wrong.  Constant should really only model primitive constants (ConstantInt/FP, etc) and we should have one more form for “relocatable” constants.  Instead, we have intertwined constant folding and ConstantExpr logic that doesn’t make sense.
>>
>> A better pattern to follow are intrinsics like (e.g.) llvm.coro.size.i32(), which always returns a constant value.
> Ok, we'll investigate this issue further.

So I've looked into this, and have a question: would we be able to add an llvm.sve.vscale.i32 intrinsic (or whatever we end up naming it) to the ConstantExpr hierarchy? I didn't see that done for the coroutine intrinsics, but maybe I missed something. I do see the CoroSizeInst class for matching, though; that helps a bit.

To support shufflevector with scalable types, we had to relax the constraints on the mask from being constant literals to just a ConstantExpr; if we can't make it constant in some way we'd need to accept plain Values for the mask, I think.
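The main user of a non-literal mask here is the splat idiom. With scalable types, a splat could still be written entirely with constants, since the all-zeroes mask is expressible as zeroinitializer (a sketch under the RFC's syntax):

```llvm
; Sketch: splatting %x across a scalable vector. The all-zeroes mask
; is zeroinitializer, which stays within the Constant hierarchy even
; though the number of mask elements is unknown at compile time.
define <n x 4 x i32> @splat(i32 %x) {
  %ins = insertelement <n x 4 x i32> undef, i32 %x, i32 0
  %splat = shufflevector <n x 4 x i32> %ins, <n x 4 x i32> undef,
                         <n x 4 x i32> zeroinitializer
  ret <n x 4 x i32> %splat
}
```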

There are also potential issues in instruction selection if the masks aren't constant: if a non-ConstantExpr value is hoisted out of a given block, we might not have the right nodes to select against and could end up with terrible codegen.

-Graham

