[llvm-dev] Rotates, once again

[llvm-dev] Rotates, once again

Fabian Giesen via llvm-dev
Hi everyone!

I recently ran into some interesting issues with the generation of rotate instructions - the details are in the bug tracker (https://bugs.llvm.org/show_bug.cgi?id=37387 and related bugs) for those interested - and it brought up the issue of rotates in the IR again. This is a proposal that has been made (and rejected) several times, but I've been told that this time round we ran into issues not previously considered, and that I should bring the topic up on llvm-dev.

I'll try to summarize the status quo first and then bring up the tricky cases that might tip the balance in favor of rotate intrinsics.

Rotates
=======

Rotates, or circular shifts, are bit shifts where bits "dropped out" on one side of the operand are shifted back in on the other.

They occur less frequently in real-world code than regular (non-circular) bit shifts, but are nevertheless directly supported in most popular ISAs. They are useful primitives for cryptography, non-cryptographic hashing, pseudo-random number generation, bit I/O, and probably other cases as well.

They do not have a direct representation in C/C++, but several compilers support them via intrinsics (including MSVC, which Clang-CL tries to be compatible with), and some languages such as Rust include them explicitly.

The status quo
==============

Rotates by compile-time constant amounts are fairly easy to express in C-like languages by ORing together two opposite-direction shifts. For example, rotating a uint32_t x left by 3 positions in C boils down to

  (x << 3) | (x >> 29)

which is simple enough. The backends with rotate support detect this type of pattern at the DAG stage and map it to a rotate instruction.
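
Spelled out as a complete function (just a sketch; the name is illustrative):

#include <stdint.h>

/* Constant-distance rotate: both shift amounts are compile-time
   constants, so the OR-of-shifts pattern is easy to match and this
   typically compiles to a single rotate instruction where one exists. */
uint32_t rotl32_by3(uint32_t x) {
  return (x << 3) | (x >> 29);
}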

This, in a nutshell, is the argument against a direct rotate in the IR: the existing primitives are sufficient to model it, and doing so allows LLVM to leverage transforms and analyses such as SimplifyDemandedBits that know how to deal with the more common regular shifts to work on bit rotate expressions as well. Adding a new primitive would mean that a whole bunch of optimization passes would need to know how to deal with it (or else they might lose on optimization opportunities).

Variable-distance rotates
=========================

Things become murkier when we consider not just rotates by a compile-time constant, but also rotates where the amount is determined at run time.

For one thing, it's not immediately obvious how to write a well-defined corresponding C expression: the most "natural" version (for a 32-bit left rotate) is

  (x << k) | (x >> (32 - k))

but this is undefined for k==0 due to the right-shift of a 32-bit value by 32. That said, both GCC and Clang will generally compile this particular expression into a rotate-left as of this writing, but having UB is unfortunate. A trickier variant that is always well-defined is

  (x << (k & 31)) | (x >> (-k & 31))

but this is a substantially more complicated sub-DAG than the OR of two shifts you would see for a compile-time constant rotate distance, and there are more opportunities for things to go wrong along the way. Bug 37387 contains a few examples, the simplest of which was given by Sanjay:

void rotateInLoop(unsigned *x, unsigned N, unsigned *a, int b) {
  for (unsigned i = 0; i < N; ++i)
    x[ (a[i] >> b) | (a[i] << (32 - b)) ] = i; // shift amt is loop invariant
}

"32 - b" is loop-invariant and gets hoisted, and what's left of the expression doesn't match a known "rotate" pattern (since the DAG can't see across BB boundaries at the moment), so instead we get two shifts and an OR inside the loop.

The counterargument here was that GlobalISel should eventually fix this particular problem, but nevertheless, as of today, LICM breaks recognition of the rotate pattern in this (very simple) loop, and there is reason to believe that other transforms (existing or forthcoming) might well break the pattern as well; the issues referenced in the linked bug illustrate some of the existing failure modes.

This situation is frustrating for users of variable-distance rotates; right now, there simply is no way to write code in a source language that will reliably turn into rotates; you either have to watch the generated code closely, or just throw up your hands and use platform-dependent inline assembly. (It is assumed that the code in question is in a hot spot.)
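
For reference, the always-well-defined variant above, spelled out as a complete helper (a sketch; the name is illustrative):

#include <stdint.h>

/* Well-defined for all k, including k == 0: both shift amounts are
   masked into [0, 31], and since both halves come from the same word,
   the k == 0 case degenerates to x | x == x. Unsigned negation is
   defined (modulo 2^32) in C, so -k is fine here. */
uint32_t rotl32(uint32_t x, unsigned k) {
  return (x << (k & 31)) | (x >> (-k & 31));
}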

The proposal
============

The idea would be to add rotate intrinsics, primarily intended to be used for the variable-distance case, which is otherwise fragile.

Frontends can emit them directly (for source language-level intrinsics, or when matched from "known good" rotate patterns) and the backends can match them and select appropriate instructions.

For the constant-distance case, the preferred form would still be the OR-of-shifts variant, which plays nice with SimplifyDemandedBits and is known to existing optimization passes.

SimplifyDemandedBits does not currently do anything with shifts by non-constant amounts anyway; and, for the use cases in question, a rotate that is not touched by other optimization passes is still preferable to a more complex expression that can be partially optimized, but then ends up in a form that the backend doesn't know how to turn back into a rotate, which is a net loss.

Thoughts?

-Fabian

Re: [llvm-dev] Rotates, once again

David via llvm-dev
It might be worth generalizing this to an arbitrary bit-select.

My application needs to extract contiguous bits of multi-precision numbers, e.g. given a 128-bit number stored in an array a[2], we want to select 64 bits starting at bit position N:

    result = (a[1] << (64-N)) | (a[0] >> N);

which is basically the same as your rotate if a[1] == a[0].  Some machines (e.g. x86 and PA-RISC) have this instruction in hardware. On x86 this is a double-edged sword, as SHLD/SHRD may be microcoded and it may be faster just to expand to 2 shifts and an OR. Even so, it might help if I could encode this more directly.
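
Spelled out as a complete function (a sketch; as with the rotate discussion, note that the plain expression is undefined for N == 0 because of the shift by 64):

#include <stdint.h>

/* Select 64 bits of a 128-bit value starting at bit position n, with
   a[1] holding the high word; the n == 0 case is guarded because
   a[1] << 64 would be undefined. */
uint64_t extract64(const uint64_t a[2], unsigned n) {
  n &= 63;
  return n ? (a[1] << (64 - n)) | (a[0] >> n) : a[0];
}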



Re: [llvm-dev] Rotates, once again

Sanjay Patel via llvm-dev
Thanks for writing this up. I'd like to have this intrinsic too.

Another argument for having the intrinsic is shown in PR37426:
https://bugs.llvm.org/show_bug.cgi?id=37426

Vectorization goes overboard because the throughput cost model used by the vectorizers doesn't match the 6 IR instructions that correspond to 1 x86 rotate instruction. Instead, we have:

$ opt 37426prevectorize.ll -S -cost-model -analyze
...
Cost Model: Found an estimated cost of 1 for instruction:   %and = and i32 %cond, 31
Cost Model: Found an estimated cost of 1 for instruction:   %shl = shl i32 %1, %and
Cost Model: Found an estimated cost of 1 for instruction:   %sub = sub nsw i32 0, %cond
Cost Model: Found an estimated cost of 1 for instruction:   %and5 = and i32 %sub, 31
Cost Model: Found an estimated cost of 1 for instruction:   %shr = lshr i32 %1, %and5
Cost Model: Found an estimated cost of 1 for instruction:   %or = or i32 %shl, %shr

The broken cost model also affects unrolling and inlining. Size costs are overestimated for a target that has a rotate instruction.
This cost problem isn't limited to rotate patterns (it's come up in the context of min/max/abs/fma too). But it would be simpler if we had a rotate intrinsic, and the 6-to-1 margin is the biggest I've seen.

Another potential benefit of a generic intrinsic is that it can replace target-specific intrinsics. PowerPC and x86 have those. ARM translates source-level target-specific intrinsics into regular IR, so that could use the intrinsic too.


Re: [llvm-dev] Rotates, once again

Dave via llvm-dev
More food for thought:

1) Like bitwise rotate, Intel’s bitwise PDEP and PEXT instructions are expressible via multiple pure C bitwise operations. Nevertheless, clang/LLVM has intrinsics for these bitwise instructions.
2) If we imagine an alternate universe where C had rotate operators from the beginning, then LLVM would probably have rotate intrinsics too, and we’d be discussing whether the LLVM rotate intrinsics were worth keeping given that clang could just emit a series of simpler bitwise operations. I’d imagine the examples discussed earlier in this thread would be why LLVM would keep the rotate intrinsics (in this alternate universe).

Dave



Re: [llvm-dev] Rotates, once again

Krzysztof Parzyszek via llvm-dev
On 5/15/2018 5:34 PM, Sanjay Patel via llvm-dev wrote:
> Thanks for writing this up. I'd like to have this intrinsic too.

I'd vote for the "valign" variant that David proposed. It becomes a
rotate when both inputs are the same.

<ty> %result = @llvm.valign.<ty>(<ty> %a0, <ty> %a1, i32 s)
result = truncate (concat(a1,a0) >> s)

Where "concat" concatenates the bits of both inputs to form a value of a
twice as long type, and "truncate" takes the lower half of the bits of
its input.
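
For concreteness, a minimal C model of these semantics for ty == i32 (a
sketch; function names are illustrative):

#include <stdint.h>

/* valign for i32: a1 supplies the high bits, a0 the low bits.
   Assumes 0 <= s < 32; out-of-range amounts are left open here. */
uint32_t valign32(uint32_t a0, uint32_t a1, unsigned s) {
  uint64_t concat = ((uint64_t)a1 << 32) | a0;  /* concat(a1, a0) */
  return (uint32_t)(concat >> s);               /* truncate(... >> s) */
}

/* The rotate case: both inputs are the same word. */
uint32_t rotr32(uint32_t x, unsigned s) { return valign32(x, x, s); }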

-Krzysztof

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

Re: [llvm-dev] Rotates, once again

Manuel Jacob via llvm-dev
On 2018-05-16 00:34, Sanjay Patel via llvm-dev wrote:

> Vectorization goes overboard because the throughput cost model used
> by the vectorizers doesn't match the 6 IR instructions that
> correspond to 1 x86 rotate instruction. Instead, we have:
>
> [...]
>
> The broken cost model also affects unrolling and inlining. Size costs
> are overestimated for a target that has a rotate instruction.
> This cost problem isn't limited to rotate patterns (it's come up in
> the context of min/max/abs/fma too). But it would be simpler if we
> had a rotate intrinsic, and the 6-to-1 margin is the biggest I've
> seen.

Given that this is a general problem that occurs with other instruction
sequences as well, wouldn't it make more sense to make the cost model
handle more than one instruction, as suggested in PR31274 [1]?

[1] https://bugs.llvm.org/show_bug.cgi?id=31274

In all these cases (rotate, min, max, abs, fma, add-with-overflow, and
probably many more) there's a tradeoff between elaborating them as
simpler IR instructions and modelling them as their own instruction /
intrinsic.  A big disadvantage of introducing new instructions /
intrinsics is that all optimizations have to be told about this,
increasing the compiler code base and maintenance burden.  On the
other hand, too few instructions can make optimization difficult as
well (in theory, one instruction like "subtract and branch if not
equal to zero" could emulate all the others, but this wouldn't be very
helpful for optimization).  Since you put a lot of thought into how to
canonicalize IR, can you elaborate more on this tradeoff?  Can we find
a set of criteria to determine whether an instruction pattern should
get an intrinsic or not?

-Manuel

Re: [llvm-dev] Rotates, once again

Sanjay Patel via llvm-dev


On Wed, May 16, 2018 at 11:27 AM, Manuel Jacob <[hidden email]> wrote:
> On 2018-05-16 00:34, Sanjay Patel via llvm-dev wrote:
>> Vectorization goes overboard because the throughput cost model used
>> by the vectorizers doesn't match the 6 IR instructions that
>> correspond to 1 x86 rotate instruction. Instead, we have:
>>
>> [...]
>>
>> The broken cost model also affects unrolling and inlining. Size
>> costs are overestimated for a target that has a rotate instruction.
>> This cost problem isn't limited to rotate patterns (it's come up in
>> the context of min/max/abs/fma too). But it would be simpler if we
>> had a rotate intrinsic, and the 6-to-1 margin is the biggest I've
>> seen.
>
> Given that this is a general problem that occurs with other
> instruction sequences as well, wouldn't it make more sense to make
> the cost model handle more than one instruction, as suggested in
> PR31274 [1]?
>
> [1] https://bugs.llvm.org/show_bug.cgi?id=31274


Yes, I think everyone agrees that we need to improve the cost model interface. Note that there's a proposal for review for the 1 IR -> many machine ops variant of this question:

 
> In all these cases (rotate, min, max, abs, fma, add-with-overflow,
> and probably many more) there's a tradeoff between elaborating them
> as simpler IR instructions and modelling them as their own
> instruction / intrinsic. [...] Since you put a lot of thought into
> how to canonicalize IR, can you elaborate more on this tradeoff? Can
> we find a set of criteria to determine whether an instruction pattern
> should get an intrinsic or not?

I don't have formal criteria for this. Someone more knowledgeable may give us an answer.

An informal metric might be: if the operation is supported as a primitive op or built-in in source languages and it is supported as a single target instruction, can we guarantee that 1-to-1 translation through optimization?
We are failing that test in the loop-invariant C example presented here. We're also failing that test when operands have different sizes. Rotate goes in, logic and shifts come out.
We even failed that test for an even more basic case before this fix: https://reviews.llvm.org/D46656
We may still be failing the basic case for other source languages/flavors because encoding this operation in existing IR ops isn't obvious?

Another informal measure is: how do programmers express this operation without a builtin? If they're resorting to inline asm, that's a strong sign that the compiler needs an intrinsic.

Another metric could be: what is the probability that the optimizer can improve the codegen if the operation is broken down into multiple IR instructions?
As noted here, it's unlikely for rotate. If it is possible, then adding folds to instcombine for this intrinsic isn't hard. Are any other passes affected?

For reference, these are the current target-independent bit-manipulation intrinsics - bswap, bitreverse, ctpop, ctlz, cttz:

The LLVM cost for the proposed rotate intrinsic should be in the same range as those? Note that we would not just be adding code to support an intrinsic. There are already ~200 lines of DAG matching code for rotate, so we already have a cost for trying to get this optimized. We can remove that after adding canonicalization to the intrinsic in IR.




Re: [llvm-dev] Rotates, once again

John Regehr via llvm-dev
On 5/16/18 1:58 PM, Sanjay Patel via llvm-dev wrote:

> An informal metric might be: if the operation is supported as a
> primitive op or built-in in source languages and it is supported as a
> single target instruction, can we guarantee that 1-to-1 translation
> through optimization?

It seems perfectly reasonable for LLVM users to expect this to happen
reliably.

I'd like to take a look at the other side of the equation: the cost of
adding a new intrinsic in terms of teaching passes to see through it, so
we don't lose optimizations that worked before the intrinsic was added.

For example, clearly ValueTracking needs a few lines added so that
computeKnownBits and friends don't get stopped by a rotate. Anyone have
a reasonably complete list of files that need similar changes?

John

Re: [llvm-dev] Rotates, once again

Craig Topper via llvm-dev
What do we want to do with masking of the amount on this proposed intrinsic? Do we need an explicit AND to keep it in bounds? X86 can delete the AND during isel since the hardware is well behaved for out-of-range values. The hardware only masks to 5 bits for 8/16-bit rotates for the purpose of flags, but the data will be modulo the bit width. Since we don't use the flags from rotates, we can remove the mask. But if the mask is explicit in IR, then LICM might hoist it and isel won't see it to remove it.
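
To make the hoisting concern concrete, a C-level sketch of the same hazard (illustrative names):

#include <stdint.h>

/* After LICM, the mask on the amount is computed once in the preheader.
   SelectionDAG-based isel works one basic block at a time, so when it
   selects the rotate pattern inside the loop it no longer sees the AND
   and cannot prove it redundant. */
void rotate_all(uint32_t *dst, const uint32_t *src, unsigned n, unsigned k) {
  unsigned amt = k & 31;  /* loop-invariant; gets hoisted */
  for (unsigned i = 0; i < n; ++i)
    dst[i] = (src[i] << amt) | (src[i] >> (-amt & 31));
}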

~Craig



Re: [llvm-dev] Rotates, once again

Sanjay Patel via llvm-dev
A rotate intrinsic should be relatively close in cost/complexity to the existing bswap.

A grep of intrinsic::bswap says we'd probably add code in:
InstCombine
InstructionSimplify
ConstantFolding
DemandedBits
ValueTracking
VectorUtils
SelectionDAGBuilder

But I don't think it's fair to view those additions as pure added cost. As an example, consider that we have to add hacks to EarlyCSE to recognize multi-IR-instruction min/max/abs patterns. Intrinsics just work as-is there. So if you search for 'matchSelectPattern', you get an idea (I see 32 hits in 10 files) of the cost of *not* having intrinsics for those operations that we've decided are not worthy of intrinsics.



Re: [llvm-dev] Rotates, once again

John Regehr via llvm-dev
Thanks Sanjay!

At this point the cost/benefit tradeoff for rotate intrinsics seems
pretty good.

John



Re: [llvm-dev] Rotates, once again

Sanjay Patel via llvm-dev
I'm guessing nobody has started implementing any of the suggested rotate functionality since there are still open questions, but let me know if I'm wrong.

We're still getting patches that try to work around the current limitations ( https://reviews.llvm.org/D48705 ), so we should move forward since we've approximated/justified the cost and benefits.

Let's settle on the intrinsic definition(s).

1. There was a suggestion to generalize rotate to a "valign" or "double shift" (that's what x86 means with its poorly worded "double precision shift"). How does that work with vector types? The options are a full vector-size shift or a per-element shift. If it's the full vector, do we still want/need a specialized rotate intrinsic for per-element? If it's per-element, do we still want/need the other form for a full vector?

2. What is the behavior for a shift/rotate amount that is equal to or greater than the bit-width of the operand (or the bit-width of a vector element type)? Can we modulo that operand by the bit width, or does that not map well to the hardware semantics?


Re: [llvm-dev] Rotates, once again

Krzysztof Parzyszek via llvm-dev
On 7/2/2018 11:27 AM, Sanjay Patel via llvm-dev wrote:
>
> Let's settle on the intrinsic definition(s).
>
> 1. There was a suggestion to generalize rotate to a "valign" or "double
> shift" (that's what x86 means with its poorly worded "double precision
> shift"). How does that work with vector types? The options are a full
> vector-size shift or a per-element shift. If it's the full vector, do we
> still want/need a specialized rotate intrinsic for per-element? If it's
> per-element, do we still want/need the other form for a full vector?

The scalar rotate moves bits and such an operation doesn't make much
sense for moving data across lanes in vectors. I voted for the valign
variant originally, but I think that a per-element rotate would be the
natural vector version of the scalar operation.

It could still be doing the "double shift", since it's more general; it
just shouldn't be called valign. A per-byte cross-lane vector rotate (an
actual vector valign) can be implemented using shuffle, so I don't think
that an intrinsic for that is necessary.


-Krzysztof

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

Re: [llvm-dev] Rotates, once again

Steve via llvm-dev

Agreed. The per-element definition is the correct one.

– Steve

Re: [llvm-dev] Rotates, once again

Fabian Giesen via llvm-dev
1. I'm not sure what you mean by "full vector" here - using the same
shift distance for all lanes (as opposed to per-lane distances), or
doing a treat-the-vector-as-bag-of-bits shift that doesn't have any
internal lane boundaries? If the latter, that doesn't really help you
much with implementing a per-lane rotate.

I think the most useful generalization of a vector funnel shift in this
context is lane-wise

    result[i] = trunc(concat(a[i], b[i]) >> c[i])

(or the equivalent for a left shift); the special case a==b is a rotate.
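
A minimal C model of the lane-wise operation for 32-bit elements (a
sketch; assumes each c[i] is below 64 so the C shift is defined --
point 2 below is about the modulo question):

#include <stdint.h>

/* Lane-wise funnel shift right:
   result[i] = trunc(concat(a[i], b[i]) >> c[i]), a[i] high, b[i] low.
   The special case a == b is a per-lane rotate right. */
void vfshr32(uint32_t result[], const uint32_t a[], const uint32_t b[],
             const uint32_t c[], int n) {
  for (int i = 0; i < n; ++i) {
    uint64_t concat = ((uint64_t)a[i] << 32) | b[i];
    result[i] = (uint32_t)(concat >> c[i]);
  }
}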

2. For operand sizes that have native rotate instructions, at least x86,
x86-64, ARM A32/T32 and AArch64 A64 agree that rotate distances are
modulo the operand width. I believe PPC and MIPS do the same but am not
sure (it's been a while), no clue about other architectures.

It certainly seems the most natural way to define it, since rotates are
cyclic to begin with.

8- and 16-bit rotates will need to be lowered into multi-instruction
sequences on most targets (x86 and x86-64 can do them directly, but
RISC-lineage archs usually don't have rotates at smaller than word
size). Having explicit modulo semantics might end up forcing an
explicit extra AND there, so there is some extra cost, but it would
certainly be nice to have the rotate definition be total.

-Fabian


Re: [llvm-dev] Rotates, once again

Sanjay Patel via llvm-dev
I also agree that the per-element rotate for vectors is what we want for this intrinsic.

So I have this so far:
declare i32 @llvm.catshift.i32(i32 %a, i32 %b, i32 %shift_amount)
declare <2 x i32> @llvm.catshift.v2i32(<2 x i32> %a, <2 x i32> %b, <2 x i32> %shift_amount)
For scalars, @llvm.catshift concatenates %a and %b, shifts the concatenated value right by the number of bits specified by %shift_amount modulo the bit-width, and truncates to the original bit-width. 
For vectors, that operation occurs for each element of the vector:
   result[i] = trunc(concat(a[i], b[i]) >> c[i])
If %a == %b, this is equivalent to a bitwise rotate right. Rotate left may be implemented by subtracting the shift amount from the bit-width of the scalar type or vector element type.
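
A C reference model plus the rotate derivations (a sketch; names are illustrative):

#include <stdint.h>

/* Model of the proposed scalar @llvm.catshift.i32: %a supplies the high
   bits, %b the low bits, and the shift amount is taken modulo the
   bit-width, per the definition above. */
static uint32_t catshift32(uint32_t a, uint32_t b, uint32_t amt) {
  return (uint32_t)((((uint64_t)a << 32) | b) >> (amt & 31));
}

/* Rotate right is the %a == %b case... */
uint32_t rotr32(uint32_t x, uint32_t amt) { return catshift32(x, x, amt); }

/* ...and rotate left subtracts the amount from the bit-width. */
uint32_t rotl32(uint32_t x, uint32_t amt) { return catshift32(x, x, 32u - amt); }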


On Mon, Jul 2, 2018 at 2:37 PM, Fabian Giesen <[hidden email]> wrote:
1. I'm not sure what you mean by "full vector" here - using the same shift distance for all lanes (as opposed to per-lane distances), or doing a treat-the-vector-as-bag-of-bits shift that doesn't have any internal lane boundaries? If the latter, that doesn't really help you much with implementing a per-lane rotate.

I think the most useful generalization of a vector funnel shift in this context is lane-wise

   result[i] = trunc(concat(a[i], b[i]) >> c[i])

(or the equivalent for a left shift); the special case a==b is a rotate.

2. For operand sizes that have native rotate instructions, at least x86, x86-64, ARM A32/T32 and AArch64 A64 agree that rotate distances are modulo the operand width. I believe PPC and MIPS do the same but am not sure (it's been a while), no clue about other architectures.

It certainly seems the most natural way to define it, since rotates are cyclic to begin with.

8- and 16-bit rotates will need to be lowered into multi-instruction sequences on most targets (x86 and x86-64 can do them directly, but RISC-lineage archs usually don't have rotates at smaller than word size). Having explicit modulo semantics might end up forcing an explicit extra AND there, so that's an extra cost there, but it would certainly be nice to have the rotate definition be total.

-Fabian

On 07/02/2018 09:27 AM, Sanjay Patel wrote:
I'm guessing nobody has started implementing any of the suggested rotate functionality since there are still open questions, but let me know if I'm wrong.

We're still getting patches that try to work around the current limitations (https://reviews.llvm.org/D48705 <https://reviews.llvm.org/D48705> ), so we should move forward since we've approximated/justified the cost and benefits.

Let's settle on the intrinsic definition(s).

1. There was a suggestion to generalize rotate to a "valign" or "double shift" (that's what x86 means with its poorly worded "double precision shift"). How does that work with vector types? The options are a full vector-size shift or a per-element shift. If it's the full vector, do we still want/need a specialized rotate intrinsic for per-element? If it's per-element, do we still want/need the other form for a full vector?

2. What is the behavior for a shift/rotate amount that is equal to or greater than the bit-width of the operand (or the bit-width of a vector element type)? Can we take that operand modulo the bit-width, or does that not map well to the hardware semantics?

On Thu, May 17, 2018 at 5:23 PM, John Regehr <[hidden email]> wrote:

    Thanks Sanjay!

    At this point the cost/benefit tradeoff for rotate intrinsics
    seems pretty good.

    John


    On 05/17/2018 11:14 AM, Sanjay Patel wrote:

        A rotate intrinsic should be relatively close in
        cost/complexity to the existing bswap.

        A grep of intrinsic::bswap says we'd probably add code in:
        InstCombine
        InstructionSimplify
        ConstantFolding
        DemandedBits
        ValueTracking
        VectorUtils
        SelectionDAGBuilder

        But I don't think it's fair to view those additions as pure
        added cost. As an example, consider that we have to add hacks
        to EarlyCSE to recognize multi-IR-instruction min/max/abs
        patterns. Intrinsics just work as-is there. So if you search
        for 'matchSelectPattern', you get an idea (I see 32 hits in 10
        files) of the cost of *not* having intrinsics for those
        operations that we've decided are not worthy of intrinsics.


        On Wed, May 16, 2018 at 2:20 PM, John Regehr via llvm-dev
        <[hidden email]> wrote:

            On 5/16/18 1:58 PM, Sanjay Patel via llvm-dev wrote:

                An informal metric might be: if the operation is
                supported as a primitive op or built-in in source
                languages and it is supported as a single target
                instruction, can we guarantee that 1-to-1 translation
                holds through optimization?


            It seems perfectly reasonable for LLVM users to expect this to
            happen reliably.

            I'd like to take a look at the other side of the equation:
            the cost of adding a new intrinsic in terms of teaching
            passes to see through it, so we don't lose optimizations
            that worked before the intrinsic was added.

            For example, clearly ValueTracking needs a few lines added
            so that computeKnownBits and friends don't get stopped by a
            rotate. Anyone have a reasonably complete list of files
            that need similar changes?

            John


Re: [llvm-dev] Rotates, once again

Bruce Hoult via llvm-dev
On 7/2/2018 3:16 PM, Sanjay Patel wrote:

> [Sanjay's catshift proposal snipped; quoted in full above]
> If %a == %b, this is equivalent to a bitwise rotate right. Rotate left
> may be implemented by subtracting the shift amount from the bit-width of
> the scalar type or vector element type.

Or just negating, iff the shift amount is defined to be modulo and the
machine is two's complement.
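
Concretely, for 32 bits (a sketch; the helper names are mine, not anything
proposed here):

  #include <stdint.h>

  /* Valid because (-k) mod 32 == (32 - k) mod 32 on a two's complement
     machine once shift amounts are taken modulo the width. */
  uint32_t ror32(uint32_t x, uint32_t k) {
    return (x >> (k & 31)) | (x << (-k & 31));
  }
  uint32_t rol32(uint32_t x, uint32_t k) {
    return ror32(x, 0u - k);  /* negate rather than compute 32 - k */
  }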

I'm a bit worried that while modulo is the Obviously Right Thing for
rotates, the situation is less clear for general funnel shifts.

I looked over some of the ISAs I have docs at hand for:

- x86 (32b/64b variants) has SHRD/SHLD, so both right and left variants.
Count is modulo (mod 32 for 32b instruction variants, mod 64 for 64b
instruction variants). As of BMI2, we also get RORX (non-flag-setting
ROR) but no ROLX.

- ARM AArch64 has EXTR, which is a right funnel shift, but shift
distances must be literal constants. EXTR with both source registers
equal disassembles as ROR and is often special-cased in implementations.
(EXTR with source 1 != source 2 often has an extra cycle of latency).
There is RORV which is right rotate by a variable (register) amount;
there is no EXTRV.

- NVPTX has SHF
(https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#logic-and-shift-instructions-shf)
with both left/right shift variants and with both "clamp" (clamps shift
count at 32) and "wrap" (shift count taken mod 32) modes.

- GCN has v_alignbit_b32 which is a right funnel shift, and it seems to
be defined to take shift distances mod 32.
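
The clamp/wrap distinction from the NVPTX entry, in C terms (a sketch of
my reading of the SHF docs; the operand order is an assumption of mine):

  #include <stdint.h>

  uint32_t shf_r(uint32_t a, uint32_t b, uint32_t n, int wrap) {
    uint64_t cat = ((uint64_t)a << 32) | b;   /* concat(a, b) */
    uint32_t amt = wrap ? (n & 31)            /* "wrap": count mod 32 */
                        : (n > 32 ? 32 : n);  /* "clamp": saturate at 32 */
    return (uint32_t)(cat >> amt);
  }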

Based on that sampling, modulo behavior seems like a good choice for a
generic IR instruction, and if you're going to pick one direction, right
shifts are the one to use. Not sure about other ISAs.

-Fabian

Re: [llvm-dev] Rotates, once again

Bruce Hoult via llvm-dev


On 02/07/2018 23:36, Fabian Giesen via llvm-dev wrote:

> [Fabian's message snipped; quoted in full above]
Sorry for the late reply to this thread. I'd like to mention that the
existing ISD ROTL/ROTR opcodes currently do not properly assume modulo
behaviour, so that definition would need to be tidied up and made
explicit; the recent legalization code might need fixing as well. Are
you intending to add CONCATSHL/CONCATSRL ISD opcodes as well?

Additionally, the custom SSE lowering that I added doesn't assume modulo
(although I think the vXi8 lowering might already work), and it only
lowers ROTL at the moment (mainly a legacy of how the XOP instructions
work), but adding ROTR support shouldn't be difficult.


Re: [llvm-dev] Rotates, once again

Bruce Hoult via llvm-dev
Yes, if we're going to define this as the more general 2-operand funnel shift, then we might as well add the matching DAG defs and adjust the existing ROTL/ROTR defs to match the modulo semantics.

I hadn't heard the "funnel shift" terminology before, but let's go with that because it's more descriptive/accurate than concat+shift.

We will need to define both left and right variants. Otherwise, we risk losing the negation/subtraction of the shift amount via other transforms and defeat the point of defining the full operation.
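
To see why the two directions aren't freely interchangeable, consider a 32-bit sketch (helper names are illustrative, not the proposed intrinsics): left-by-s matches right-by-(32 - s) only when s != 0, so rewriting one direction as the other plus a subtraction silently changes the s == 0 case.

  #include <assert.h>
  #include <stdint.h>

  static uint32_t fsh_r(uint32_t a, uint32_t b, uint32_t s) {
    uint64_t cat = ((uint64_t)a << 32) | b;
    return (uint32_t)(cat >> (s & 31));          /* low 32 bits */
  }
  static uint32_t fsh_l(uint32_t a, uint32_t b, uint32_t s) {
    uint64_t cat = ((uint64_t)a << 32) | b;
    return (uint32_t)((cat << (s & 31)) >> 32);  /* high 32 bits */
  }

  int main(void) {
    uint32_t a = 0x12345678u, b = 0x9ABCDEF0u;
    for (uint32_t s = 1; s < 32; ++s)
      assert(fsh_l(a, b, s) == fsh_r(a, b, 32 - s));
    assert(fsh_l(a, b, 0) == a);  /* left yields %a at s == 0... */
    assert(fsh_r(a, b, 0) == b);  /* ...but right yields %b */
    return 0;
  }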

A few more examples to add to Fabian's:

- x86 AVX512 added vprol*/vpror* instructions for 32/64-bit element vector types with constant and variable rotate amounts. Per the docs, "the count operand modulo the data size (32 or 64) is used".

- PowerPC defined scalar rotates with 'rl*' (everything is based on rotating left). Similarly, Altivec only has 'vrl*' instructions for vectors, and all ops rotate modulo the element size. The funnel op is called "vsldoi". So again, it goes left.



On Sun, Jul 8, 2018 at 7:23 AM, Simon Pilgrim <[hidden email]> wrote:


[Simon's reply snipped; it and the messages it quotes appear in full above]




Re: [llvm-dev] Rotates, once again

Bruce Hoult via llvm-dev
Initial patch proposal:

I tried to add anyone that replied to this thread as a potential reviewer. Please add yourself if I missed you.

On Sun, Jul 8, 2018 at 9:23 AM, Sanjay Patel <[hidden email]> wrote:
[Sanjay's message snipped; quoted in full above]



