[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

Alex Bradbury via llvm-dev
Hi all,

I've been looking into how to implement the more advanced Shader Model
6 reduction operations in radv (and obviously most of the work would
be useful for radeonsi too). They're explained in the spec for
GL_AMD_shader_ballot at
https://www.khronos.org/registry/OpenGL/extensions/AMD/AMD_shader_ballot.txt,
but I'll summarize them here. There are two types of operations:
reductions that always return a uniform value, and prefix scan
operations. The reductions can be implemented in terms of the prefix
scan (although in practice I don't think we want to implement them in
exactly the same way), and the concerns are mostly the same, so I'll
focus on the prefix scan operations for now. Given an operation `op'
and an input value `a' (that's really a SIMD array with one value per
invocation, even though it's a scalar value in LLVM), the prefix scan
returns a[0] in invocation 0, a[0] `op' a[1] in invocation 1, a[0]
`op' a[1] `op' a[2] in invocation 2, etc. The prefix scan will also
work for non-uniform control flow: it simply skips inactive
invocations.
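
To make the semantics concrete, here is a rough C-like sketch of what the
inclusive scan computes for each invocation (purely illustrative; op(),
WAVE_SIZE and the arrays are stand-ins, and inactive invocations are simply
left out of the loop):

// result[i] = a[0] `op' a[1] `op' ... `op' a[i], computed naively.
for (int i = 0; i < WAVE_SIZE; i++) {
    result[i] = a[0];
    for (int j = 1; j <= i; j++)
        result[i] = op(result[i], a[j]);
}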

On the LLVM side, I think that we have most of the AMD-specific
low-level shuffle intrinsics implemented that you need to do this, but
I can think of a few concerns/questions. First of all, to implement
the prefix scan, we'll need to do a code sequence that looks like
this, modified from
http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ (replace
v_foo_f32 with the appropriate operation):

; v0 is the input register
v_mov_b32 v1, v0
v_foo_f32 v1, v0, v1 row_shr:1 // Instruction 1
v_foo_f32 v1, v0, v1 row_shr:2 // Instruction 2
v_foo_f32 v1, v0, v1 row_shr:3 // Instruction 3
v_nop // Add two independent instructions to avoid a data hazard
v_nop
v_foo_f32 v1, v1, v1 row_shr:4 bank_mask:0xe // Instruction 4
v_nop // Add two independent instructions to avoid a data hazard
v_nop
v_foo_f32 v1, v1, v1 row_shr:8 bank_mask:0xc // Instruction 5
v_nop // Add two independent instructions to avoid a data hazard
v_nop
v_foo_f32 v1, v1, v1 row_bcast:15 row_mask:0xa // Instruction 6
v_nop // Add two independent instructions to avoid a data hazard
v_nop
v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc // Instruction 7

The problem is that the way these instructions use the DPP word isn't
currently expressible in LLVM. We have the llvm.amdgcn.mov_dpp
intrinsic, but it isn't enough. For example, take the first
instruction:

v_foo_f32 v1, v0, v1 row_shr:1

What it's doing is shifting v0 right by one within each row and applying
`foo' to the shifted value and v1. v1 stays the same in the first lane of
each row, however.
With llvm.amdgcn.mov_dpp, we could try to express it as something like
this, in LLVM-like pseudocode:

%tmp = call llvm.amdgcn.mov_dpp %input row_shr:1
%result = foo %tmp, %input

but this is incorrect. If I'm reading the source correctly, this will
make %tmp garbage in lane 0 (since it just turns into a normal move
with the dpp modifier, and no restrictions on the destination). We
could set bound_ctrl to 0 to work around this, since it will make %tmp
0 in lane 0, but that won't work with operations whose identity is
non-0 like min and max. What we need is something like:

%result = call llvm.amdgcn.foo_dpp %result, %input, %result row_shr:1

where llvm.amdgcn.foo_dpp copies the first argument to the result,
then applies the DPP swizzling to the second argument and applies 'foo'
to the swizzled value and the third argument. It would mean that we'd have a
separate intrinsic for every operation we care about, but I can't
think of a better way to express it. Is there a better way that
doesn't involve creating an intrinsic for each operation?
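
To make the comparison concrete, here is a rough LLVM-IR sketch of the two
forms (illustrative only: the trailing operands of llvm.amdgcn.mov_dpp are
assumed to be the usual dpp_ctrl/row_mask/bank_mask/bound_ctrl immediates,
with row_shr:1 encoded as 273 (0x111) and the masks left at 0xf, and
llvm.amdgcn.foo.dpp is the hypothetical per-operation intrinsic described
above):

; Form 1: separate swizzle + op. Lane 0 of %tmp is 0 (with bound_ctrl set)
; or unspecified, which is only correct when 0 is the identity of 'foo'.
%tmp      = call i32 @llvm.amdgcn.mov.dpp.i32(i32 %input, i32 273, i32 15, i32 15, i1 false)
%result.a = add i32 %tmp, %input    ; 'foo' = integer add, for illustration

; Form 2: hypothetical combined intrinsic. The first operand supplies the
; value that survives in lanes the DPP swizzle doesn't write, mirroring
; "v_foo_f32 v1, v0, v1 row_shr:1" (%old stands for the accumulator).
%result.b = call i32 @llvm.amdgcn.foo.dpp.i32(i32 %old, i32 %input, i32 %old, i32 273, i32 15, i32 15, i1 false)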

Next, there's the fact that this code sequence only works when the
active lanes are densely-packed, but we have to make this work even
when control flow is non-uniform. Essentially, we need to "skip over"
the inactive lanes by setting them to the identity, and then we need
to enable them in the exec mask when doing the reduction to make sure
they pass along the correct result. That is, to handle non-uniform
control flow, we need something like:

invert EXEC
result = identity
set EXEC to ~0
<original code sequence>
restore original EXEC
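
In GCN assembly, that wrapper might look roughly like this (a sketch only,
assuming wave64, that s[2:3] is free to hold the saved mask, and using 0 as
a placeholder identity; the real identity depends on the operation):

s_mov_b64 s[2:3], exec    ; save the original EXEC
s_not_b64 exec, exec      ; flip to the previously *inactive* lanes...
v_mov_b32 v0, 0           ; ...and seed their input with the identity
s_mov_b64 exec, -1        ; enable every lane for the scan itself
<original code sequence>  ; operates on v0/v1 as above
s_mov_b64 exec, s[2:3]    ; restore the original EXEC

The first three instructions are roughly what the set_inactive-style
intrinsic described below would have to expand to.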

I imagine we'd need to add some special llvm.amdgcn.set_inactive_lanes
intrinsic that returns the first argument with inactive lanes set to
the second argument. We'd also need something like WQM to make all the
lanes active during the sequence. But that raises some hairy
requirements for register allocation. For example, in something like:

foo = ...
if (...) {
    bar = minInvocationsInclusiveScanAMD(...)
} else {
    ... = foo;
}

we have to make sure that foo isn't allocated to the same register as
one of the temporaries used inside minInvocationsInclusiveScanAMD(),
though they don't interfere. That's because the implementation of
minInvocationsInclusiveScanAMD() will do funny things with the exec
mask, possibly overwriting foo, if the condition is non-uniform. Or
consider the following:

do {
   bar = minInvocationsInclusiveScanAMD(...);
   // ...
   ... = bar; // last use of bar
  foo = ...;
} while (...);

... = foo;

again, foo and the temporaries used to compute bar can't be assigned
to the same register, even though their live ranges don't intersect,
since minInvocationsInclusiveScanAMD() may overwrite the value of foo
in a previous iteration if the loop exit condition isn't uniform. How
can we express this in the backend? I don't know much about the LLVM
infrastructure, so I'm not sure if it's relatively easy or really hard.

Thanks,

Connor
_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

Alex Bradbury via llvm-dev
cc some people who have worked on this.

On 06/12/2017 05:58 PM, Connor Abbott via llvm-dev wrote:

> Hi all,
>
> I've been looking into how to implement the more advanced Shader Model
> 6 reduction operations in radv (and obviously most of the work would
> be useful for radeonsi too). They're explained in the spec for
> GL_AMD_shader_ballot at
> https://www.khronos.org/registry/OpenGL/extensions/AMD/AMD_shader_ballot.txt,
> but I'll summarize them here. There are two types of operations:
> reductions that always return a uniform value, and prefix scan
> operations. The reductions can be implemented in terms of the prefix
> scan (although in practice I don't think we want to implement them in
> exactly the same way), and the concerns are mostly the same, so I'll
> focus on the prefix scan operations for now. Given an operation `op'
> and an input value `a' (that's really a SIMD array with one value per
> invocation, even though it's a scalar value in LLVM), the prefix scan
> returns a[0] in invocation 0, a[0] `op' a[1] in invocation 1, a[0]
> `op' a[1] `op' a[2] in invocation 2, etc. The prefix scan will also
> work for non-uniform control flow: it simply skips inactive
> invocations.
>
> On the LLVM side, I think that we have most of the AMD-specific
> low-level shuffle intrinsics implemented that you need to do this, but
> I can think of a few concerns/questions. First of all, to implement
> the prefix scan, we'll need to do a code sequence that looks like
> this, modified from
> http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ (replace
> v_foo_f32 with the appropriate operation):
>
> ; v0 is the input register
> v_mov_b32 v1, v0
> v_foo_f32 v1, v0, v1 row_shr:1 // Instruction 1
> v_foo_f32 v1, v0, v1 row_shr:2 // Instruction 2
> v_foo_f32 v1, v0, v1 row_shr:3 // Instruction 3
> v_nop // Add two independent instructions to avoid a data hazard
> v_nop
> v_foo_f32 v1, v1, v1 row_shr:4 bank_mask:0xe // Instruction 4
> v_nop // Add two independent instructions to avoid a data hazard
> v_nop
> v_foo_f32 v1, v1, v1 row_shr:8 bank_mask:0xc // Instruction 5
> v_nop // Add two independent instructions to avoid a data hazard
> v_nop
> v_foo_f32 v1, v1, v1 row_bcast:15 row_mask:0xa // Instruction 6
> v_nop // Add two independent instructions to avoid a data hazard
> v_nop
> v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc // Instruction 7
>
> The problem is that the way these instructions use the DPP word isn't
> currently expressible in LLVM. We have the llvm.amdgcn.mov_dpp
> intrinsic, but it isn't enough. For example, take the first
> instruction:
>
> v_foo_f32 v1, v0, v1 row_shr:1
>
> What it's doing is shifting v0 right by one within each row and adding
> it to v1. v1 stays the same in the first lane of each row, however.
> With llvm.amdgcn.mov_dpp, we could try to express it as something like
> this, in LLVM-like pseduocode:
>
> %tmp = call llvm.amdgcn.mov_dpp %input row_shr:1
> %result = foo %tmp, %input
>
> but this is incorrect. If I'm reading the source correctly, this will
> make %tmp garbage in lane 0 (since it just turns into a normal move
> with the dpp modifier, and no restrictions on the destination). We
> could set bound_ctrl to 0 to work around this, since it will make %tmp
> 0 in lane 0, but that won't work with operations whose identity is
> non-0 like min and max. What we need is something like:
>
> %result = call llvm.amdgcn.foo_dpp %result, %input, %result row_shr:1
>
> where llvm.amdgcn.foo_dpp copies the first argument to the result,
> then applies the DPP swizzling to the second argument and does 'foo'
> to the second and third arguments. It would mean that we'd have a
> separate intrinsic for every operation we care about, but I can't
> think of a better way to express it. Is there a better way that
> doesn't involve creating an intrinsic for each operation?
>
> Next, there's the fact that this code sequence only works when the
> active lanes are densely-packed, but we have to make this work even
> when control flow is non-uniform. Essentially, we need to "skip over"
> the inactive lanes by setting them to the identity, and then we need
> to enable them in the exec mask when doing the reduction to make sure
> they pass along the correct result. That is, to handle non-uniform
> control flow, we need something like:
>
> invert EXEC
> result = identity
> set EXEC to ~0
> <original code sequence>
> restore original EXEC
>
> I imagine we'd need to add some special llvm.amdcgn.set_inactive_lanes
> intrinsic that returns the first argument with inactive lanes set to
> the second argument. We'd also need something like WQM to make all the
> lanes active during the sequence. But that raises some hairy
> requirements for register allocation. For example, in something like:
>
> foo = ...
> if (...) {
>     bar = minInvocationsInclusiveScanAMD(...)
> } else {
>     ... = foo;
> }
>
> we have to make sure that foo isn't allocated to the same register as
> one of the temporaries used inside minInvocationsInclusiveScanAMD(),
> though they don't interfere. That's because the implementation of
> minInvocationsInclusiveScanAMD() will do funny things with the exec
> mask, possibly overwriting foo, if the condition is non-uniform. Or
> consider the following:
>
> do {
>    bar = minInvocationsInclusiveScanAMD(...);
>    // ...
>    ... = bar; // last use of bar
>   foo = ...;
> } while (...);
>
> ... = foo;
>
> again, foo and the temporaries used to compute bar can't be assigned
> to the same register, even though their live ranges don't intersect,
> since minInvocationsInclusiveScanAMD() may overwrite the value of foo
> in a previous iteration if the loop exit condition isn't uniform. How
> can we express this in the backend? I don't know much about the LLVM
> infrastucture, so I'm not sure if it's relatively easy or really hard.
>
> Thanks,
>
> Connor
> _______________________________________________
> LLVM Developers mailing list
> [hidden email]
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>

_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

Alex Bradbury via llvm-dev
On 06/12/2017 07:15 PM, Tom Stellard via llvm-dev wrote:

> cc some people who have worked on this.
>
> On 06/12/2017 05:58 PM, Connor Abbott via llvm-dev wrote:
>> Hi all,
>>
>> I've been looking into how to implement the more advanced Shader Model
>> 6 reduction operations in radv (and obviously most of the work would
>> be useful for radeonsi too). They're explained in the spec for
>> GL_AMD_shader_ballot at
>> https://www.khronos.org/registry/OpenGL/extensions/AMD/AMD_shader_ballot.txt,
>> but I'll summarize them here. There are two types of operations:
>> reductions that always return a uniform value, and prefix scan
>> operations. The reductions can be implemented in terms of the prefix
>> scan (although in practice I don't think we want to implement them in
>> exactly the same way), and the concerns are mostly the same, so I'll
>> focus on the prefix scan operations for now. Given an operation `op'
>> and an input value `a' (that's really a SIMD array with one value per
>> invocation, even though it's a scalar value in LLVM), the prefix scan
>> returns a[0] in invocation 0, a[0] `op' a[1] in invocation 1, a[0]
>> `op' a[1] `op' a[2] in invocation 2, etc. The prefix scan will also
>> work for non-uniform control flow: it simply skips inactive
>> invocations.
>>
>> On the LLVM side, I think that we have most of the AMD-specific
>> low-level shuffle intrinsics implemented that you need to do this, but
>> I can think of a few concerns/questions. First of all, to implement
>> the prefix scan, we'll need to do a code sequence that looks like
>> this, modified from
>> http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ (replace
>> v_foo_f32 with the appropriate operation):
>>
>> ; v0 is the input register
>> v_mov_b32 v1, v0
>> v_foo_f32 v1, v0, v1 row_shr:1 // Instruction 1
>> v_foo_f32 v1, v0, v1 row_shr:2 // Instruction 2
>> v_foo_f32 v1, v0, v1 row_shr:3 // Instruction 3
>> v_nop // Add two independent instructions to avoid a data hazard
>> v_nop
>> v_foo_f32 v1, v1, v1 row_shr:4 bank_mask:0xe // Instruction 4
>> v_nop // Add two independent instructions to avoid a data hazard
>> v_nop
>> v_foo_f32 v1, v1, v1 row_shr:8 bank_mask:0xc // Instruction 5
>> v_nop // Add two independent instructions to avoid a data hazard
>> v_nop
>> v_foo_f32 v1, v1, v1 row_bcast:15 row_mask:0xa // Instruction 6
>> v_nop // Add two independent instructions to avoid a data hazard
>> v_nop
>> v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc // Instruction 7
>>
>> The problem is that the way these instructions use the DPP word isn't
>> currently expressible in LLVM. We have the llvm.amdgcn.mov_dpp
>> intrinsic, but it isn't enough. For example, take the first
>> instruction:
>>
>> v_foo_f32 v1, v0, v1 row_shr:1
>>
>> What it's doing is shifting v0 right by one within each row and adding
>> it to v1. v1 stays the same in the first lane of each row, however.
>> With llvm.amdgcn.mov_dpp, we could try to express it as something like
>> this, in LLVM-like pseduocode:
>>
>> %tmp = call llvm.amdgcn.mov_dpp %input row_shr:1
>> %result = foo %tmp, %input
>>
>> but this is incorrect. If I'm reading the source correctly, this will
>> make %tmp garbage in lane 0 (since it just turns into a normal move
>> with the dpp modifier, and no restrictions on the destination). We
>> could set bound_ctrl to 0 to work around this, since it will make %tmp
>> 0 in lane 0, but that won't work with operations whose identity is
>> non-0 like min and max. What we need is something like:
>>

Why is %tmp garbage?  I thought the two options were 0 (bound_ctrl =0)
or %input (bound_ctrl = 1)?

-Tom


>> %result = call llvm.amdgcn.foo_dpp %result, %input, %result row_shr:1
>>
>> where llvm.amdgcn.foo_dpp copies the first argument to the result,
>> then applies the DPP swizzling to the second argument and does 'foo'
>> to the second and third arguments. It would mean that we'd have a
>> separate intrinsic for every operation we care about, but I can't
>> think of a better way to express it. Is there a better way that
>> doesn't involve creating an intrinsic for each operation?
>>
>> Next, there's the fact that this code sequence only works when the
>> active lanes are densely-packed, but we have to make this work even
>> when control flow is non-uniform. Essentially, we need to "skip over"
>> the inactive lanes by setting them to the identity, and then we need
>> to enable them in the exec mask when doing the reduction to make sure
>> they pass along the correct result. That is, to handle non-uniform
>> control flow, we need something like:
>>
>> invert EXEC
>> result = identity
>> set EXEC to ~0
>> <original code sequence>
>> restore original EXEC
>>
>> I imagine we'd need to add some special llvm.amdcgn.set_inactive_lanes
>> intrinsic that returns the first argument with inactive lanes set to
>> the second argument. We'd also need something like WQM to make all the
>> lanes active during the sequence. But that raises some hairy
>> requirements for register allocation. For example, in something like:
>>
>> foo = ...
>> if (...) {
>>     bar = minInvocationsInclusiveScanAMD(...)
>> } else {
>>     ... = foo;
>> }
>>
>> we have to make sure that foo isn't allocated to the same register as
>> one of the temporaries used inside minInvocationsInclusiveScanAMD(),
>> though they don't interfere. That's because the implementation of
>> minInvocationsInclusiveScanAMD() will do funny things with the exec
>> mask, possibly overwriting foo, if the condition is non-uniform. Or
>> consider the following:
>>
>> do {
>>    bar = minInvocationsInclusiveScanAMD(...);
>>    // ...
>>    ... = bar; // last use of bar
>>   foo = ...;
>> } while (...);
>>
>> ... = foo;
>>
>> again, foo and the temporaries used to compute bar can't be assigned
>> to the same register, even though their live ranges don't intersect,
>> since minInvocationsInclusiveScanAMD() may overwrite the value of foo
>> in a previous iteration if the loop exit condition isn't uniform. How
>> can we express this in the backend? I don't know much about the LLVM
>> infrastucture, so I'm not sure if it's relatively easy or really hard.
>>
>> Thanks,
>>
>> Connor
>> _______________________________________________
>> LLVM Developers mailing list
>> [hidden email]
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>
> _______________________________________________
> LLVM Developers mailing list
> [hidden email]
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>

_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

Alex Bradbury via llvm-dev
On Mon, Jun 12, 2017 at 4:56 PM, Tom Stellard <[hidden email]> wrote:

> On 06/12/2017 07:15 PM, Tom Stellard via llvm-dev wrote:
>> cc some people who have worked on this.
>>
>> On 06/12/2017 05:58 PM, Connor Abbott via llvm-dev wrote:
>>> Hi all,
>>>
>>> I've been looking into how to implement the more advanced Shader Model
>>> 6 reduction operations in radv (and obviously most of the work would
>>> be useful for radeonsi too). They're explained in the spec for
>>> GL_AMD_shader_ballot at
>>> https://www.khronos.org/registry/OpenGL/extensions/AMD/AMD_shader_ballot.txt,
>>> but I'll summarize them here. There are two types of operations:
>>> reductions that always return a uniform value, and prefix scan
>>> operations. The reductions can be implemented in terms of the prefix
>>> scan (although in practice I don't think we want to implement them in
>>> exactly the same way), and the concerns are mostly the same, so I'll
>>> focus on the prefix scan operations for now. Given an operation `op'
>>> and an input value `a' (that's really a SIMD array with one value per
>>> invocation, even though it's a scalar value in LLVM), the prefix scan
>>> returns a[0] in invocation 0, a[0] `op' a[1] in invocation 1, a[0]
>>> `op' a[1] `op' a[2] in invocation 2, etc. The prefix scan will also
>>> work for non-uniform control flow: it simply skips inactive
>>> invocations.
>>>
>>> On the LLVM side, I think that we have most of the AMD-specific
>>> low-level shuffle intrinsics implemented that you need to do this, but
>>> I can think of a few concerns/questions. First of all, to implement
>>> the prefix scan, we'll need to do a code sequence that looks like
>>> this, modified from
>>> http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ (replace
>>> v_foo_f32 with the appropriate operation):
>>>
>>> ; v0 is the input register
>>> v_mov_b32 v1, v0
>>> v_foo_f32 v1, v0, v1 row_shr:1 // Instruction 1
>>> v_foo_f32 v1, v0, v1 row_shr:2 // Instruction 2
>>> v_foo_f32 v1, v0, v1 row_shr:3 // Instruction 3
>>> v_nop // Add two independent instructions to avoid a data hazard
>>> v_nop
>>> v_foo_f32 v1, v1, v1 row_shr:4 bank_mask:0xe // Instruction 4
>>> v_nop // Add two independent instructions to avoid a data hazard
>>> v_nop
>>> v_foo_f32 v1, v1, v1 row_shr:8 bank_mask:0xc // Instruction 5
>>> v_nop // Add two independent instructions to avoid a data hazard
>>> v_nop
>>> v_foo_f32 v1, v1, v1 row_bcast:15 row_mask:0xa // Instruction 6
>>> v_nop // Add two independent instructions to avoid a data hazard
>>> v_nop
>>> v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc // Instruction 7
>>>
>>> The problem is that the way these instructions use the DPP word isn't
>>> currently expressible in LLVM. We have the llvm.amdgcn.mov_dpp
>>> intrinsic, but it isn't enough. For example, take the first
>>> instruction:
>>>
>>> v_foo_f32 v1, v0, v1 row_shr:1
>>>
>>> What it's doing is shifting v0 right by one within each row and adding
>>> it to v1. v1 stays the same in the first lane of each row, however.
>>> With llvm.amdgcn.mov_dpp, we could try to express it as something like
>>> this, in LLVM-like pseduocode:
>>>
>>> %tmp = call llvm.amdgcn.mov_dpp %input row_shr:1
>>> %result = foo %tmp, %input
>>>
>>> but this is incorrect. If I'm reading the source correctly, this will
>>> make %tmp garbage in lane 0 (since it just turns into a normal move
>>> with the dpp modifier, and no restrictions on the destination). We
>>> could set bound_ctrl to 0 to work around this, since it will make %tmp
>>> 0 in lane 0, but that won't work with operations whose identity is
>>> non-0 like min and max. What we need is something like:
>>>
>
> Why is %tmp garbage?  I thought the two options were 0 (bound_ctrl =0)
> or %input (bound_ctrl = 1)?

Oh, maybe it is... for that to happen the underlying move would need
to have the source and destination constrained to be the same. I
couldn't see that constraint anywhere I looked, but I'm not an expert,
so I may have overlooked it. In any case, that behavior still isn't
what we need in order to implement the prefix scan operations
efficiently.

Connor

>
> -Tom
>
>
>>> %result = call llvm.amdgcn.foo_dpp %result, %input, %result row_shr:1
>>>
>>> where llvm.amdgcn.foo_dpp copies the first argument to the result,
>>> then applies the DPP swizzling to the second argument and does 'foo'
>>> to the second and third arguments. It would mean that we'd have a
>>> separate intrinsic for every operation we care about, but I can't
>>> think of a better way to express it. Is there a better way that
>>> doesn't involve creating an intrinsic for each operation?
>>>
>>> Next, there's the fact that this code sequence only works when the
>>> active lanes are densely-packed, but we have to make this work even
>>> when control flow is non-uniform. Essentially, we need to "skip over"
>>> the inactive lanes by setting them to the identity, and then we need
>>> to enable them in the exec mask when doing the reduction to make sure
>>> they pass along the correct result. That is, to handle non-uniform
>>> control flow, we need something like:
>>>
>>> invert EXEC
>>> result = identity
>>> set EXEC to ~0
>>> <original code sequence>
>>> restore original EXEC
>>>
>>> I imagine we'd need to add some special llvm.amdcgn.set_inactive_lanes
>>> intrinsic that returns the first argument with inactive lanes set to
>>> the second argument. We'd also need something like WQM to make all the
>>> lanes active during the sequence. But that raises some hairy
>>> requirements for register allocation. For example, in something like:
>>>
>>> foo = ...
>>> if (...) {
>>>     bar = minInvocationsInclusiveScanAMD(...)
>>> } else {
>>>     ... = foo;
>>> }
>>>
>>> we have to make sure that foo isn't allocated to the same register as
>>> one of the temporaries used inside minInvocationsInclusiveScanAMD(),
>>> though they don't interfere. That's because the implementation of
>>> minInvocationsInclusiveScanAMD() will do funny things with the exec
>>> mask, possibly overwriting foo, if the condition is non-uniform. Or
>>> consider the following:
>>>
>>> do {
>>>    bar = minInvocationsInclusiveScanAMD(...);
>>>    // ...
>>>    ... = bar; // last use of bar
>>>   foo = ...;
>>> } while (...);
>>>
>>> ... = foo;
>>>
>>> again, foo and the temporaries used to compute bar can't be assigned
>>> to the same register, even though their live ranges don't intersect,
>>> since minInvocationsInclusiveScanAMD() may overwrite the value of foo
>>> in a previous iteration if the loop exit condition isn't uniform. How
>>> can we express this in the backend? I don't know much about the LLVM
>>> infrastucture, so I'm not sure if it's relatively easy or really hard.
>>>
>>> Thanks,
>>>
>>> Connor
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> [hidden email]
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> [hidden email]
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>
_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

Alex Bradbury via llvm-dev
On 06/12/2017 08:03 PM, Connor Abbott wrote:

> On Mon, Jun 12, 2017 at 4:56 PM, Tom Stellard <[hidden email]> wrote:
>> On 06/12/2017 07:15 PM, Tom Stellard via llvm-dev wrote:
>>> cc some people who have worked on this.
>>>
>>> On 06/12/2017 05:58 PM, Connor Abbott via llvm-dev wrote:
>>>> Hi all,
>>>>
>>>> I've been looking into how to implement the more advanced Shader Model
>>>> 6 reduction operations in radv (and obviously most of the work would
>>>> be useful for radeonsi too). They're explained in the spec for
>>>> GL_AMD_shader_ballot at
>>>> https://www.khronos.org/registry/OpenGL/extensions/AMD/AMD_shader_ballot.txt,
>>>> but I'll summarize them here. There are two types of operations:
>>>> reductions that always return a uniform value, and prefix scan
>>>> operations. The reductions can be implemented in terms of the prefix
>>>> scan (although in practice I don't think we want to implement them in
>>>> exactly the same way), and the concerns are mostly the same, so I'll
>>>> focus on the prefix scan operations for now. Given an operation `op'
>>>> and an input value `a' (that's really a SIMD array with one value per
>>>> invocation, even though it's a scalar value in LLVM), the prefix scan
>>>> returns a[0] in invocation 0, a[0] `op' a[1] in invocation 1, a[0]
>>>> `op' a[1] `op' a[2] in invocation 2, etc. The prefix scan will also
>>>> work for non-uniform control flow: it simply skips inactive
>>>> invocations.
>>>>
>>>> On the LLVM side, I think that we have most of the AMD-specific
>>>> low-level shuffle intrinsics implemented that you need to do this, but
>>>> I can think of a few concerns/questions. First of all, to implement
>>>> the prefix scan, we'll need to do a code sequence that looks like
>>>> this, modified from
>>>> http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ (replace
>>>> v_foo_f32 with the appropriate operation):
>>>>
>>>> ; v0 is the input register
>>>> v_mov_b32 v1, v0
>>>> v_foo_f32 v1, v0, v1 row_shr:1 // Instruction 1
>>>> v_foo_f32 v1, v0, v1 row_shr:2 // Instruction 2
>>>> v_foo_f32 v1, v0, v1 row_shr:3 // Instruction 3
>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>> v_nop
>>>> v_foo_f32 v1, v1, v1 row_shr:4 bank_mask:0xe // Instruction 4
>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>> v_nop
>>>> v_foo_f32 v1, v1, v1 row_shr:8 bank_mask:0xc // Instruction 5
>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>> v_nop
>>>> v_foo_f32 v1, v1, v1 row_bcast:15 row_mask:0xa // Instruction 6
>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>> v_nop
>>>> v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc // Instruction 7
>>>>
>>>> The problem is that the way these instructions use the DPP word isn't
>>>> currently expressible in LLVM. We have the llvm.amdgcn.mov_dpp
>>>> intrinsic, but it isn't enough. For example, take the first
>>>> instruction:
>>>>
>>>> v_foo_f32 v1, v0, v1 row_shr:1
>>>>
>>>> What it's doing is shifting v0 right by one within each row and adding
>>>> it to v1. v1 stays the same in the first lane of each row, however.
>>>> With llvm.amdgcn.mov_dpp, we could try to express it as something like
>>>> this, in LLVM-like pseduocode:
>>>>
>>>> %tmp = call llvm.amdgcn.mov_dpp %input row_shr:1
>>>> %result = foo %tmp, %input
>>>>
>>>> but this is incorrect. If I'm reading the source correctly, this will
>>>> make %tmp garbage in lane 0 (since it just turns into a normal move
>>>> with the dpp modifier, and no restrictions on the destination). We
>>>> could set bound_ctrl to 0 to work around this, since it will make %tmp
>>>> 0 in lane 0, but that won't work with operations whose identity is
>>>> non-0 like min and max. What we need is something like:
>>>>
>>
>> Why is %tmp garbage?  I thought the two options were 0 (bound_ctrl =0)
>> or %input (bound_ctrl = 1)?
>
> Oh, maybe it is... for that to happen the underlying move would need
> to have the source and destination constrained to be the same. I
> couldn't see that constraint anywhere I looked, but I'm not an expert,
> so I may have overlooked it. In any case, that behavior still isn't
> what we want if we want to implement the prefix scan operations
> efficiently.
>

Ok, I see what you are saying now.  I think the best option here is to
document that the llvm.amdgcn.mov.dpp intrinsic copies its src operand to
dst when bound_ctrl = 1 and the read comes from an invalid thread, and then,
for the bound_ctrl = 1 case, lower the intrinsic to a special tied version
of V_MOV_B32_dpp where the src and dst are the same register.

-Tom

> Connor
>
>>
>> -Tom
>>
>>
>>>> %result = call llvm.amdgcn.foo_dpp %result, %input, %result row_shr:1
>>>>
>>>> where llvm.amdgcn.foo_dpp copies the first argument to the result,
>>>> then applies the DPP swizzling to the second argument and does 'foo'
>>>> to the second and third arguments. It would mean that we'd have a
>>>> separate intrinsic for every operation we care about, but I can't
>>>> think of a better way to express it. Is there a better way that
>>>> doesn't involve creating an intrinsic for each operation?
>>>>
>>>> Next, there's the fact that this code sequence only works when the
>>>> active lanes are densely-packed, but we have to make this work even
>>>> when control flow is non-uniform. Essentially, we need to "skip over"
>>>> the inactive lanes by setting them to the identity, and then we need
>>>> to enable them in the exec mask when doing the reduction to make sure
>>>> they pass along the correct result. That is, to handle non-uniform
>>>> control flow, we need something like:
>>>>
>>>> invert EXEC
>>>> result = identity
>>>> set EXEC to ~0
>>>> <original code sequence>
>>>> restore original EXEC
>>>>
>>>> I imagine we'd need to add some special llvm.amdcgn.set_inactive_lanes
>>>> intrinsic that returns the first argument with inactive lanes set to
>>>> the second argument. We'd also need something like WQM to make all the
>>>> lanes active during the sequence. But that raises some hairy
>>>> requirements for register allocation. For example, in something like:
>>>>
>>>> foo = ...
>>>> if (...) {
>>>>     bar = minInvocationsInclusiveScanAMD(...)
>>>> } else {
>>>>     ... = foo;
>>>> }
>>>>
>>>> we have to make sure that foo isn't allocated to the same register as
>>>> one of the temporaries used inside minInvocationsInclusiveScanAMD(),
>>>> though they don't interfere. That's because the implementation of
>>>> minInvocationsInclusiveScanAMD() will do funny things with the exec
>>>> mask, possibly overwriting foo, if the condition is non-uniform. Or
>>>> consider the following:
>>>>
>>>> do {
>>>>    bar = minInvocationsInclusiveScanAMD(...);
>>>>    // ...
>>>>    ... = bar; // last use of bar
>>>>   foo = ...;
>>>> } while (...);
>>>>
>>>> ... = foo;
>>>>
>>>> again, foo and the temporaries used to compute bar can't be assigned
>>>> to the same register, even though their live ranges don't intersect,
>>>> since minInvocationsInclusiveScanAMD() may overwrite the value of foo
>>>> in a previous iteration if the loop exit condition isn't uniform. How
>>>> can we express this in the backend? I don't know much about the LLVM
>>>> infrastucture, so I'm not sure if it's relatively easy or really hard.
>>>>
>>>> Thanks,
>>>>
>>>> Connor
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> [hidden email]
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>
>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> [hidden email]
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
>>

_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

Alex Bradbury via llvm-dev
On 12.06.2017 23:58, Connor Abbott wrote:

> Hi all,
>
> I've been looking into how to implement the more advanced Shader Model
> 6 reduction operations in radv (and obviously most of the work would
> be useful for radeonsi too). They're explained in the spec for
> GL_AMD_shader_ballot at
> https://www.khronos.org/registry/OpenGL/extensions/AMD/AMD_shader_ballot.txt,
> but I'll summarize them here. There are two types of operations:
> reductions that always return a uniform value, and prefix scan
> operations. The reductions can be implemented in terms of the prefix
> scan (although in practice I don't think we want to implement them in
> exactly the same way), and the concerns are mostly the same, so I'll
> focus on the prefix scan operations for now. Given an operation `op'
> and an input value `a' (that's really a SIMD array with one value per
> invocation, even though it's a scalar value in LLVM), the prefix scan
> returns a[0] in invocation 0, a[0] `op' a[1] in invocation 1, a[0]
> `op' a[1] `op' a[2] in invocation 2, etc. The prefix scan will also
> work for non-uniform control flow: it simply skips inactive
> invocations.
>
> On the LLVM side, I think that we have most of the AMD-specific
> low-level shuffle intrinsics implemented that you need to do this, but
> I can think of a few concerns/questions. First of all, to implement
> the prefix scan, we'll need to do a code sequence that looks like
> this, modified from
> http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ (replace
> v_foo_f32 with the appropriate operation):
>
> ; v0 is the input register
> v_mov_b32 v1, v0
> v_foo_f32 v1, v0, v1 row_shr:1 // Instruction 1
> v_foo_f32 v1, v0, v1 row_shr:2 // Instruction 2
> v_foo_f32 v1, v0, v1 row_shr:3 // Instruction 3
> v_nop // Add two independent instructions to avoid a data hazard
> v_nop
> v_foo_f32 v1, v1, v1 row_shr:4 bank_mask:0xe // Instruction 4
> v_nop // Add two independent instructions to avoid a data hazard
> v_nop
> v_foo_f32 v1, v1, v1 row_shr:8 bank_mask:0xc // Instruction 5
> v_nop // Add two independent instructions to avoid a data hazard
> v_nop
> v_foo_f32 v1, v1, v1 row_bcast:15 row_mask:0xa // Instruction 6
> v_nop // Add two independent instructions to avoid a data hazard
> v_nop
> v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc // Instruction 7
>
> The problem is that the way these instructions use the DPP word isn't
> currently expressible in LLVM. We have the llvm.amdgcn.mov_dpp
> intrinsic, but it isn't enough. For example, take the first
> instruction:
>
> v_foo_f32 v1, v0, v1 row_shr:1
>
> What it's doing is shifting v0 right by one within each row and adding
> it to v1. v1 stays the same in the first lane of each row, however.
> With llvm.amdgcn.mov_dpp, we could try to express it as something like
> this, in LLVM-like pseduocode:
>
> %tmp = call llvm.amdgcn.mov_dpp %input row_shr:1
> %result = foo %tmp, %input
>
> but this is incorrect. If I'm reading the source correctly, this will
> make %tmp garbage in lane 0 (since it just turns into a normal move
> with the dpp modifier, and no restrictions on the destination). We
> could set bound_ctrl to 0 to work around this, since it will make %tmp
> 0 in lane 0, but that won't work with operations whose identity is
> non-0 like min and max. What we need is something like:
>
> %result = call llvm.amdgcn.foo_dpp %result, %input, %result row_shr:1
>
> where llvm.amdgcn.foo_dpp copies the first argument to the result,
> then applies the DPP swizzling to the second argument and does 'foo'
> to the second and third arguments. It would mean that we'd have a
> separate intrinsic for every operation we care about, but I can't
> think of a better way to express it. Is there a better way that
> doesn't involve creating an intrinsic for each operation?
>
> Next, there's the fact that this code sequence only works when the
> active lanes are densely-packed, but we have to make this work even
> when control flow is non-uniform.
 >

> Essentially, we need to "skip over"
> the inactive lanes by setting them to the identity, and then we need
> to enable them in the exec mask when doing the reduction to make sure
> they pass along the correct result. That is, to handle non-uniform
> control flow, we need something like:
>
> invert EXEC
> result = identity
> set EXEC to ~0
> <original code sequence>
> restore original EXEC

Yeah, this is going to be a pain, mostly in how it could interact with
register spilling.


> I imagine we'd need to add some special llvm.amdcgn.set_inactive_lanes
> intrinsic that returns the first argument with inactive lanes set to
> the second argument. We'd also need something like WQM to make all the
> lanes active during the sequence. But that raises some hairy
> requirements for register allocation. For example, in something like:
>
> foo = ...
> if (...) {
>      bar = minInvocationsInclusiveScanAMD(...)
> } else {
>      ... = foo;
> }
>
> we have to make sure that foo isn't allocated to the same register as
> one of the temporaries used inside minInvocationsInclusiveScanAMD(),
> though they don't interfere. That's because the implementation of
> minInvocationsInclusiveScanAMD() will do funny things with the exec
> mask, possibly overwriting foo, if the condition is non-uniform. Or
> consider the following:
>
> do {
>     bar = minInvocationsInclusiveScanAMD(...);
>     // ...
>     ... = bar; // last use of bar
>    foo = ...;
> } while (...);
>
> ... = foo;
>
> again, foo and the temporaries used to compute bar can't be assigned
> to the same register, even though their live ranges don't intersect,
> since minInvocationsInclusiveScanAMD() may overwrite the value of foo
> in a previous iteration if the loop exit condition isn't uniform. How
> can we express this in the backend? I don't know much about the LLVM
> infrastucture, so I'm not sure if it's relatively easy or really hard.

The actual register allocation is probably comparatively harmless. The
register allocator runs long after control flow structurization, which
means that in your examples above, the register allocator actually sees
that the lifetimes of the relevant variables overlap. Specifically, the
basic-block structure in your first if-based example is actually:

    foo = ...
    cbranch merge
    ; fall-through

if:
    bar = ...
    ; fall-through

merge:
    cbranch endif
    ; fall-through

else:
    ... = foo
    ; fall-through

endif:

... and so foo and bar will not be assigned the same register.

Again, where it gets hairy is with spilling, because the spiller is
hopelessly lost when it comes to understanding EXEC masks...

Cheers,
Nicolai


> Thanks,
>
> Connor
>


--
Learn how the world really is,
But never forget how it ought to be.
_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

Alex Bradbury via llvm-dev
On Mon, Jun 12, 2017 at 11:23 PM, Nicolai Hähnle <[hidden email]> wrote:

> On 12.06.2017 23:58, Connor Abbott wrote:
>>
>> Next, there's the fact that this code sequence only works when the
>> active lanes are densely-packed, but we have to make this work even
>> when control flow is non-uniform.
>
>>
>>
>> Essentially, we need to "skip over"
>> the inactive lanes by setting them to the identity, and then we need
>> to enable them in the exec mask when doing the reduction to make sure
>> they pass along the correct result. That is, to handle non-uniform
>> control flow, we need something like:
>>
>> invert EXEC
>> result = identity
>> set EXEC to ~0
>> <original code sequence>
>> restore original EXEC
>
>
> Yeah, this is going to be a pain, mostly in how it could interact with
> register spilling.
>
>
>
>> I imagine we'd need to add some special llvm.amdcgn.set_inactive_lanes
>> intrinsic that returns the first argument with inactive lanes set to
>> the second argument. We'd also need something like WQM to make all the
>> lanes active during the sequence. But that raises some hairy
>> requirements for register allocation. For example, in something like:
>>
>> foo = ...
>> if (...) {
>>      bar = minInvocationsInclusiveScanAMD(...)
>> } else {
>>      ... = foo;
>> }
>>
>> we have to make sure that foo isn't allocated to the same register as
>> one of the temporaries used inside minInvocationsInclusiveScanAMD(),
>> though they don't interfere. That's because the implementation of
>> minInvocationsInclusiveScanAMD() will do funny things with the exec
>> mask, possibly overwriting foo, if the condition is non-uniform. Or
>> consider the following:
>>
>> do {
>>     bar = minInvocationsInclusiveScanAMD(...);
>>     // ...
>>     ... = bar; // last use of bar
>>    foo = ...;
>> } while (...);
>>
>> ... = foo;
>>
>> again, foo and the temporaries used to compute bar can't be assigned
>> to the same register, even though their live ranges don't intersect,
>> since minInvocationsInclusiveScanAMD() may overwrite the value of foo
>> in a previous iteration if the loop exit condition isn't uniform. How
>> can we express this in the backend? I don't know much about the LLVM
>> infrastucture, so I'm not sure if it's relatively easy or really hard.
>
>
> The actual register allocation is probably comparatively harmless. The
> register allocator runs long after control flow structurization, which means
> that in your examples above, the register allocator actually sees that the
> lifetimes of the relevant variables overlap. Specifically, the basic-block
> structure in your first if-based example is actually:
>
>    foo = ...
>    cbranch merge
>    ; fall-through
>
> if:
>    bar = ...
>    ; fall-through
>
> merge:
>    cbranch endif
>    ; fall-through
>
> else:
>    ... = foo
>    ; fall-through
>
> endif:
>
> ... and so foo and bar will not be assigned the same register.

What about my second example? There, the expanded control flow should
be the same as the original control flow, yet we still have the
problem AFAICT.

Also, does this mean that the registers will interfere even if they're
only ever written with non-overlapping exec masks? That seems like a
shame, since in the vast majority of cases you're going to add
artificial extra register pressure and constrain the register
allocator more than necessary...

>
> Again, where it gets hairy is with spilling, because the spiller is
> hopelessly lost when it comes to understanding EXEC masks...

Ok, I'll look into it. Thanks for the hints.

>
> Cheers,
> Nicolai
>
>
>> Thanks,
>>
>> Connor
>>
>
>
> --
> Learn how the world really is,
> But never forget how it ought to be.
_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

Alex Bradbury via llvm-dev

On Jun 12, 2017, at 17:23, Tom Stellard <[hidden email]> wrote:

On 06/12/2017 08:03 PM, Connor Abbott wrote:
On Mon, Jun 12, 2017 at 4:56 PM, Tom Stellard <[hidden email]> wrote:
On 06/12/2017 07:15 PM, Tom Stellard via llvm-dev wrote:
cc some people who have worked on this.

On 06/12/2017 05:58 PM, Connor Abbott via llvm-dev wrote:
Hi all,

I've been looking into how to implement the more advanced Shader Model
6 reduction operations in radv (and obviously most of the work would
be useful for radeonsi too). They're explained in the spec for
GL_AMD_shader_ballot at
https://www.khronos.org/registry/OpenGL/extensions/AMD/AMD_shader_ballot.txt,
but I'll summarize them here. There are two types of operations:
reductions that always return a uniform value, and prefix scan
operations. The reductions can be implemented in terms of the prefix
scan (although in practice I don't think we want to implement them in
exactly the same way), and the concerns are mostly the same, so I'll
focus on the prefix scan operations for now. Given an operation `op'
and an input value `a' (that's really a SIMD array with one value per
invocation, even though it's a scalar value in LLVM), the prefix scan
returns a[0] in invocation 0, a[0] `op' a[1] in invocation 1, a[0]
`op' a[1] `op' a[2] in invocation 2, etc. The prefix scan will also
work for non-uniform control flow: it simply skips inactive
invocations.

On the LLVM side, I think that we have most of the AMD-specific
low-level shuffle intrinsics implemented that you need to do this, but
I can think of a few concerns/questions. First of all, to implement
the prefix scan, we'll need to do a code sequence that looks like
this, modified from
http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ (replace
v_foo_f32 with the appropriate operation):

; v0 is the input register
v_mov_b32 v1, v0
v_foo_f32 v1, v0, v1 row_shr:1 // Instruction 1
v_foo_f32 v1, v0, v1 row_shr:2 // Instruction 2
v_foo_f32 v1, v0, v1 row_shr:3 // Instruction 3
v_nop // Add two independent instructions to avoid a data hazard
v_nop
v_foo_f32 v1, v1, v1 row_shr:4 bank_mask:0xe // Instruction 4
v_nop // Add two independent instructions to avoid a data hazard
v_nop
v_foo_f32 v1, v1, v1 row_shr:8 bank_mask:0xc // Instruction 5
v_nop // Add two independent instructions to avoid a data hazard
v_nop
v_foo_f32 v1, v1, v1 row_bcast:15 row_mask:0xa // Instruction 6
v_nop // Add two independent instructions to avoid a data hazard
v_nop
v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc // Instruction 7

The problem is that the way these instructions use the DPP word isn't
currently expressible in LLVM. We have the llvm.amdgcn.mov_dpp
intrinsic, but it isn't enough. For example, take the first
instruction:

v_foo_f32 v1, v0, v1 row_shr:1

What it's doing is shifting v0 right by one within each row and adding
it to v1. v1 stays the same in the first lane of each row, however.
With llvm.amdgcn.mov_dpp, we could try to express it as something like
this, in LLVM-like pseduocode:

%tmp = call llvm.amdgcn.mov_dpp %input row_shr:1
%result = foo %tmp, %input

but this is incorrect. If I'm reading the source correctly, this will
make %tmp garbage in lane 0 (since it just turns into a normal move
with the dpp modifier, and no restrictions on the destination). We
could set bound_ctrl to 0 to work around this, since it will make %tmp
0 in lane 0, but that won't work with operations whose identity is
non-0 like min and max. What we need is something like:


Why is %tmp garbage?  I thought the two options were 0 (bound_ctrl =0)
or %input (bound_ctrl = 1)?

Oh, maybe it is... for that to happen the underlying move would need
to have the source and destination constrained to be the same. I
couldn't see that constraint anywhere I looked, but I'm not an expert,
so I may have overlooked it. In any case, that behavior still isn't
what we want if we want to implement the prefix scan operations
efficiently.


Ok, I see what you are saying now.  I think the best option here is to
document that the behavior of the llvm.amdgcn.mov.dpp intrinsic is to
copy its src operand to dst when bound_ctrl = 1 and it reads from an invalid
thread, and then when bound_ctrl=1, lower the intrinsic to a special tied version
of V_MOV_B32_dpp where the src and dst are the same register.

-Tom

This came up before: there’s no way to represent the unmodified input register for the inactive lanes. I think the conclusion was that a new intrinsic is needed to represent this case, but I don’t think there was a consensus on what it should look like.
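
For illustration only (not an agreed-upon design), one possible shape would
be an intrinsic that takes an extra "old" operand supplying the value for
lanes whose DPP read is invalid or disabled:

; Hypothetical: %old is returned in lanes that read from an invalid or
; disabled thread, instead of 0 (bound_ctrl) or an unspecified value.
declare i32 @llvm.amdgcn.update.dpp.i32(i32 %old, i32 %src, i32 %dpp_ctrl,
                                        i32 %row_mask, i32 %bank_mask,
                                        i1 %bound_ctrl)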

-Matt


_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

Alex Bradbury via llvm-dev
On 06/13/2017 07:33 PM, Matt Arsenault wrote:

>
>> On Jun 12, 2017, at 17:23, Tom Stellard <[hidden email] <mailto:[hidden email]>> wrote:
>>
>> On 06/12/2017 08:03 PM, Connor Abbott wrote:
>>> On Mon, Jun 12, 2017 at 4:56 PM, Tom Stellard <[hidden email] <mailto:[hidden email]>> wrote:
>>>> On 06/12/2017 07:15 PM, Tom Stellard via llvm-dev wrote:
>>>>> cc some people who have worked on this.
>>>>>
>>>>> On 06/12/2017 05:58 PM, Connor Abbott via llvm-dev wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> I've been looking into how to implement the more advanced Shader Model
>>>>>> 6 reduction operations in radv (and obviously most of the work would
>>>>>> be useful for radeonsi too). They're explained in the spec for
>>>>>> GL_AMD_shader_ballot at
>>>>>> https://www.khronos.org/registry/OpenGL/extensions/AMD/AMD_shader_ballot.txt,
>>>>>> but I'll summarize them here. There are two types of operations:
>>>>>> reductions that always return a uniform value, and prefix scan
>>>>>> operations. The reductions can be implemented in terms of the prefix
>>>>>> scan (although in practice I don't think we want to implement them in
>>>>>> exactly the same way), and the concerns are mostly the same, so I'll
>>>>>> focus on the prefix scan operations for now. Given an operation `op'
>>>>>> and an input value `a' (that's really a SIMD array with one value per
>>>>>> invocation, even though it's a scalar value in LLVM), the prefix scan
>>>>>> returns a[0] in invocation 0, a[0] `op' a[1] in invocation 1, a[0]
>>>>>> `op' a[1] `op' a[2] in invocation 2, etc. The prefix scan will also
>>>>>> work for non-uniform control flow: it simply skips inactive
>>>>>> invocations.
>>>>>>
>>>>>> On the LLVM side, I think that we have most of the AMD-specific
>>>>>> low-level shuffle intrinsics implemented that you need to do this, but
>>>>>> I can think of a few concerns/questions. First of all, to implement
>>>>>> the prefix scan, we'll need to do a code sequence that looks like
>>>>>> this, modified from
>>>>>> http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ (replace
>>>>>> v_foo_f32 with the appropriate operation):
>>>>>>
>>>>>> ; v0 is the input register
>>>>>> v_mov_b32 v1, v0
>>>>>> v_foo_f32 v1, v0, v1 row_shr:1 // Instruction 1
>>>>>> v_foo_f32 v1, v0, v1 row_shr:2 // Instruction 2
>>>>>> v_foo_f32 v1, v0, v1 row_shr:3 // Instruction 3
>>>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>>>> v_nop
>>>>>> v_foo_f32 v1, v1, v1 row_shr:4 bank_mask:0xe // Instruction 4
>>>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>>>> v_nop
>>>>>> v_foo_f32 v1, v1, v1 row_shr:8 bank_mask:0xc // Instruction 5
>>>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>>>> v_nop
>>>>>> v_foo_f32 v1, v1, v1 row_bcast:15 row_mask:0xa // Instruction 6
>>>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>>>> v_nop
>>>>>> v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc // Instruction 7
>>>>>>
>>>>>> The problem is that the way these instructions use the DPP word isn't
>>>>>> currently expressible in LLVM. We have the llvm.amdgcn.mov_dpp
>>>>>> intrinsic, but it isn't enough. For example, take the first
>>>>>> instruction:
>>>>>>
>>>>>> v_foo_f32 v1, v0, v1 row_shr:1
>>>>>>
>>>>>> What it's doing is shifting v0 right by one within each row and adding
>>>>>> it to v1. v1 stays the same in the first lane of each row, however.
>>>>>> With llvm.amdgcn.mov_dpp, we could try to express it as something like
>>>>>> this, in LLVM-like pseduocode:
>>>>>>
>>>>>> %tmp = call llvm.amdgcn.mov_dpp %input row_shr:1
>>>>>> %result = foo %tmp, %input
>>>>>>
>>>>>> but this is incorrect. If I'm reading the source correctly, this will
>>>>>> make %tmp garbage in lane 0 (since it just turns into a normal move
>>>>>> with the dpp modifier, and no restrictions on the destination). We
>>>>>> could set bound_ctrl to 0 to work around this, since it will make %tmp
>>>>>> 0 in lane 0, but that won't work with operations whose identity is
>>>>>> non-0 like min and max. What we need is something like:
>>>>>>
>>>>
>>>> Why is %tmp garbage?  I thought the two options were 0 (bound_ctrl =0)
>>>> or %input (bound_ctrl = 1)?
>>>
>>> Oh, maybe it is... for that to happen the underlying move would need
>>> to have the source and destination constrained to be the same. I
>>> couldn't see that constraint anywhere I looked, but I'm not an expert,
>>> so I may have overlooked it. In any case, that behavior still isn't
>>> what we want if we want to implement the prefix scan operations
>>> efficiently.
>>>
>>
>> Ok, I see what you are saying now.  I think the best option here is to
>> document that the behavior of the llvm.amdgcn.mov.dpp intrinsic is to
>> copy its src operand to dst when bound_ctrl = 1 and it reads from an invalid
>> thread, and then when bound_ctrl=1, lower the intrinsic to a special tied version
>> of V_MOV_B32_dpp where the src and dst are the same register.
>>
>> -Tom
>
> This came up before that there’s no way to represent the unmodified input register for the inactive lanes. I think the conclusion was that a new intrinsic is needed to represent this case but I don’t think there was a consensus on what it should look like.
>

I think re-defining the existing intrinsic will be easier than adding a new
one.  Also, if we add a new one, llvm.amdgcn.mov.dpp will still be broken
for the bound_ctrl=1 case, which isn't really ideal.

-Tom

> -Matt
>

_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

Alex Bradbury via llvm-dev
Sorry about the formatting...

Anyway, I think there may be a misinterpretation of bound_cntl.  My understanding is that:

0 => if the source is invalid or disabled, do not write a result
1 => if the source is invalid or disabled, use a 0 instead

So the problematic case is where bound_cntl is 0, not when it is 1.


Re: [llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

Alex Bradbury via llvm-dev
On 06/14/2017 10:33 AM, Sumner, Brian wrote:
> Sorry about the formatting...
>
> Anyway, I think there may be a misinterpretation of bound_cntl.  My understanding is that:
>
> 0 => if the source is invalid or disabled, do not write a result
> 1 => if the source is invalid or disabled, use a 0 instead
>
> So the problematic case is where bound_cntl is 0, not when it is 1.
>

Yes, you are right, I had that mixed up.

-Tom

Re: [llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

Alex Bradbury via llvm-dev
On Tue, Jun 13, 2017 at 6:13 PM, Tom Stellard <[hidden email]> wrote:

> I think re-defining the existing intrinsic will be easier than adding a new
> one.  Also, if we add a new one, llvm.amdgcn.mov.dpp will still be broken
> for the bound_ctrl=1 case, which isn't really ideal.

You can't re-define the existing intrinsic. The problem is that a move
with DPP doesn't just write to its destination register - it modifies
it, i.e. it reads and then writes to it. You can't express that with
SSA, since you can only ever write to an SSA value once. I think what
you want is an intrinsic, say llvm.amdgcn.update.dpp (dunno what you'd
call it?) that takes the value to use if the read is invalid. That is,
something like:

%new = i32 llvm.amdgcn.update.dpp %old, %update, (dpp control)

would get turned into:

v_mov_b32 %new, %old
v_mov_b32 %new, %update (dpp control)

And hopefully coalescing will remove the first copy. Think of it as
analogous to lowering LLVM's three-address form for e.g. ADD to the x86
two-address form. If you wanted to get fancy, you could also recognize
the case where the old value is zero and turn that into a DPP move
with bound_ctrl = 1.
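
In other words (same made-up syntax as above), something like

%new = i32 llvm.amdgcn.update.dpp i32 0, %update, (dpp control)

could be selected directly to

v_mov_b32 %new, %update (dpp control) bound_ctrl:1 // invalid lanes read 0, so no tied copy of %old is needed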

Also, remember that the whole point of this was to be able to express
stuff like:

v_min_f32 v1, v0, v1 (dpp control)

where you take the minimum of v1 and the swizzled v0, except where you
would've read an invalid lane for v0, you read the old value for v1
instead. For operations like add and iadd where the identity is 0, you
can set bound_ctrl = 1, and then the optimizer can safely fold the
v_mov_b32 into the operation itself. That is, you'd do:

%swizzled = i32 llvm.amdgcn.update.dpp i32 0, %update, (dpp control)
%new = i32 add %swizzled, %old

and after coalescing, register allocation, etc. the backend would turn
that into:

v_add_i32 v1, v0, v1 (dpp control) bound_ctrl:1

which is functionally equivalent to the version without the bound_ctrl.
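
Just to make that concrete, the whole scan sequence from the top of the thread might come out roughly like this for an integer add (pure sketch: the call syntax is made up, the dpp controls are written symbolically like in the assembly, and I'm assuming lanes disabled by row_mask/bank_mask get the first operand back just like invalid reads do):

; %a is this lane's input; the identity for add is 0, so 0 is the "old" value everywhere
%t1 = i32 llvm.amdgcn.update.dpp i32 0, %a, row_shr:1
%s1 = i32 add %t1, %a
%t2 = i32 llvm.amdgcn.update.dpp i32 0, %a, row_shr:2
%s2 = i32 add %t2, %s1
%t3 = i32 llvm.amdgcn.update.dpp i32 0, %a, row_shr:3
%s3 = i32 add %t3, %s2
%t4 = i32 llvm.amdgcn.update.dpp i32 0, %s3, row_shr:4 bank_mask:0xe
%s4 = i32 add %t4, %s3
%t5 = i32 llvm.amdgcn.update.dpp i32 0, %s4, row_shr:8 bank_mask:0xc
%s5 = i32 add %t5, %s4
%t6 = i32 llvm.amdgcn.update.dpp i32 0, %s5, row_bcast:15 row_mask:0xa
%s6 = i32 add %t6, %s5
%t7 = i32 llvm.amdgcn.update.dpp i32 0, %s6, row_bcast:31 row_mask:0xc
%s7 = i32 add %t7, %s6 ; %s7 is the inclusive scan result for this lane

and hopefully each update.dpp/add pair gets folded back into a single v_add_i32 with the dpp modifier and bound_ctrl:1, which is essentially the original sequence minus the initial v_mov (ignoring the nops for the data hazards, which would still have to be inserted).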

Otherwise, for operations `op' where a `op' a == a, you can do something like:

%swizzled = f32 llvm.amdgcn.update.dpp %old, %update, (dpp control)
%new = f32 fmin %swizzled, %old

and then if the optimizer is clever enough, it can also fold it into
one instruction since the invalid lanes will get their old values
back. I think that covers all the cases we care about. The question is
whether it's better to take that route, or whether it's better to just
add intrinsics like llvm.amdgcn.add_dpp, llvm.amdgcn.fmin_dpp, etc. so
that:

%new = f32 llvm.amdgcn.fmin_dpp %old, %src0, %src1 (dpp_control)

turns into:

v_mov_b32 %new, %old
v_min_f32 %new, %src0, %src1 (dpp_control)

and then we can get what we want directly, without having to do much
optimization except for coalescing that already exists. The downside
is that we'd have to add a lot more intrinsics (one for min, max,
fmin, fmax, add, and iadd).

>
> -Tom
>
>> -Matt
>>
>

Re: [llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

Alex Bradbury via llvm-dev
On 06/14/2017 05:05 PM, Connor Abbott wrote:

> You can't re-define the existing intrinsic. The problem is that a move
> with DPP doesn't just write to its destination register - it modifies
> it, i.e. it reads and then writes to it. You can't express that with
> SSA, since you can only ever write to an SSA value once. I think what
> you want is an intrinsic, say llvm.amdgcn.update.dpp (dunno what you'd
> call it?) that takes the value to use if the read is invalid. That is,
> something like:
>
> %new = i32 llvm.amdgcn.update.dpp %old, %update, (dpp control)
>

This is a more general form of what I'm suggesting.  If you re-define
(when I say re-define, I just mean document this behavior)
llvm.amdgcn.mov.dpp, such that when bound_ctrl = 0 the value
of %src will be written to %dst for invalid lanes, then this will
have the same effect as llvm.amdgcn.update.dpp when %old = %update.

I can see the usefulness now of having this extra intrinsic,
but I still think it would be good to fix llvm.amdgcn.mov.dpp.
At the MachineIR level, I think you could handle the new intrinsic
and llvm.amdgcn.mov.dpp with bound_ctrl = 0 with the same
pseudo instructions, so fixing llvm.amdgcn.mov.dpp may not
be so hard if you are adding the new intrinsic.
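
In other words, using the notation from your mail, for the bound_ctrl = 0 case the fixed intrinsic would just be the special case:

%r  = i32 llvm.amdgcn.mov.dpp %src, (dpp control)       ; with the documented copy-src-for-invalid-lanes behavior
; would give the same per-lane result as
%r2 = i32 llvm.amdgcn.update.dpp %src, %src, (dpp control)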


> Otherwise, for operations `op' where a `op' a == a, you can do something like:
>
> %swizzled = f32 llvm.amdgcn.update.dpp %old, %update, (dpp control)
> %new = f32 fmin %swizzled, %old
>
> and then if the optimizer is clever enough, it can also fold it into
> one instruction since the invalid lanes will get their old values
> back. I think that covers all the cases we care about. The question is
> whether it's better to take that route, or whether it's better to just
> add intrinsics like llvm.amdgcn.add_dpp, llvm.amdgcn.fmin_dpp, etc. so
> that:
>

I would only go the route of adding a dpp intrinsic for every operation
if it gives you functionality that you can't get with the llvm.amdgcn.update.dpp.
The main reason for this is that you will lose the generic combines that
LLVM has for add, min, etc.


-Tom


Re: [llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

Alex Bradbury via llvm-dev
On Wed, Jun 14, 2017 at 5:23 PM, Tom Stellard <[hidden email]> wrote:

> This is a more general form of what I'm suggesting.  If you re-define
> (when I say re-define, I just mean document this behavior)
> llvm.amdgcn.mov.dpp, such that when bound_ctrl = 0 the value
> of %src will be written to %dst for invalid lanes, then this will
> have the same effect as llvm.amdgcn.update.dpp when %old = %update.
>
> I can see the usefulness now of having this extra intrinsic,
> but I still think it would be good to fix llvm.amdgcn.mov.dpp.
> At the MachineIR level, I think you could handle the new intrinsic
> and llvm.amdgcn.mov.dpp with bound_ctrl = 0 with the same
> pseudo instructions, so fixing llvm.amdgcn.mov.dpp may not
> be so hard if you are adding the new intrinsic.

Ah, ok. Still, we're going to need the full llvm.amdgcn.update.dpp to
generate the best possible code sequence, and it's a strict superset
of llvm.amdgcn.mov.dpp (both the old and fixed versions), so I'm
inclined to not bother fixing it and just deprecate it instead.

> I would only go the route of adding a dpp intrinsic for every operation
> if it gives you functionality that you can't get with the llvm.amdgcn.update.dpp.
> The main reason for this is that you will lose the generic combines that
> LLVM has for add, min, etc.

Ok, good point. I think this is only going to be used in a few
specific scenarios, but I can see stuff like fusing add + mul into mad
being useful.
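
For example (purely illustrative, and modulo the usual fast-math/contract rules), if the value being accumulated into happens to come from a multiply, the generic mul+add matching can still see the plain fadd:

%swizzled = f32 llvm.amdgcn.update.dpp f32 0.0, %update, (dpp control)
%old      = f32 fmul %x, %y
%new      = f32 fadd %swizzled, %old ; ordinary fadd, so it can still be folded into a v_mad_f32

whereas with an llvm.amdgcn.add_dpp intrinsic the add would be hidden inside the call and none of the existing combines or patterns would see it.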


Re: [llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

Alex Bradbury via llvm-dev
I'm wondering about the focus on bound_cntl.  Any cleared bit in the row_mask or bank_mask will also disable updating the result.
Brian

-----Original Message-----
From: Connor Abbott [mailto:[hidden email]]
Sent: Wednesday, June 14, 2017 6:13 PM
To: [hidden email]
Cc: Matt Arsenault; [hidden email]; Kolton, Sam; Sumner, Brian; Pykhtin, Valery
Subject: Re: [llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

On Wed, Jun 14, 2017 at 5:23 PM, Tom Stellard <[hidden email]> wrote:

> On 06/14/2017 05:05 PM, Connor Abbott wrote:
>> On Tue, Jun 13, 2017 at 6:13 PM, Tom Stellard <[hidden email]> wrote:
>>> On 06/13/2017 07:33 PM, Matt Arsenault wrote:
>>>>
>>>>> On Jun 12, 2017, at 17:23, Tom Stellard <[hidden email] <mailto:[hidden email]>> wrote:
>>>>>
>>>>> On 06/12/2017 08:03 PM, Connor Abbott wrote:
>>>>>> On Mon, Jun 12, 2017 at 4:56 PM, Tom Stellard <[hidden email] <mailto:[hidden email]>> wrote:
>>>>>>> On 06/12/2017 07:15 PM, Tom Stellard via llvm-dev wrote:
>>>>>>>> cc some people who have worked on this.
>>>>>>>>
>>>>>>>> On 06/12/2017 05:58 PM, Connor Abbott via llvm-dev wrote:
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I've been looking into how to implement the more advanced
>>>>>>>>> Shader Model
>>>>>>>>> 6 reduction operations in radv (and obviously most of the work
>>>>>>>>> would be useful for radeonsi too). They're explained in the
>>>>>>>>> spec for GL_AMD_shader_ballot at
>>>>>>>>> https://www.khronos.org/registry/OpenGL/extensions/AMD/AMD_sha
>>>>>>>>> der_ballot.txt, but I'll summarize them here. There are two
>>>>>>>>> types of operations:
>>>>>>>>> reductions that always return a uniform value, and prefix scan
>>>>>>>>> operations. The reductions can be implemented in terms of the
>>>>>>>>> prefix scan (although in practice I don't think we want to
>>>>>>>>> implement them in exactly the same way), and the concerns are
>>>>>>>>> mostly the same, so I'll focus on the prefix scan operations for now. Given an operation `op'
>>>>>>>>> and an input value `a' (that's really a SIMD array with one
>>>>>>>>> value per invocation, even though it's a scalar value in
>>>>>>>>> LLVM), the prefix scan returns a[0] in invocation 0, a[0] `op'
>>>>>>>>> a[1] in invocation 1, a[0] `op' a[1] `op' a[2] in invocation
>>>>>>>>> 2, etc. The prefix scan will also work for non-uniform control
>>>>>>>>> flow: it simply skips inactive invocations.
>>>>>>>>>
>>>>>>>>> On the LLVM side, I think that we have most of the
>>>>>>>>> AMD-specific low-level shuffle intrinsics implemented that you
>>>>>>>>> need to do this, but I can think of a few concerns/questions.
>>>>>>>>> First of all, to implement the prefix scan, we'll need to do a
>>>>>>>>> code sequence that looks like this, modified from
>>>>>>>>> http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ 
>>>>>>>>> (replace
>>>>>>>>> v_foo_f32 with the appropriate operation):
>>>>>>>>>
>>>>>>>>> ; v0 is the input register
>>>>>>>>> v_mov_b32 v1, v0
>>>>>>>>> v_foo_f32 v1, v0, v1 row_shr:1 // Instruction 1
>>>>>>>>> v_foo_f32 v1, v0, v1 row_shr:2 // Instruction 2
>>>>>>>>> v_foo_f32 v1, v0, v1 row_shr:3/ / Instruction 3 v_nop // Add
>>>>>>>>> two independent instructions to avoid a data hazard v_nop
>>>>>>>>> v_foo_f32 v1, v1, v1 row_shr:4 bank_mask:0xe // Instruction 4
>>>>>>>>> v_nop // Add two independent instructions to avoid a data
>>>>>>>>> hazard v_nop
>>>>>>>>> v_foo_f32 v1, v1, v1 row_shr:8 bank_mask:0xc // Instruction 5
>>>>>>>>> v_nop // Add two independent instructions to avoid a data
>>>>>>>>> hazard v_nop
>>>>>>>>> v_foo_f32 v1, v1, v1 row_bcast:15 row_mask:0xa // Instruction
>>>>>>>>> 6 v_nop // Add two independent instructions to avoid a data
>>>>>>>>> hazard v_nop
>>>>>>>>> v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc // Instruction
>>>>>>>>> 7
>>>>>>>>>
>>>>>>>>> The problem is that the way these instructions use the DPP
>>>>>>>>> word isn't currently expressible in LLVM. We have the
>>>>>>>>> llvm.amdgcn.mov_dpp intrinsic, but it isn't enough. For
>>>>>>>>> example, take the first
>>>>>>>>> instruction:
>>>>>>>>>
>>>>>>>>> v_foo_f32 v1, v0, v1 row_shr:1
>>>>>>>>>
>>>>>>>>> What it's doing is shifting v0 right by one within each row
>>>>>>>>> and adding it to v1. v1 stays the same in the first lane of each row, however.
>>>>>>>>> With llvm.amdgcn.mov_dpp, we could try to express it as
>>>>>>>>> something like this, in LLVM-like pseduocode:
>>>>>>>>>
>>>>>>>>> %tmp = call llvm.amdgcn.mov_dpp %input row_shr:1 %result = foo
>>>>>>>>> %tmp, %input
>>>>>>>>>
>>>>>>>>> but this is incorrect. If I'm reading the source correctly,
>>>>>>>>> this will make %tmp garbage in lane 0 (since it just turns
>>>>>>>>> into a normal move with the dpp modifier, and no restrictions
>>>>>>>>> on the destination). We could set bound_ctrl to 0 to work
>>>>>>>>> around this, since it will make %tmp
>>>>>>>>> 0 in lane 0, but that won't work with operations whose
>>>>>>>>> identity is
>>>>>>>>> non-0 like min and max. What we need is something like:
>>>>>>>>>
>>>>>>>
>>>>>>> Why is %tmp garbage?  I thought the two options were 0
>>>>>>> (bound_ctrl =0) or %input (bound_ctrl = 1)?
>>>>>>
>>>>>> Oh, maybe it is... for that to happen the underlying move would
>>>>>> need to have the source and destination constrained to be the
>>>>>> same. I couldn't see that constraint anywhere I looked, but I'm
>>>>>> not an expert, so I may have overlooked it. In any case, that
>>>>>> behavior still isn't what we want if we want to implement the
>>>>>> prefix scan operations efficiently.
>>>>>>
>>>>>
>>>>> Ok, I see what you are saying now.  I think the best option here
>>>>> is to document that the behavior of the llvm.amdgcn.mov.dpp
>>>>> intrinsic is to copy its src operand to dst when bound_ctrl = 1
>>>>> and it reads from an invalid thread, and then when bound_ctrl=1,
>>>>> lower the intrinsic to a special tied version of V_MOV_B32_dpp where the src and dst are the same register.
>>>>>
>>>>> -Tom
>>>>
>>>> This came up before that there’s no way to represent the unmodified input register for the inactive lanes. I think the conclusion was that a new intrinsic is needed to represent this case but I don’t think there was a consensus on what it should look like.
>>>>
>>>
>>> I think re-defining the existing intrinsic will be easier than
>>> adding a new one.  Also, if we add a new one, llvm.amdgcn.mov.dpp
>>> will still be broken for the bound_ctrl=1 case, which isn't really ideal.
>>
>> You can't re-define the existing intrinsic. The problem is that a
>> move with DPP doesn't just write to its destination register - it
>> modifies it, i.e. it reads and then writes to it. You can't express
>> that with SSA, since you can only ever write to an SSA value once. I
>> think what you want is an intrinsic, say llvm.amdgcn.update.dpp
>> (dunno what you'd call it?) that takes the value to use if the read
>> is invalid. That is, something like:
>>
>> %new = i32 llvm.amdgcn.update.dpp %old, %update, (dpp control)
>>
>
> This is a more general form of what I'm suggesting.  If you re-define
> (when I say re-define, I just mean document this behavior)
> llvm.amdgcn.mov.dpp, such that when bound_ctrl = 0 the value of %src
> will be written to %dst for invalid lanes, then this will have the
> same effect as llvm.amdgcn.update.dpp when %old = %update.
>
> I can see the usefulness now of having this extra intrinsic, but I
> still think it would be good to fix llvm.amdgcn.mov.dpp.
> At the MachineIR level, I think you could handle the new intrinsic and
> llvm.amdgcn.mov.dpp with bound_ctrl = 0 with the same pseudo
> instructions, so fixing llvm.amdgcn.mov.dpp may not be so hard if you
> are adding the new intrinsic.

Ah, ok. Still, we're going to need the full llvm.amdgcn.update.dpp to generate the best possible code sequence, and it's a strict superset of llvm.amdgcn.mov.dpp (both the old and fixed versions), so I'm inclined to not bother fixing it and just deprecate it instead.

>
>
>> would get turned into:
>>
>> v_mov_b32 %new, %old
>> v_mov_b32 %new, %update (dpp control)
>>
>> And hopefully coalescing will remove the first copy. Think of it as
>> like lowering LLVM's three-address form for e.g. ADD to the x86
>> two-address form. If you wanted to get fancy, you could also
>> recognize the case where the old value is zero and turn that into a
>> DPP move with bound_ctrl = 1.
>>
>> Also, remember that the whole point of this was to be able to express
>> stuff like:
>>
>> v_min_f32 v1, v0, v1 (dpp control)
>>
>> where you take the minimum of v1 and the swizzled v0, except where
>> you would've read an invalid lane for v0, you read the old value for
>> v1 instead. For operations like add and iadd where the identity is 0,
>> you can set bound_ctrl = 1, and then the optimizer can safely fold
>> the
>> v_mov_b32 into the operation itself. That is, you'd do:
>>
>> %swizzled = i32 llvm.amdgcn.update.dpp i32 0, %update, (dpp control)
>> %new = i32 add %swizzled, %old
>>
>> and after coalescing, register allocation, etc. the backend would
>> turn that into:
>>
>> v_add_i32 v1, v0, v1 (dpp control) bound_ctrl:1
>>
>> which is functionally equivalent to the version without the bound_ctrl.
>>
>> Otherwise, for operations `op' where a `op' a == a, you can do something like:
>>
>> %swizzled = f32 llvm.amdgcn.update.dpp %old, %update, (dpp control)
>> %new = f32 fmin %swizzled, %old
>>
>> and then if the optimizer is clever enough, it can also fold it into
>> one instruction since the invalid lanes will get their old values
>> back. I think that covers all the cases we care about. The question
>> is whether it's better to take that route, or whether it's better to
>> just add intrinsics like llvm.amdgcn.add_dpp, llvm.amdgcn.fmin_dpp,
>> etc. so
>> that:
>>
>
> I would only go the route of adding a dpp intrinsic for every
> operation if it gives you functionality that you can't get with the llvm.amdgcn.update.dpp.
> The main reason for this is that you will lose the generic combines
> that LLVM has for add, min, etc.

Ok, good point. I think this is only going to be used in a few specific scenarios, but I can see stuff like fusing add + mul into mad being useful.
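As a quick illustration of what staying with the generic ops buys (only a sketch, assuming a float overload of the proposed intrinsic and the same placeholder encodings as before):

%swz = call float @llvm.amdgcn.update.dpp.f32(float 0.0, float %in,
                                              i32 273, i32 15, i32 15, i1 false)
%mul = fmul float %swz, %scale
%res = fadd float %mul, %bias   ; plain fmul + fadd, so mad/fma formation can still match them

With a hypothetical per-op llvm.amdgcn.fadd_dpp, the add would be hidden inside the intrinsic and that kind of matching couldn't happen.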

>
>
> -Tom
>> %new = f32 llvm.amdgcn.fmin_dpp %old, %src0, %src1 (dpp_control)
>>
>> turns into:
>>
>> v_mov_b32 %new, %old
>> v_min_f32 %new, %src0, %src1 (dpp_control)
>>
>> and then we can get what we want directly, without having to do much
>> optimization except for coalescing that already exists. The downside
>> is that we'd have to add a lot more intrinsics (one for min, max,
>> fmin, fmax, add, and iadd).
>>
>>>
>>> -Tom
>>>
>>>> -Matt
>>>>
>>>
>
_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

Alex Bradbury via llvm-dev
On Thu, Jun 15, 2017 at 6:56 AM, Sumner, Brian <[hidden email]> wrote:
> I'm wondering about the focus on bound_ctrl.  Any cleared bit in the row_mask or bank_mask will also disable updating the result.
> Brian

True. Although the same issue comes up if you clear a bit in row_mask
or bank_mask as well: those lanes also need a well-defined result value, which is what the %old operand of the proposed intrinsic would supply.
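
For instance, step 4 of the scan (row_shr:4 with bank_mask:0xe) would presumably go through the same %old fallback: lanes in the masked-off bank, like lanes whose read is invalid, just take %old, and with the identity 0 the add leaves the accumulator untouched there. A sketch, again with placeholder encodings and the same assumed declaration as earlier:

; roughly `v_add_u32 v1, v1, v1 row_shr:4 bank_mask:0xe'
; 276 = 0x114 assumed for row_shr:4, bank_mask 0xe = 14, row_mask 0xf = 15
%swz  = call i32 @llvm.amdgcn.update.dpp(i32 0, i32 %acc, i32 276,
                                         i32 15, i32 14, i1 false)
%acc4 = add i32 %swz, %acc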

_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev