[llvm-dev] NVPTX - Reordering load instructions


[llvm-dev] NVPTX - Reordering load instructions

Robin Eklind via llvm-dev
Hi all,

I'm looking into the performance difference of a benchmark compiled with
NVCC vs NVPTX (coming from Julia, not CUDA C) and I'm seeing a
significant difference due to PTX instruction ordering. The relevant
source code consists of two nested loops that get fully unrolled, doing
some basic arithmetic with values loaded from shared memory:

> #define BLOCK_SIZE 16
>
> __shared__ float dia[BLOCK_SIZE][BLOCK_SIZE];
> __shared__ float peri_col[BLOCK_SIZE][BLOCK_SIZE];
>
> int idx = threadIdx.x - BLOCK_SIZE;
> for (int i = 0; i < BLOCK_SIZE; i++) {
>  for (int j = 0; j < i; j++)
>      peri_col[idx][i] -= peri_col[idx][j] * dia[j][i];
>  peri_col[idx][i] /= dia[i][i];
> }

NVCC emits PTX instructions where all loads from shared memory are
packed together:

> ...
> ld.shared.f32   %f546, [kernel_dia+440];
> ld.shared.f32   %f545, [%r4+-996];
> ld.shared.f32   %f544, [kernel_dia+56];
> ld.shared.f32   %f543, [kernel_dia+88];
> ld.shared.f32   %f542, [kernel_dia+500];
> ld.shared.f32   %f541, [kernel_dia+84];
> ld.shared.f32   %f540, [%r4+-972];
> ld.shared.f32   %f539, [%r4+-1008];
> ld.shared.f32   %f538, [kernel_dia+496];
> ld.shared.f32   %f537, [kernel_dia+136];
> ld.shared.f32   %f536, [%r4+-976];
> ld.shared.f32   %f535, [kernel_dia+428];
> ... # hundreds of these

Even though this heavily bloats register usage (and NVCC seems to do
this unconditionally, even with launch configurations where this could
hurt performance), it allows the CUDA PTX JIT to emit 128-bit loads:

> LDS.128 R76, [0x2f0];
> LDS.128 R60, [0xa0];
> LDS.128 R72, [0x130];
> LDS.128 R96, [0x1b0];
> LDS.128 R92, [0x30];
> LDS.128 R116, [0x50];
> LDS.128 R108, [0x1f0];

LLVM preserves the operations more or less as they are emitted by the
front-end, interleaving memory operations with arithmetic. As a result,
the SASS code contains many more 32-bit loads, which lowers performance
by ~10% on this specific benchmark.

What would be the best approach to improve generated code? I can imagine
a late IR pass shuffling instructions around, but I figured I'd ask to
see if this is a good approach and whether there's existing work doing
similar transformations.

Thanks,
--
Tim Besard
Computer Systems Lab
Department of Electronics & Information Systems
Ghent University
_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] NVPTX - Reordering load instructions

Robin Eklind via llvm-dev

On 06/21/2018 12:18 PM, Tim Besard via llvm-dev wrote:

> [...]
> What would be the best approach to improve generated code? I can imagine
> a late IR pass shuffling instructions around, but I figured I'd ask to
> see if this is a good approach and whether there's existing work doing
> similar transformations.

You could write a custom pass, either late IR or MI level. You might also
be able to use the existing instruction-scheduling infrastructure: you can
implement a ScheduleDAGMutation that does the clustering you'd like, or,
if the existing ones do what you want, use those. We have preexisting
createLoadClusterDAGMutation and createStoreClusterDAGMutation
functions. If you look at AMDGPU/AMDGPUTargetMachine.cpp, you'll see
calls like this:

    DAG->addMutation(createLoadClusterDAGMutation(DAG->TII, DAG->TRI));

and I think you probably want to do the same.

Also, you might want to override the subtarget's useAA() method to
return true (as this gives more freedom to the scheduler to move memory
accesses around to do this kind of clustering).

 -Hal

>
> Thanks,

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


Re: [llvm-dev] NVPTX - Reordering load instructions

Robin Eklind via llvm-dev
We already have a pass that vectorizes loads and stores in nvptx and amdgpu.  Not at my laptop, I forget the exact filename, but it's called load-store vectorizer.

I think the question is, why is LSV not vectorizing this code?

I think the answer is, llvm can't tell that the loads are aligned.  Ptxas can, but only because it's (apparently) doing vectorization *after* it resolves the shmem variables to physical addresses.  That is a cool trick, and llvm can't do it, because llvm never sees the physical shmem addresses.

If you told llvm that the shmem variables were aligned to 16 bytes, LSV might do what you want here.  llvm and ptxas should be able to cooperate to give you the alignment you ask for in the IR.
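In CUDA C terms, the annotation would look like the following sketch (whether this alone recovers the LDS.128 accesses on this benchmark is exactly what would need verifying):

```cuda
// Request 16-byte alignment so each group of four consecutive floats
// can be loaded as one 128-bit (LDS.128) access.
__shared__ __align__(16) float dia[BLOCK_SIZE][BLOCK_SIZE];
__shared__ __align__(16) float peri_col[BLOCK_SIZE][BLOCK_SIZE];
```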

If that doesn't work I'd recommend trying to debug the LSV code to see why it's not vectorizing.  You can run `opt -load-store-vectorizer -debug` -- or, my favorite way of doing it is to run that command under rr-project.org.
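For reference, the transformation to look for in that debug output: given sufficient alignment, LSV merges adjacent scalar shared-memory loads into one vector load. A hypothetical before/after (names and offsets invented):

```llvm
; before: four adjacent 32-bit loads from shared memory (addrspace 3)
%f0 = load float, float addrspace(3)* %p0, align 16
%f1 = load float, float addrspace(3)* %p1, align 4
%f2 = load float, float addrspace(3)* %p2, align 8
%f3 = load float, float addrspace(3)* %p3, align 4

; after LSV: a single 128-bit load, lowered to ld.shared.v4.f32 in PTX
%vp = bitcast float addrspace(3)* %p0 to <4 x float> addrspace(3)*
%v  = load <4 x float>, <4 x float> addrspace(3)* %vp, align 16
%f0 = extractelement <4 x float> %v, i32 0
```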

It's possible that clang should opportunistically mark all shmem variables over a certain size as align(16) so that this happens automagically.  That would be kind of a weird heuristic, but maybe it makes sense.  I don't think it would make sense for LLVM itself to do that, though, so it wouldn't help you.

I think relying on LSV to do its job is better than messing with the instruction order because the former is more powerful -- it can vectorize in cases where ptxas would have a much harder time.

Justin

On Thu, Jun 21, 2018, 7:48 AM Hal Finkel via llvm-dev <[hidden email]> wrote:

> [...]

Re: [llvm-dev] NVPTX - Reordering load instructions

Robin Eklind via llvm-dev

On 06/21/2018 02:32 PM, Justin Lebar wrote:
> We already have a pass that vectorizes loads and stores in nvptx and amdgpu.  Not at my laptop, I forget the exact filename, but it's called load-store vectorizer.

It's here: lib/Transforms/Vectorize/LoadStoreVectorizer.cpp

I agree that, if LLVM can do this explicitly, it seems better.

 -Hal


> [...]

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


Re: [llvm-dev] NVPTX - Reordering load instructions

Robin Eklind via llvm-dev
Hi Hal, Justin,

Very useful information, thanks! LSV definitely seems like the best way
to approach this; I was too focused on matching nvcc's output. It also
doesn't look like NVPTX uses the MachineScheduler, and enabling it plus
load clustering didn't seem to have any impact (but I didn't look into
it very closely).

> I think the answer is, llvm can't tell that the loads are aligned.
> Ptxas can, but only because it's (apparently) doing vectorization
> *after* it resolves the shmem variables to physical addresses.  That
> is a cool trick, and llvm can't do it, because llvm never sees the
> physical shmem addresses.
>
> If you told llvm that the shmem variables were aligned to 16 bytes,
> LSV might do what you want here.  llvm and ptxas should be able to
> cooperate to give you the alignment you ask for in the IR.

That's pretty cool indeed, bumping the shmem GV alignment to 16 bytes
enables LSV and gets me most of the way. Some operations still aren't
vectorized though, but I know where to look now.

> It's possible that clang should opportunistically mark all shmem
> variables over a certain size as align(16) so that this happens
> automagically.  That would kind of be a weird heuristic, but maybe it
> makes sense.  I don't think that would make sense for LLVM to do that,
> though, so it wouldn't help you.

Easy enough for us to do this [1], so I'll try it out :-) That said,
nvcc emits code with `.align 4`. Maybe they rely on ptxas for that.

1: https://github.com/JuliaGPU/CUDAnative.jl/pull/204

Best,
--
Tim Besard
Computer Systems Lab
Department of Electronics & Information Systems
Ghent University