[NVPTX] Eliminate common sub-expressions in a group of similar GEPs


[NVPTX] Eliminate common sub-expressions in a group of similar GEPs

Jingyue Wu-2
Hi, 

We wrote an optimization that eliminates common sub-expressions in a group of similar GEPs for the NVPTX backend. It speeds up some of our benchmarks by up to 20%, which convinced us to try to upstream it. Here's a brief description of why we wrote this optimization, what we did, and how we did it.

Loops in CUDA programs are often extensively unrolled by programmers and compilers, leading to many similar
GEPs for array accesses.

e.g., a 2-level loop like

__shared__ float a[32][32];
#pragma unroll
for (int i = 0; i < 2; ++i) {
  #pragma unroll
  for (int j = 0; j < 2; ++j) {
    ...
    ... = a[threadIdx.x + i][threadIdx.y + j];
    ...
  }
}

will be unrolled to:

gep a, 0, tid.x, tid.y; load
gep a, 0, tid.x, tid.y + 1; load
gep a, 0, tid.x + 1, tid.y; load
gep a, 0, tid.x + 1, tid.y + 1; load

The NVPTX backend currently doesn't handle groups of similar multi-dimensional GEPs
well. It emits PTX code that literally computes the pointer address of
each GEP, wasting many registers. For example, it emits the following PTX for the
first load, and similar PTX for the other loads.

mov.u32         %r1, %tid.x;
mov.u32         %r2, %tid.y;
mul.wide.u32    %rl2, %r1, 128;
mov.u64         %rl3, a;
add.s64         %rl4, %rl3, %rl2;
mul.wide.u32    %rl5, %r2, 4;
add.s64         %rl6, %rl4, %rl5;
ld.shared.f32   %f1, [%rl6];

The resultant register pressure causes up to 20% slowdown on some of our
benchmarks.

To reduce register pressure, the optimization implemented in this patch merges
the common subexpressions in a group of GEPs, saving many registers used for
pointer arithmetic. It works by splitting each GEP into a variadic base and a
constant offset. The variadic base can be computed once and reused by multiple
GEPs, and the constant offsets can be folded into NVPTX's base+offset
addressing mode without using any extra registers. For example, we transform the four
GEPs and four loads in the above example conceptually into:

base = gep a, 0, tid.x, tid.y
load base
load base + 1  * sizeof(float)
load base + 32 * sizeof(float)
load base + 33 * sizeof(float)

The resultant PTX code will look like:

mov.u32         %r1, %tid.x;
mov.u32         %r2, %tid.y;
mul.wide.u32    %rl2, %r1, 128;
mov.u64         %rl3, a;
add.s64         %rl4, %rl3, %rl2;
mul.wide.u32    %rl5, %r2, 4;
add.s64         %rl6, %rl4, %rl5;
ld.shared.f32   %f1, [%rl6]; // so far the same as unoptimized PTX
ld.shared.f32   %f2, [%rl6+4]; // much better
ld.shared.f32   %f3, [%rl6+128]; // much better
ld.shared.f32   %f4, [%rl6+132]; // much better

which uses far fewer registers than the unoptimized PTX.

I am attaching a proof-of-concept patch. It fully implements our idea and contains a contrived test case to demonstrate how it works. It also discusses why our implementation is safe, i.e., why the optimization won't introduce new undefined behavior. There's more work to be done, e.g., adding more tests. If this idea sounds good to you, we will improve the patch and send it out for code review.

Thanks, 
Jingyue

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Attachment: gep-cse.patch (33K)

Re: [NVPTX] Eliminate common sub-expressions in a group of similar GEPs

Hal Finkel
Jingyue,

I can't speak for the NVPTX backend, but I think this looks useful as an (optional) target-independent pass. A few thoughts:

 - Running GVN tends to be pretty expensive; have you tried EarlyCSE instead? (When I was working on the BB vectorizer, I first used GVN for cleanup afterward; someone suggested trying EarlyCSE instead. The performance slowdown was a bit less than 1% on average, but the compile-time impact was much less.)

 - Are you doing this on the IR level, instead of in DAGCombine, because you want the cross-block combining from GVN? Or some other reason (or both)?

 - To make this target independent, I think you just need to insert some calls to TLI.isLegalAddressingMode (or equivalently, TTI.isLegalAddressingMode) just to make sure that the offsets you're creating are legal on the target. This will essentially be a noop for NVPTX, but will matter for other targets.

Thanks for posting this,
Hal


--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Re: [NVPTX] Eliminate common sub-expressions in a group of similar GEPs

Justin Holewinski-2
This looks great!  I'm a bit surprised the existing IR optimizers do not handle this.  I agree with Hal that this should be made target-independent.  I don't see anything here that would be specific to NVPTX.  Do you have any performance data for open-source benchmarks?


--

Thanks,

Justin Holewinski


Re: [NVPTX] Eliminate common sub-expressions in a group of similar GEPs

Jingyue Wu-2

Not yet, but I am more than happy to try. Any good benchmarks in mind?

On Apr 19, 2014 7:34 AM, "Justin Holewinski" <[hidden email]> wrote:
> [...] Do you have any performance data for open-source benchmarks?




Re: [NVPTX] Eliminate common sub-expressions in a group of similar GEPs

Jingyue Wu-2
In reply to this post by Hal Finkel
Hi Hal, 

Thanks for your comments! I'm inlining my responses below.

Jingyue


On Sat, Apr 19, 2014 at 6:38 AM, Hal Finkel <[hidden email]> wrote:
> Jingyue,
>
> I can't speak for the NVPTX backend, but I think this looks useful as an (optional) target-independent pass. A few thoughts:
>
>  - Running GVN tends to be pretty expensive; have you tried EarlyCSE instead? (When I was working on the BB vectorizer I was first using GVN for cleanup afterward, someone suggested trying EarlyCSE instead, the performance slowdown was a bit less than 1% on average, but the runtime impact was much less).

EarlyCSE surprisingly generates code that runs 10% slower than GVN on one of our benchmarks. However, after looking into the .ll and PTX, we probably shouldn't blame EarlyCSE for the slowdown. The PTX generated using EarlyCSE is only slightly different from that generated with GVN, but ptxas amplifies the seemingly harmless difference into much higher register usage. We need to investigate this issue further.

I can change the code to use EarlyCSE by default, and leave a flag to run GVN. 
 

>  - Are you doing this on the IR level, instead of in DAGCombine, because you want the cross-block combining from GVN? Or some other reason (or both)?

Cross-block combining from GVN is one of the reasons. The second reason is that we observed a GEP and its index are sometimes defined in different BBs. The third is that a GEP and the load/store that uses it are sometimes in different BBs. Although CodeGenPrepare attempts to sink a GEP into the same BB as its load/store user, it doesn't sink GEPs with two variadic indices (e.g., gep %a, 0, %i, %j), because it "smartly" recognizes that PTX doesn't support any addressing mode that can fold such a GEP. (This addressing-mode issue is worth another thread, and one of my colleagues is working on it.)
 

>  - To make this target independent, I think you just need to insert some calls to TLI.isLegalAddressingMode (or equivalently, TTI.isLegalAddressingMode) just to make sure that the offsets you're creating are legal on the target. This will essentially be a noop for NVPTX, but will matter for other targets.

Ack'ed
 

> Thanks for posting this,
> Hal




Re: [NVPTX] Eliminate common sub-expressions in a group of similar GEPs

Hal Finkel
----- Original Message -----

> From: "Jingyue Wu" <[hidden email]>
> To: "Hal Finkel" <[hidden email]>
> Cc: [hidden email], "Eli Bendersky" <[hidden email]>, "Justin Holewinski" <[hidden email]>, "Justin
> Holewinski" <[hidden email]>
> Sent: Sunday, April 20, 2014 9:34:57 PM
> Subject: Re: [LLVMdev] [NVPTX] Eliminate common sub-expressions in a group of similar GEPs
>
> Hi Hal,
>
> Thanks for your comments! I'm inlining my responses below.
>
> Jingyue
>
> On Sat, Apr 19, 2014 at 6:38 AM, Hal Finkel <[hidden email]> wrote:
>
> Jingyue,
>
> I can't speak for the NVPTX backend, but I think this looks useful as
> an (optional) target-independent pass. A few thoughts:
>
> - Running GVN tends to be pretty expensive; have you tried EarlyCSE
> instead? (When I was working on the BB vectorizer I was first using
> GVN for cleanup afterward, someone suggested trying EarlyCSE
> instead, the performance slowdown was a bit less than 1% on average,
> but the runtime impact was much less).
>
>
>
> EarlyCSE surprisingly generates code that runs 10% slower than GVN on
> one of our benchmarks. However, after looking into the ll and ptx,
> we probably shouldn't blame EarlyCSE for the slowdown. The ptx
> generated using EarlyCSE is only slightly different from that
> generated by GVN, but ptxas amplifies the seemingly unharmful
> difference to much more register usage. We need to further
> investigate this issue.
>
>
> I can change the code to use EarlyCSE by default, and leave a flag to
> run GVN.

I think it is also perfectly reasonable (and perhaps even makes sense) to choose GVN for CodeGenOpt::Aggressive and EarlyCSE otherwise.

>
>
>
> - Are you doing this on the IR level, instead of in DAGCombine,
> because you want the cross-block combining from GVN? Or some other
> reason (or both)?
>
>
>
> Cross-block combining from GVN is one of the reasons. The second
> reason is we observed a GEP and its index are sometimes defined in
> different BBs.

Makes sense.

> The third reason is we also observed a GEP and the
> load/store that uses it are sometimes in different BBs. Although
> CodeGenPrepare attempts to sink a GEP to the same BB as its
> load/store user, it doesn't sink GEPs with two variadic indices
> (e.g., gep %a, 0, %i, %j) because it "smartly" recognizes ptx
> doesn't support any addressing mode that can fold this GEP

That makes sense: because NVPTX does not support r+r addressing.

> (This
> addressing mode issue is worth another thread, and one of my
> colleagues is working on that).

Looking forward to it.

 -Hal


--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Re: [NVPTX] Eliminate common sub-expressions in a group of similar GEPs

Hal Finkel
----- Original Message -----

> From: "Hal Finkel" <[hidden email]>
> To: "Jingyue Wu" <[hidden email]>
> Cc: "Justin Holewinski" <[hidden email]>, [hidden email]
> Sent: Sunday, April 20, 2014 10:01:49 PM
> Subject: Re: [LLVMdev] [NVPTX] Eliminate common sub-expressions in a group of similar GEPs
>
> ----- Original Message -----
> > From: "Jingyue Wu" <[hidden email]>
> > To: "Hal Finkel" <[hidden email]>
> > Cc: [hidden email], "Eli Bendersky" <[hidden email]>,
> > "Justin Holewinski" <[hidden email]>, "Justin
> > Holewinski" <[hidden email]>
> > Sent: Sunday, April 20, 2014 9:34:57 PM
> > Subject: Re: [LLVMdev] [NVPTX] Eliminate common sub-expressions in
> > a group of similar GEPs
> >
> >
> > Hi Hal,
> >
> >
> > Thanks for your comments! I'm inlining my responses below.
> >
> >
> > Jingyue
> >
> >
> >
> > On Sat, Apr 19, 2014 at 6:38 AM, Hal Finkel < [hidden email] >
> > wrote:
> >
> >
> > Jingyue,
> >
> > I can't speak for the NVPTX backend, but I think this looks useful
> > as
> > an (optional) target-independent pass. A few thoughts:
> >
> > - Running GVN tends to be pretty expensive; have you tried EarlyCSE
> > instead? (When I was working on the BB vectorizer I was first using
> > GVN for cleanup afterward; someone suggested trying EarlyCSE instead,
> > and the performance slowdown was a bit less than 1% on average, but
> > the runtime impact was much less.)
> >
> >
> >
> > Surprisingly, EarlyCSE generates code that runs 10% slower than GVN on
> > one of our benchmarks. However, after looking into the .ll and PTX, we
> > probably shouldn't blame EarlyCSE for the slowdown: the PTX generated
> > using EarlyCSE is only slightly different from that generated by GVN,
> > but ptxas amplifies the seemingly harmless difference into much higher
> > register usage. We need to investigate this issue further.
> >
> >
> > I can change the code to use EarlyCSE by default, and leave a flag
> > to
> > run GVN.
>
> I think it is also perfectly reasonable (and perhaps even makes
> sense) to choose GVN for CodeGenOpt::Aggressive and EarlyCSE
> otherwise.

Also, do you see any substantial speedups just from running a late GVN (or EarlyCSE) independent of the GEP optimization?

 -Hal

>
> >
> >
> >
> > - Are you doing this on the IR level, instead of in DAGCombine,
> > because you want the cross-block combining from GVN? Or some other
> > reason (or both)?
> >
> >
> >
> > Cross-block combining from GVN is one of the reasons. The second
> > reason is that we observed a GEP and its index are sometimes defined
> > in different BBs.
>
> Makes sense.
>
> > The third reason is that we also observed a GEP and the
> > load/store that uses it are sometimes in different BBs. Although
> > CodeGenPrepare attempts to sink a GEP to the same BB as its
> > load/store user, it doesn't sink GEPs with two variadic indices
> > (e.g., gep %a, 0, %i, %j) because it "smartly" recognizes that PTX
> > doesn't support any addressing mode that can fold such a GEP.
>
> That makes sense: because NVPTX does not support r+r.
>
> > (This
> > addressing mode issue is worth another thread, and one of my
> > colleagues is working on that).
>
> Looking forward to it.
>
>  -Hal
>
> >
> >
> >
> > - To make this target independent, I think you just need to insert
> > some calls to TLI.isLegalAddressingMode (or equivalently,
> > TTI.isLegalAddressingMode) just to make sure that the offsets
> > you're
> > creating are legal on the target. This will essentially be a noop
> > for NVPTX, but will matter for other targets.
> >
> >
> >
> > Ack'ed
> >
> >
> >
> > Thanks for posting this,
> > Hal
> >

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Re: [NVPTX] Eliminate common sub-expressions in a group of similar GEPs

Jingyue Wu-2
In reply to this post by Hal Finkel
Hal, 

Would you put this pass in lib/CodeGen or lib/Transforms? lib/CodeGen sounds too low-level; lib/Transforms doesn't seem to encourage using TargetLowering or TargetTransformInfo. 

Jingyue



Re: [NVPTX] Eliminate common sub-expressions in a group of similar GEPs

Hal Finkel
----- Original Message -----

> From: "Jingyue Wu" <[hidden email]>
> To: "Hal Finkel" <[hidden email]>
> Cc: [hidden email], "Eli Bendersky" <[hidden email]>, "Justin Holewinski" <[hidden email]>, "Justin
> Holewinski" <[hidden email]>
> Sent: Monday, April 21, 2014 3:15:38 PM
> Subject: Re: [LLVMdev] [NVPTX] Eliminate common sub-expressions in a group of similar GEPs
>
>
> Hal,
>
>
> Would you put this pass in lib/CodeGen or lib/Transforms? lib/CodeGen
> sounds too low-level; lib/Transforms doesn't seem to encourage using
> TargetLowering or TargetTransformInfo.

In this regard, I don't like the current setup either. However, things being as they are, I'd put it in lib/Transforms/Scalar, because that is where LoopStrengthReduce is. I think that LSR is the closest cousin to your pass, and it uses TTI to check address mode legality.

 -Hal


Re: [NVPTX] Eliminate common sub-expressions in a group of similar GEPs

Jingyue Wu-2
In reply to this post by Hal Finkel
No, I don't observe any speedup. 


On Sun, Apr 20, 2014 at 9:02 PM, Hal Finkel <[hidden email]> wrote:

> Also, do you see any substantial speedups just from running a late GVN (or EarlyCSE) independent of the GEP optimization?
>
>  -Hal

>
> >
> >
> >
> > - Are you doing this on the IR level, instead of in DAGCombine,
> > because you want the cross-block combining from GVN? Or some other
> > reason (or both)?
> >
> >
> >
> > Cross-block combining from GVN is one of the reasons. The second
> > reason is we observed a GEP and its index are sometimes defined in
> > different BBs.
>
> Makes sense.
>
> > The third reason is we also observed a GEP and the
> > load/store that uses it are sometimes in different BBs. Although
> > CodeGenPrepare attempts to sink a GEP to the same BB as its
> > load/store user, it doesn't sink GEPs with two variadic indices
> > (e.g., gep %a, 0, %i, %j) because it "smartly" recognizes ptx
> > doesn't support any addressing mode that can fold this GEP
>
> That makes sense: because NVPTX does not support r+r.
>
> > (This
> > addressing mode issue is worth another thread, and one of my
> > colleagues is working on that).
>
> Looking forward to it.
>
>  -Hal
>
> >
> >
> >
> > - To make this target independent, I think you just need to insert
> > some calls to TLI.isLegalAddressingMode (or equivalently,
> > TTI.isLegalAddressingMode) just to make sure that the offsets
> > you're
> > creating are legal on the target. This will essentially be a noop
> > for NVPTX, but will matter for other targets.
> >
> >
> >
> > Ack'ed
> >
> >
> >
> > Thanks for posting this,
> > Hal
> >
> >
> >
> > ----- Original Message -----
> > > From: "Jingyue Wu" < [hidden email] >
> > > To: [hidden email] , "Eli Bendersky" < [hidden email] >,
> > > "Justin Holewinski" < [hidden email] >, "Justin
> > > Holewinski" < [hidden email] >
> > > Sent: Saturday, April 19, 2014 12:02:28 AM
> > > Subject: [LLVMdev] [NVPTX] Eliminate common sub-expressions in a
> > > group of similar GEPs
> > >
> > >
> > >
> > >
> > > Hi,
> > >
> > >
> > > We wrote an optimization that eliminates common sub-expressions
> > > in
> > > a
> > > group of similar GEPs for the NVPTX backend. It speeds up some of
> > > our benchmarks by up to 20%, which convinces us to try to
> > > upstream
> > > it. Here's a brief description of why we wrote this optimization,
> > > what we did, and how we did it.
> > >
> > >
> > > Loops in CUDA programs are often extensively unrolled by
> > > programmers
> > > and compilers, leading to many similar
> > > GEPs for array accesses.
> > >
> > >
> > > e.g., a 2-level loop like
> > >
> > >
> > > __shared__ float a[32][32];
> > > unroll for (int i = 0; i < 2; ++i) {
> > > unroll for (int j = 0; j < 2; ++j) {
> > > ...
> > > ... = a[threadIdx.x + i][threadIdx.y + j];
> > > ...
> > > }
> > > }
> > >
> > >
> > > will be unrolled to:
> > >
> > >
> > > gep a, 0, tid.x, tid.y; load
> > > gep a, 0, tid.x, tid.y + 1; load
> > > gep a, 0, tid.x + 1, tid.y; load
> > > gep a, 0, tid.x + 1, tid.y + 1; load
> > >
> > >
> > > The NVPTX backend currently doesn't handle many similar
> > > multi-dimensional GEPs
> > > well enough. It emits PTX code that literally computes the
> > > pointer
> > > address of
> > > each GEP, wasting tons of registers. e.g., it emits the following
> > > PTX
> > > for the
> > > first load and similar PTX for other loads.
> > >
> > >
> > > mov.u32 %r1, %tid.x;
> > > mov.u32 %r2, %tid.y;
> > > mul.wide.u32 %rl2, %r1, 128;
> > > mov.u64 %rl3, a;
> > > add.s64 %rl4, %rl3, %rl2;
> > > mul.wide.u32 %rl5, %r2, 4;
> > > add.s64 %rl6, %rl4, %rl5;
> > > ld.shared.f32 %f1, [%rl6];
> > >
> > >
> > > The resultant register pressure causes up to 20% slowdown on some
> > > of
> > > our
> > > benchmarks.
> > >
> > >
> > > To reduce register pressure, the optimization implemented in this
> > > patch merges
> > > the common subexpression in a group of GEPs, saving many
> > > registers
> > > used for
> > > pointer arithmetics. It works by splitting each GEP into a
> > > variadic
> > > base and a
> > > constant offset. The variadic base can be computed once and
> > > reused
> > > by
> > > multiple
> > > GEPs, and the constant offsets can be nicely folded into NVPTX's
> > > base+offset
> > > addressing mode without using any extra register. e.g., we
> > > transform
> > > the four
> > > GEPs and four loads in the above example conceptually into:
> > >
> > >
> > > base = gep a, 0, x, y
> > > load base
> > > laod base + 1 * sizeof(float)
> > > load base + 32 * sizeof(float)
> > > load base + 33 * sizeof(float)
> > >
> > >
> > > The resultant PTX code will look like:
> > >
> > >
> > > mov.u32 %r1, %tid.x;
> > > mov.u32 %r2, %tid.y;
> > > mul.wide.u32 %rl2, %r1, 128;
> > > mov.u64 %rl3, a;
> > > add.s64 %rl4, %rl3, %rl2;
> > > mul.wide.u32 %rl5, %r2, 4;
> > > add.s64 %rl6, %rl4, %rl5;
> > > ld.shared.f32 %f1, [%rl6]; // so far the same as unoptimized PTX
> > > ld.shared.f32 %f2, [%rl6+4]; // much better
> > > ld.shared.f32 %f3, [%rl6+128]; // much better
> > > ld.shared.f32 %f4, [%rl6+132]; // much better
> > >
> > >
> > > which uses much fewer registers than the unoptimized PTX.
> > >
> > >
> > > I am attaching a proof-of-concept patch. It fully implements our
> > > idea
> > > and contains a contrived test case to demonstrate how it works.
> > > It
> > > also discusses why our implementation is safe in terms that the
> > > optimization won't cause new undefined behavior. There's more
> > > work
> > > that needs to be done, e.g., adding more tests. If this idea
> > > sounds
> > > good to you, we will improve the patch and send it out for code
> > > review.
> > >
> > >
> > >
> > > Thanks,
> > > Jingyue
> > > _______________________________________________
> > > LLVM Developers mailing list
> > > [hidden email] http://llvm.cs.uiuc.edu
> > > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> > >
> >
> > --
> > Hal Finkel
> > Assistant Computational Scientist
> > Leadership Computing Facility
> > Argonne National Laboratory
> >
> >
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> _______________________________________________
> LLVM Developers mailing list
> [hidden email]         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory


_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev