[llvm-dev] Wide load/store optimization question

Peter Bel via llvm-dev
Hi,

I'm writing an LLVM backend for the Epiphany architecture, and I'd appreciate advice on how to implement a load/store optimization. The CPU itself is 32-bit, but it supports wider 64-bit loads and stores, so the basic idea is to use them by combining pairs of narrow accesses.

I've checked how this is done in AArch64 and Hexagon, and my current code is very close to the AArch64 implementation (I used it as a starting point). The problem lies in the constraints imposed by the platform.

The main constraint is that the two registers must be consecutive, and the lower register must be even-numbered (r0 included). The frame offsets must also be consecutive to be merged, with the offset of the lower register dword-aligned.

Because of those constraints I'm currently running this pass at pre-emit, after RA and frame finalization. But by that point most of the choices (register assignment, frame offsets) have already been made, and they are often suboptimal for merging. The most common issue looks something like this:
    str r1, [fp, -4]
    str r2, [fp, -8]
Those two stores can't be merged because the lower register (r1) is not even. To merge them, r1 would have to become r0, and r2 would have to become r1. Sometimes the same problem occurs because the frame offset is misaligned, e.g. r0 ends up at an offset that is word-aligned but not dword-aligned.
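
The constraints described above can be modeled in a few lines. This is a minimal sketch in Python, not code from any backend: the function name `can_merge`, the byte-based offsets, and the 4/8-byte sizes are assumptions made for illustration. The sketch deliberately does not check which register of the pair maps to the lower address, because the post does not say.

```python
WORD = 4   # assumed size in bytes of one 32-bit access
DWORD = 8  # assumed size in bytes of one 64-bit access


def can_merge(reg_a, off_a, reg_b, off_b):
    """Return True if two word accesses satisfy the stated pairing rules.

    reg_*: physical register numbers (r0 -> 0, r1 -> 1, ...)
    off_*: byte offsets from the frame pointer
    """
    lo_reg, hi_reg = sorted((reg_a, reg_b))
    lo_off, hi_off = sorted((off_a, off_b))
    # Registers must be consecutive and the lower one even-numbered.
    regs_ok = lo_reg % 2 == 0 and hi_reg == lo_reg + 1
    # Offsets must be adjacent words and the lower one dword-aligned.
    offs_ok = lo_off % DWORD == 0 and hi_off == lo_off + WORD
    return regs_ok and offs_ok
```

Under these assumptions, `can_merge(1, -4, 2, -8)` rejects the pair from the example because r1 is odd, while `can_merge(0, -4, 1, -8)` accepts the renamed pair.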

Can someone please point me in the right direction? Also, at which stage should I run such a pass? If pre-RA, how do I express register constraints such as a REG_SEQUENCE, as well as frame constraints? If before frame finalization, how do I set frame constraints? If at pre-emit, as I'm doing now, how can I optimize and rewrite frame offsets and registers?

Thanks,
Petr

_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] Wide load/store optimization question

陳韋任 via llvm-dev
Hi Peter,

  For i64, our custom backend only supports load/store instructions. Following Sparc,
we lower load i64 to load v2i32 rather than to two load i32 (LLVM's default Expand action).
I would be happy to hear others' experience with this, too.

HTH,
chenwj


--
Wei-Ren Chen (陳韋任)
Homepage: https://people.cs.nctu.edu.tw/~chenwj

Re: [llvm-dev] Wide load/store optimization question

ORiordan, Martin via llvm-dev
On 06/13/2017 11:44 AM, Peter Bel via llvm-dev wrote:

> Can someone please point me out in which direction should I move? And also - at which step should I apply such pass? If on PreRA - how to set reg constraints such as regsequence, as well as frame constraints? If before frame finalization - how to  set frame constraints? If on pre-emit like i'm doing now - how to optimize and rewrite frame offsets and regs?
>

One thing you can do is define a register class that is made up of register
tuples, e.g. r0r1, r2r3, etc., and use that register class for the 64-bit
load/store instructions.  This lets you do the load/store merging before
register allocation without having to handle the register constraints yourself.

The AMDGPU backend has similar alignment constraints for its
SGPR classes: if you are writing to N consecutive SGPRs,
the lower register index must be divisible by N.
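
To make the "lower index divisible by N" rule concrete, here is a small Python sketch that lists the tuples such a register class would contain. It is only an illustration: the helper name `aligned_tuples` and the register file size of 8 are assumptions, and a real backend would describe this in its TableGen register definitions rather than in code like this.

```python
def aligned_tuples(num_regs, n):
    """List tuples of n consecutive registers whose first index is divisible by n."""
    return [
        tuple(f"r{first + k}" for k in range(n))
        for first in range(0, num_regs - n + 1, n)
    ]


# Pairs for a hypothetical 8-register file: r0r1, r2r3, r4r5, r6r7.
pairs = aligned_tuples(8, 2)
```

Because every tuple starts at a multiple of N, choosing any member of the class already satisfies the even/consecutive requirement, so the merging pass no longer has to reason about individual register numbers.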

-Tom


Re: [llvm-dev] Wide load/store optimization question

陳韋任 via llvm-dev
> One thing you can do is define a register class that is made up of register
> tuples e.g. r0r1, r2r3, etc., and use that register class for the 64-bit
> load/store instructions.  This will allow you to do the load/store
> merging before register allocation without the register constraints.

Our backend only supports load/store for the i64 type, hence i64 is not legal for us.
I guess Peter's Epiphany arch is in a similar situation.

IIRC, LLVM expands load i64 to two load i32. Right now, we have to custom-lower
load i64 to load v2i32, then map v2i32 to the tuple register (similar
to the Sparc backend). How can we use the tuple register for those two i32?
Is there an existing example?

Regards,
chenwj

Re: [llvm-dev] Wide load/store optimization question

Peter Bel via llvm-dev
Hi,

Same here: for 64-bit values my backend only has load/store. But I still use 64-bit virtual registers and expand or declare the missing instructions myself.

I'll try looking into the Sparc backend, thanks. Also, only after writing this post did I find a bunch of built-in transforms. I'm still trying to understand how to use them.

By the way, constraint-wise (alignment), is there any difference between a virtual register class and a register tuple?

Best regards,
Petr




Re: [llvm-dev] Wide load/store optimization question

陳韋任 via llvm-dev


2017-06-17 4:36 GMT+08:00 upcfrost <[hidden email]>:
> By the way, constraint-wise (alignment), is there any difference between virt regclass and regtuple?

I guess not. Tom?

Re: [llvm-dev] Wide load/store optimization question

ORiordan, Martin via llvm-dev

On Jun 16, 2017, at 2:43 PM, 陳韋任 via llvm-dev <[hidden email]> wrote:
> 2017-06-17 4:36 GMT+08:00 upcfrost <[hidden email]>:
>> By the way, constraint-wise (alignment), is there any difference between virt regclass and regtuple?

That question makes no sense.
- Every virtual register has a register class assigned.
- You can construct special register classes that represent register tuples so that when the allocator chooses an entry from that register class it really has chosen a tuple of machine registers (even though it looks like a single register with funny aliasing as far as LLVM codegen is concerned).
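
The two points above can be pictured with a toy model in Python. It is a sketch under stated assumptions, not how LLVM's allocator is implemented: the `PAIRS` table, the four pair names, and the first-fit `allocate_pair` helper are all hypothetical. The point it shows is that the allocator picks one entry from the class, and that entry aliases the two 32-bit registers it covers.

```python
# Hypothetical tuple register class: each entry looks like one register
# to the allocator but aliases two underlying 32-bit registers.
PAIRS = {
    "r0r1": ("r0", "r1"),
    "r2r3": ("r2", "r3"),
    "r4r5": ("r4", "r5"),
    "r6r7": ("r6", "r7"),
}


def allocate_pair(busy):
    """Return the first pair whose underlying registers are all free, else None."""
    busy = set(busy)
    for name, units in PAIRS.items():
        if busy.isdisjoint(units):
            return name
    return None
```

For example, if r1 alone is already in use, `allocate_pair(["r1"])` skips r0r1 because of the aliasing and returns r2r3.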

- Matthias

Re: [llvm-dev] Wide load/store optimization question

陳韋任 via llvm-dev
> That question makes no sense.
> - Every virtual register has a register class assigned.
> - You can construct special register classes that represent register tuples so that when the allocator chooses an entry from that register class it really has chosen a tuple of machine registers (even though it looks like a single register with funny aliasing as far as llvm codegen is concerned).

And we still have to lower load i64 to load v2i32, right?

Re: [llvm-dev] Wide load/store optimization question

陳韋任 via llvm-dev
For anyone who might be interested, I just found a link


which talks about the current Sparc implementation.

Re: [llvm-dev] Wide load/store optimization question

陳韋任 via llvm-dev
Hi James,

  Sorry to bother you. In the previous discussion at http://lists.llvm.org/pipermail/llvm-dev/2017-June/114248.html ,
I suggested that others refer to the Sparc implementation, which lowers load i64 to load v2i32. But I recently found one problem
with bitfield access. If a struct with bitfields is 64 bits in size, a bitfield access generates a lot of load/store
instructions, which come from the load v2i32. Are you aware of this problem?

Thanks.

Regards,
chenwj

Re: [llvm-dev] Wide load/store optimization question

Peter Bel via llvm-dev
Hi,

I've looked through both the AMDGPU and Sparc backends, and it seems they also don't do what I'm trying to do. The only backend that does is AArch64, but it doesn't have the register constraints.
So, here is an example. I have the following C code:

void test(int *z)
{
  int a = 1; int b = 2; int c = 3; int d = 4;
  a++; b++; c++; d++;
}

Without any frontend optimization it compiles to the following IR:

define void @test(i32* %z) #0 {
  %1 = alloca i32*, align 4
  %a = alloca i32, align 4
  %b = alloca i32, align 4
  %c = alloca i32, align 4
  %d = alloca i32, align 4
  store i32* %z, i32** %1, align 4
  store i32 1, i32* %a, align 4
  store i32 2, i32* %b, align 4
  store i32 3, i32* %c, align 4
  store i32 4, i32* %d, align 4
  %2 = load i32, i32* %a, align 4
  %3 = add nsw i32 %2, 1
  store i32 %3, i32* %a, align 4
  %4 = load i32, i32* %b, align 4
  %5 = add nsw i32 %4, 1
  store i32 %5, i32* %b, align 4
  .....
}

This produces the following asm code:

        mov     r2, #1
        str     r2, [fp, #-2]
        mov     r3, #2
        mov     r2, #3
        str     r3, [fp, #-3]
        str     r2, [fp, #-4]
        mov     r3, #4
        ldr     r2, [fp, #-2]
        str     r3, [fp, #-5]
        .....

What I want to do is merge neighboring stores and loads. For example,
        mov     r3, #2
        mov     r2, #3
        str     r3, [fp, #-5]
        str     r2, [fp, #-4]
can be converted to
        mov     r3, #2
        mov     r2, #3
        strd    r2, [fp, #-4]
But the main problem is that in the generated code above the offset for r3 was -3, not -5.

Currently, I'm doing the following. In a pre-RA pass I create a REG_SEQUENCE with the target register class, assign the vregs in question as its subregs, and create a load/store instruction for the sequence with the memory references merged.
This solves the register constraint problem, but the frame allocation problem remains. I'll probably need to use fixed stack objects and manually pre-allocate the frame, which I really don't want to do as it can break some other passes.
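
The rewrite described above can be pictured with a toy model in Python, where machine instructions are plain tuples. This is a sketch, not LLVM API usage: the function name `merge_store_pair`, the opcode names `STR`/`STRD`, and the subregister index names `sub_lo`/`sub_hi` are all hypothetical.

```python
def merge_store_pair(store_lo, store_hi, new_vreg):
    """Rewrite two ("STR", vreg, frame_index) tuples into REG_SEQUENCE + one wide store.

    new_vreg stands for a fresh virtual register in the 64-bit tuple class.
    """
    _, vreg_lo, fi_lo = store_lo
    _, vreg_hi, fi_hi = store_hi
    return [
        # Glue the two 32-bit vregs into one 64-bit vreg as its subregisters.
        ("REG_SEQUENCE", new_vreg, [(vreg_lo, "sub_lo"), (vreg_hi, "sub_hi")]),
        # One wide store that carries both original memory references.
        ("STRD", new_vreg, (fi_lo, fi_hi)),
    ]
```

Note that the wide store in this model still refers to two separate frame indexes. Nothing here forces those two stack objects to be adjacent or dword-aligned, which is exactly the remaining frame allocation problem.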

Petr



Re: [llvm-dev] Wide load/store optimization question

James Y Knight via llvm-dev
Well, that is now a slightly different question.

Once the compiler can do 64-bit loads/stores for a 64-bit integer type (e.g. C long long), then an optimization pass should be merging the loads/stores before register allocation, so that appropriate registers can be chosen.



Re: [llvm-dev] Wide load/store optimization question

Peter Bel via llvm-dev
That's what I've managed to figure out so far. Since a vreg should have only one def and one kill (please correct me if I'm wrong), there shouldn't be any collision when merging them, though it might increase register pressure.

But the frame index reference problem is still there. I need both references to be consecutive, and the lower one should be dword-aligned. If I just add dword alignment to the lower subreg, I might end up with both subregs dword-aligned and an empty word between them. A number of other funny cases are possible.

In short, I just don't know how to glue two frame indexes together into a single block. It's possible to go with fixed frame objects, but I'd prefer to keep that as a last resort, as it may cripple some of the later passes.
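
The gap problem can be reproduced with a toy frame allocator in Python. This is a sketch under assumptions, not LLVM's frame lowering: the `layout` helper, the downward-growing frame, byte offsets, and the object names are made up for illustration. It compares two separate 4-byte objects, one of them dword-aligned, with a single 8-byte object that has 8-byte alignment.

```python
def layout(objects):
    """Assign downward-growing frame offsets to (name, size, align) objects.

    Returns {name: byte offset from fp}; every offset is a multiple of align.
    """
    offsets, cursor = {}, 0
    for name, size, align in objects:
        cursor -= size
        cursor -= cursor % align  # round down to the required alignment
        offsets[name] = cursor
    return offsets


# Two separate words, only the lower one dword-aligned: a hole can appear.
separate = layout([("z", 4, 4), ("hi", 4, 4), ("lo", 4, 8)])
# One 8-byte object with 8-byte alignment: always a contiguous aligned dword.
glued = layout([("z", 4, 4), ("pair", 8, 8)])
```

In this model `separate` places `hi` at -8 and `lo` at -16, leaving an unused word between them, while `glued` places `pair` at -16 as one aligned block. Whether and how two frame indexes can be combined this way in LLVM is not answered in this thread.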

Petr


