[llvm-dev] Some questions about software pipeline in LLVM 4.0.0

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

[llvm-dev] Some questions about software pipeline in LLVM 4.0.0

ORiordan, Martin via llvm-dev

Hi,

 

I have some questions about the implementation of Software pipeline in MachinePipeliner.cpp.

 

First, in hexagon backend, between MachinePipeliner and regalloc pass, there're some other passes like phi eliminate, two-address, register coalescing, which may change or insert intructions like 'copy' in MBB, and swp kernel loop may be destroyed by these passes.

Why not put MachinePipeliner just before reg alloc pass like gcc’s modulo scheduler does? In order to keep SSA pattern?

I found many codes to process PHI nodes in MachinePipeliner.cpp. So I think if we move MachinePipeliner just before regalloc, it will simplify the data/resource dependency graph for SMS.

 

Another question, in gcc, there's a flag BB_DISABLE_SCHEDULE in Basic block, which is used by SMS to prevent other schedulers from messing with the loop schedule. So, in llvm , where can I find the similar flag to prevent the machine scheduler touch the kernel loop?

I have debug some swp cases(hexagon), and find machine scheduler will re-schedule the SMS kernel loop. Why not add such a flag?

 

 

Best Regards,

 

Thanks


_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Reply | Threaded
Open this post in threaded view
|

Re: [llvm-dev] Some questions about software pipeline in LLVM 4.0.0

ORiordan, Martin via llvm-dev
Hi Brendon,

  Maybe you can explain the reason behind current SMS implementation somehow? :-)

Regards,
chenwj



2017-05-25 16:33 GMT+08:00 zhangqiang (CO) via llvm-dev <[hidden email]>:

Hi,

 

I have some questions about the implementation of Software pipeline in MachinePipeliner.cpp.

 

First, in hexagon backend, between MachinePipeliner and regalloc pass, there're some other passes like phi eliminate, two-address, register coalescing, which may change or insert intructions like 'copy' in MBB, and swp kernel loop may be destroyed by these passes.

Why not put MachinePipeliner just before reg alloc pass like gcc’s modulo scheduler does? In order to keep SSA pattern?

I found many codes to process PHI nodes in MachinePipeliner.cpp. So I think if we move MachinePipeliner just before regalloc, it will simplify the data/resource dependency graph for SMS.

 

Another question, in gcc, there's a flag BB_DISABLE_SCHEDULE in Basic block, which is used by SMS to prevent other schedulers from messing with the loop schedule. So, in llvm , where can I find the similar flag to prevent the machine scheduler touch the kernel loop?

I have debug some swp cases(hexagon), and find machine scheduler will re-schedule the SMS kernel loop. Why not add such a flag?

 

 

Best Regards,

 

Thanks


_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev




--
Wei-Ren Chen (陳韋任)
Homepage: https://people.cs.nctu.edu.tw/~chenwj

_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Reply | Threaded
Open this post in threaded view
|

Re: [llvm-dev] Some questions about software pipeline in LLVM 4.0.0

ORiordan, Martin via llvm-dev
In reply to this post by ORiordan, Martin via llvm-dev

Hi - I replied to the original sender only by mistake. Sorry about that.

 

When we started working on the pipeliner, and added it before the scheduler, we also were concerned that the scheduler or other passes would undo the work of the pipeliner. The initial thought was that we would add information (using metadata or some other way like you’ve suggested) to the basic block to tell the scheduler not to schedule the block.  It turns out, that for us, we never needed to do so.  It was pretty rare that the scheduler would “undo” the work of the pipeliner. Actually, in the cases that it did, it turned out to be a problem with the scheduler since it wasn’t making good decisions.

 

In general, most of the extra copies that are added by prior to the register allocator are eliminated.  Certainly, there are some real copies that end up being generated, but I think it’s better to exclude the copies from the schedule since most will be eliminated.  Otherwise, including the copies in the schedule will require resources that may never be used, which is worse in my opinion.

 

We decided to run the pipeliner on SSA form since the presence of the Phis helps identify recurrences and other dependences.  Without the Phis, we need another way to identify recurrences.  Also, if it’s done just prior to register allocation we need to re-generate the liveness information for all the new virtual registers and CFG. Unfortunately, you’re correct – there is a lot of code that deals with Phis. The code that generates the Phis in the swp kernel and epilogs is a mess and very complicated.  This portion of the pipeliner really needs some attention to reduce the complexity and improve readability.  This has been on my list for quite a while.

 

While I think we could move the location of the pipeliner, I don’t think the extra work to do so would provide much benefit. In general, we’ve been able to work around the cases when extra copies or instructions are added, or when the scheduler messes up the kernel.  Also, for Hexagon, there are many passes that run after the register that deal with scheduling. If you have specific cases where you’re seeing a problem, it would be interesting to take a look at them.

 

Thanks,

Brendon

 

From: zhangqiang (CO) [mailto:[hidden email]]
Sent: Thursday, May 25, 2017 3:33 AM
To: [hidden email]
Cc: [hidden email]
Subject: Some questions about software pipeline in LLVM 4.0.0

 

Hi,

 

I have some questions about the implementation of Software pipeline in MachinePipeliner.cpp.

 

First, in hexagon backend, between MachinePipeliner and regalloc pass, there're some other passes like phi eliminate, two-address, register coalescing, which may change or insert intructions like 'copy' in MBB, and swp kernel loop may be destroyed by these passes.

Why not put MachinePipeliner just before reg alloc pass like gcc’s modulo scheduler does? In order to keep SSA pattern?

I found many codes to process PHI nodes in MachinePipeliner.cpp. So I think if we move MachinePipeliner just before regalloc, it will simplify the data/resource dependency graph for SMS.

 

Another question, in gcc, there's a flag BB_DISABLE_SCHEDULE in Basic block, which is used by SMS to prevent other schedulers from messing with the loop schedule. So, in llvm , where can I find the similar flag to prevent the machine scheduler touch the kernel loop?

I have debug some swp cases(hexagon), and find machine scheduler will re-schedule the SMS kernel loop. Why not add such a flag?

 

 

Best Regards,

 

Thanks


_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Reply | Threaded
Open this post in threaded view
|

Re: [llvm-dev] Some questions about software pipeline in LLVM 4.0.0

ORiordan, Martin via llvm-dev

Hi Brendon

 

Certainly, there are some real copies that end up being generated, but I think it’s better to exclude the copies from the schedule since most will be eliminated. 

 

I was wondering what was the cause of the real copies that was being generated in your experience? Something that I noticed when experimenting with LLVM on our out-of-tree backend, was that there are copy instructions generated **because of** modulo scheduling.

 

For example before modulo scheduling I have

 

%vreg6<def> = PHI %vreg23, <BB#1>, %vreg17

%vreg25<def> = INSN1 %vreg1, %vreg6;

% vreg26<def> = INSN1 %vreg2, %vreg6     ß same opcode as previous insn

% vreg17<def> = INSN2 %vreg6, %vreg5;

 

So for the phi node here, if we do phi elimination and register coalescing, we won’t have any copy insn left. But after modulo scheduling the instructions above, now appear like this:

 

%vreg73<def> = PHI %vreg59, <BB#5>, %vreg62, <BB#6>;

%vreg61<def> = INSN1 %vreg1, %vreg73;

%vreg62<def> = INSN2 %vreg73, %vreg5;

%vreg64<def> = INSN1 %vreg2, %vreg73;

 

Now if you look right after the third insn after modulo scheduling, both vreg73 and vreg62 are live here. So when we remove the corresponding phi instruction, we end up with a copy instruction that cannot be removed by register coalescing.

 

IIUC, this is a byproduct of modulo scheduling. I have not really started tuning modulo scheduling for our target, so I don’t know if this is a result of modulo scheduling not being tuned or not? Have you seen this type of Copy? Any insights are greatly appreciated.

 

Thanks

Ehsan

 

 

From: llvm-dev [mailto:[hidden email]] On Behalf Of Brendon Cahoon via llvm-dev
Sent: Wednesday, May 31, 2017 8:00 PM
To: zhangqiang (CO); [hidden email]
Subject: Re: [llvm-dev] Some questions about software pipeline in LLVM 4.0.0

 

Hi - I replied to the original sender only by mistake. Sorry about that.

 

When we started working on the pipeliner, and added it before the scheduler, we also were concerned that the scheduler or other passes would undo the work of the pipeliner. The initial thought was that we would add information (using metadata or some other way like you’ve suggested) to the basic block to tell the scheduler not to schedule the block.  It turns out, that for us, we never needed to do so.  It was pretty rare that the scheduler would “undo” the work of the pipeliner. Actually, in the cases that it did, it turned out to be a problem with the scheduler since it wasn’t making good decisions.

 

In general, most of the extra copies that are added by prior to the register allocator are eliminated.  Certainly, there are some real copies that end up being generated, but I think it’s better to exclude the copies from the schedule since most will be eliminated.  Otherwise, including the copies in the schedule will require resources that may never be used, which is worse in my opinion.

 

We decided to run the pipeliner on SSA form since the presence of the Phis helps identify recurrences and other dependences.  Without the Phis, we need another way to identify recurrences.  Also, if it’s done just prior to register allocation we need to re-generate the liveness information for all the new virtual registers and CFG. Unfortunately, you’re correct – there is a lot of code that deals with Phis. The code that generates the Phis in the swp kernel and epilogs is a mess and very complicated.  This portion of the pipeliner really needs some attention to reduce the complexity and improve readability.  This has been on my list for quite a while.

 

While I think we could move the location of the pipeliner, I don’t think the extra work to do so would provide much benefit. In general, we’ve been able to work around the cases when extra copies or instructions are added, or when the scheduler messes up the kernel.  Also, for Hexagon, there are many passes that run after the register that deal with scheduling. If you have specific cases where you’re seeing a problem, it would be interesting to take a look at them.

 

Thanks,

Brendon

 

From: zhangqiang (CO) [[hidden email]]
Sent: Thursday, May 25, 2017 3:33 AM
To: [hidden email]
Cc: [hidden email]
Subject: Some questions about software pipeline in LLVM 4.0.0

 

Hi,

 

I have some questions about the implementation of Software pipeline in MachinePipeliner.cpp.

 

First, in hexagon backend, between MachinePipeliner and regalloc pass, there're some other passes like phi eliminate, two-address, register coalescing, which may change or insert intructions like 'copy' in MBB, and swp kernel loop may be destroyed by these passes.

Why not put MachinePipeliner just before reg alloc pass like gcc’s modulo scheduler does? In order to keep SSA pattern?

I found many codes to process PHI nodes in MachinePipeliner.cpp. So I think if we move MachinePipeliner just before regalloc, it will simplify the data/resource dependency graph for SMS.

 

Another question, in gcc, there's a flag BB_DISABLE_SCHEDULE in Basic block, which is used by SMS to prevent other schedulers from messing with the loop schedule. So, in llvm , where can I find the similar flag to prevent the machine scheduler touch the kernel loop?

I have debug some swp cases(hexagon), and find machine scheduler will re-schedule the SMS kernel loop. Why not add such a flag?

 

 

Best Regards,

 

Thanks


_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Reply | Threaded
Open this post in threaded view
|

Re: [llvm-dev] Some questions about software pipeline in LLVM 4.0.0

ORiordan, Martin via llvm-dev

Hi Ehsan,

 

In some cases modulo scheduling will insert copy instruction that end up as real copies in the final code.  It unavoidable in some cases.  For example, let’s say a instruction defining a value is scheduled in the first iteration, but one of its uses is scheduled two iterations later. In this case, the kernel needs to create a copy because there will be two values live in the kernel, from two other iterations.

 

        = R1     // use of the value from iteration n-2

        = R0     // use of the value from iteration n-1

  R0 = insn  // def at iteration n

  R1 = R0 

 

If all the uses of an instruction occur at most one iteration away, and the uses appear before the definition, then the copies should be coalesced away.

 

In the examples that you show below, it all depends in which iteration each instruction is scheduled and/or the order in which the instructions are scheduled.

  %vreg73<def> = PHI %vreg59, <BB#5>, %vreg62, <BB#6>;

  %vreg61<def> = INSN1 %vreg1, %vreg73;

  %vreg62<def> = INSN2 %vreg73, %vreg5;

  %vreg64<def> = INSN1 %vreg2, %vreg73;

 

For some reason, the instruction defining vreg64 was scheduled after the instruction defining vreg62, which causes the copy to be generated.  Then, the question is why did that happen? That can be hard to answer without seeing the debug output from the pipeliner. In what order were the instructions scheduled? I would assume that its either vreg73, vreg61, vreg64, and then vreg62 , or it’s the opposite order.  If that’s the case, then there was a cycle gap in between the scheduling of vreg61 and vreg64, so vreg62 was inserted in between them. Perhaps, there are multi-cycle latencies that left the hole?  Also, can multiple instructions be executed in parallel in the same cycle?

 

Let me know if any of that isn’t clear.  I apologize for the delay in replying to your original email.

 

Thanks,

Brendon

 

From: Ehsan Amiri [mailto:[hidden email]]
Sent: Monday, June 19, 2017 1:55 AM
To: Brendon Cahoon <[hidden email]>
Cc: [hidden email]
Subject: RE: [llvm-dev] Some questions about software pipeline in LLVM 4.0.0

 

Hi Brendon

 

Certainly, there are some real copies that end up being generated, but I think it’s better to exclude the copies from the schedule since most will be eliminated. 

 

I was wondering what was the cause of the real copies that was being generated in your experience? Something that I noticed when experimenting with LLVM on our out-of-tree backend, was that there are copy instructions generated **because of** modulo scheduling.

 

For example before modulo scheduling I have

 

%vreg6<def> = PHI %vreg23, <BB#1>, %vreg17

%vreg25<def> = INSN1 %vreg1, %vreg6;

% vreg26<def> = INSN1 %vreg2, %vreg6     ß same opcode as previous insn

% vreg17<def> = INSN2 %vreg6, %vreg5;

 

So for the phi node here, if we do phi elimination and register coalescing, we won’t have any copy insn left. But after modulo scheduling the instructions above, now appear like this:

 

%vreg73<def> = PHI %vreg59, <BB#5>, %vreg62, <BB#6>;

%vreg61<def> = INSN1 %vreg1, %vreg73;

%vreg62<def> = INSN2 %vreg73, %vreg5;

%vreg64<def> = INSN1 %vreg2, %vreg73;

 

Now if you look right after the third insn after modulo scheduling, both vreg73 and vreg62 are live here. So when we remove the corresponding phi instruction, we end up with a copy instruction that cannot be removed by register coalescing.

 

IIUC, this is a byproduct of modulo scheduling. I have not really started tuning modulo scheduling for our target, so I don’t know if this is a result of modulo scheduling not being tuned or not? Have you seen this type of Copy? Any insights are greatly appreciated.

 

Thanks

Ehsan

 

 


_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Reply | Threaded
Open this post in threaded view
|

Re: [llvm-dev] Some questions about software pipeline in LLVM 4.0.0

ORiordan, Martin via llvm-dev

Hi Brendon

 

Thanks for the answer. I completely agree with your comments. The main reason that I brought up this issue is the following: Inserting a COPY instruction that cannot be eliminated, means that the loop has an instruction that was not taken into account during modulo scheduling analysis. If we see these kind of copies frequently enough, do you think it is worthwhile to work on the algorithm, so these instructions are predicted and taken into account during the scheduling? Or maybe we already do this and I am not aware of it?

 

Some other remarks/questions:

 

IIUC, these kind of copies will be generated even if we implement SMS after register coalescing. Is this correct?

 

For us, so far we have enabled machine pipeliner for our backend and we see these kind of copy generated frequently for our workloads. Some times multiple copies inserted in a relatively small loop. IIUC, you don’t see it frequently though. Is that correct?

 

Thanks

Ehsan

 

 

 


From: Brendon Cahoon [[hidden email]]
Sent: Monday, June 26, 2017 6:22 PM
To: Ehsan Amiri
Cc: [hidden email]
Subject: RE: [llvm-dev] Some questions about software pipeline in LLVM 4.0.0

Hi Ehsan,

 

In some cases modulo scheduling will insert copy instruction that end up as real copies in the final code.  It unavoidable in some cases.  For example, let’s say a instruction defining a value is scheduled in the first iteration, but one of its uses is scheduled two iterations later. In this case, the kernel needs to create a copy because there will be two values live in the kernel, from two other iterations.

 

        = R1     // use of the value from iteration n-2

        = R0     // use of the value from iteration n-1

  R0 = insn  // def at iteration n

  R1 = R0 

 

If all the uses of an instruction occur at most one iteration away, and the uses appear before the definition, then the copies should be coalesced away.

 

In the examples that you show below, it all depends in which iteration each instruction is scheduled and/or the order in which the instructions are scheduled.

  %vreg73<def> = PHI %vreg59, <BB#5>, %vreg62, <BB#6>;

  %vreg61<def> = INSN1 %vreg1, %vreg73;

  %vreg62<def> = INSN2 %vreg73, %vreg5;

  %vreg64<def> = INSN1 %vreg2, %vreg73;

 

For some reason, the instruction defining vreg64 was scheduled after the instruction defining vreg62, which causes the copy to be generated.  Then, the question is why did that happen? That can be hard to answer without seeing the debug output from the pipeliner. In what order were the instructions scheduled? I would assume that its either vreg73, vreg61, vreg64, and then vreg62 , or it’s the opposite order.  If that’s the case, then there was a cycle gap in between the scheduling of vreg61 and vreg64, so vreg62 was inserted in between them. Perhaps, there are multi-cycle latencies that left the hole?  Also, can multiple instructions be executed in parallel in the same cycle?

 

Let me know if any of that isn’t clear.  I apologize for the delay in replying to your original email.

 

Thanks,

Brendon

 

From: Ehsan Amiri [[hidden email]]
Sent: Monday, June 19, 2017 1:55 AM
To: Brendon Cahoon <[hidden email]>
Cc: [hidden email]
Subject: RE: [llvm-dev] Some questions about software pipeline in LLVM 4.0.0

 

Hi Brendon

 

Certainly, there are some real copies that end up being generated, but I think it’s better to exclude the copies from the schedule since most will be eliminated. 

 

I was wondering what was the cause of the real copies that was being generated in your experience? Something that I noticed when experimenting with LLVM on our out-of-tree backend, was that there are copy instructions generated **because of** modulo scheduling.

 

For example before modulo scheduling I have

 

%vreg6<def> = PHI %vreg23, <BB#1>, %vreg17

%vreg25<def> = INSN1 %vreg1, %vreg6;

% vreg26<def> = INSN1 %vreg2, %vreg6     ß same opcode as previous insn

% vreg17<def> = INSN2 %vreg6, %vreg5;

 

So for the phi node here, if we do phi elimination and register coalescing, we won’t have any copy insn left. But after modulo scheduling the instructions above, now appear like this:

 

%vreg73<def> = PHI %vreg59, <BB#5>, %vreg62, <BB#6>;

%vreg61<def> = INSN1 %vreg1, %vreg73;

%vreg62<def> = INSN2 %vreg73, %vreg5;

%vreg64<def> = INSN1 %vreg2, %vreg73;

 

Now if you look right after the third insn after modulo scheduling, both vreg73 and vreg62 are live here. So when we remove the corresponding phi instruction, we end up with a copy instruction that cannot be removed by register coalescing.

 

IIUC, this is a byproduct of modulo scheduling. I have not really started tuning modulo scheduling for our target, so I don’t know if this is a result of modulo scheduling not being tuned or not? Have you seen this type of Copy? Any insights are greatly appreciated.

 

Thanks

Ehsan

 

 


_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Reply | Threaded
Open this post in threaded view
|

Re: [llvm-dev] Some questions about software pipeline in LLVM 4.0.0

ORiordan, Martin via llvm-dev

Hi Ehsan,

 

> If we see these kind of copies frequently enough, do you think it is worthwhile to work on the algorithm, so these instructions are predicted and taken into account during the scheduling?

 

if you do see them frequently, then it would be worthwhile to find a way to eliminate them. I’m not sure how easy it would be to predict them though when the MII is computed.  My guess is that you want some solution where the generated schedule doesn’t require the extra copies. Without some more information, it’s hard to see why the algorithm is generating a schedule that requires extra copies so frequently.

 

> Or maybe we already do this and I am not aware of it?

 

In some cases, the pipeliner does attempt to eliminate unnecessary copies.  For Hexagon, or any machine that allows multiple instructions per cycle, the pipeliner attempts to order the generated instructions to minimize copies. For example, Hexagon allows up to 4 instructions to execute “in parallel”.  So, in the same cycle, there can be a use of a register and a definition of that same register (the uses are read first, then the definition occurs). In effect, the use is for the value generated in the previous iteration. Since the pipeliner generates a linear list of instructions (i.e., serial semantics), the final order needs to make sure that the use is generated before the definition. Otherwise, an extra copy is generated.

 

> IIUC, these kind of copies will be generated even if we implement SMS after register coalescing. Is this correct?

 

That is correct.

 

> you don’t see it frequently though. Is that correct?

 

Correct. For Hexagon, we don’t see the extra copies very frequently.  In your earlier example though, it would be interesting to see why the algorithm puts the definition prior to the last use.  If possible, its better to schedule the definition after the last use. Of course, in some cases, that may/may not generate an efficient schedule.

 

Thanks,

Brendon

 

From: Ehsan Amiri [mailto:[hidden email]]
Sent: Tuesday, June 27, 2017 7:57 AM
To: Brendon Cahoon <[hidden email]>
Cc: [hidden email]
Subject: RE: [llvm-dev] Some questions about software pipeline in LLVM 4.0.0

 

Hi Brendon

 

Thanks for the answer. I completely agree with your comments. The main reason that I brought up this issue is the following: Inserting a COPY instruction that cannot be eliminated, means that the loop has an instruction that was not taken into account during modulo scheduling analysis. If we see these kind of copies frequently enough, do you think it is worthwhile to work on the algorithm, so these instructions are predicted and taken into account during the scheduling? Or maybe we already do this and I am not aware of it?

 

Some other remarks/questions:

 

IIUC, these kind of copies will be generated even if we implement SMS after register coalescing. Is this correct?

 

For us, so far we have enabled machine pipeliner for our backend and we see these kind of copy generated frequently for our workloads. Some times multiple copies inserted in a relatively small loop. IIUC, you don’t see it frequently though. Is that correct?

 

Thanks

Ehsan

 

 

 


_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev