Separating loop nests based on profile information?


Re: Separating loop nests based on profile information?

Daniel Berlin


On Mon, Jan 12, 2015 at 1:08 PM, Philip Reames <[hidden email]> wrote:

On 01/11/2015 10:35 PM, Daniel Berlin wrote:

Looking at what LLVM does, the failing on the PRE side is that our PRE/GVN models are not strong enough to handle this. I'm not sure at all why we think anything else is necessary. 

By this i mean we don't actually perform real PRE for loads. It's a known deficiency and one that does not require any duplicate heuristics to solve.
Right now we perform some availability-based removal that happens to be willing to do an insertion here or there to try to move a load in some cases (it will only do a single insertion).  It's also not based on real availability, but on finding the nearest dependencies of loads and squinting at them funny (which is not even really the same as finding where a load stops being available).

It's also not real PRE, just some code that is willing to insert one load to move a load.
Er, I think we have a terminology problem here.  From what I can gather, our current implementation *is* an implementation of classic PRE.

Scalar yes, load PRE, no.

What load PRE does is not the same as availability.
It finds the nearest dependencies (be they load/load, store/load, or clobber) for a load walking backwards.
That is not availability.

Availability is a forward problem, not a backwards one :)

It tries to transform the data it gets from memdep into something like availability info, but you *will not get the same results* in some cases.

  It's more or less directly organized around the classic "available" and "anticipated" concepts from the literature.  The *implementation* has some limitations, but the conceptual framework is there*. 

I agree the anticipation calculation is somewhat there, but even that is not a standard PRE one.

 
One of the key invariants in classic PRE is that you do not change program behaviour on any path. 
Well, it's more "do not increase the number of computations on any path", but even that gets violated in most standard PREs because they do LICM out of loops.
 
As a result, moving (or at most duplicating) loads is pretty much all that happens.

Sure.
 
 

* It took me a few hours of squinting at the code to recognize that weird backwards walk as being an implementation of "anticipated", but it is.  The "available" part comes from MemoryDependenceAnalysis and the forward dataflow algorithm based on those inputs.

MemoryDependenceAnalysis is a backwards walker; it does not compute availability of a load.
It computes the nearest dependencies.  We then try to take this and turn it into the result of a forward availability problem.
 

If you want to start inserting loads on paths which didn't previously have them - for example the loop preheader in your example in the previous email - this is not classic PRE as I'm familiar with it.  It can be very profitable and worthwhile, but is not classic PRE.   I'm not sure what this variant is known as in the literature. 

This is definitely a standard SSA-based PRE algorithm. Every single one I am familiar with will move code out of loops like this, be it scalar or loads, regardless of the fact that it does, in fact, increase the number of possible computations by one if the loop does not execute.

(If I'm being overly pedantic here, or just flat out wrong, please say so.  I'm still exploring related material and am trying to categorize things in my own head.)


I do not believe you will find an SSA-based PRE that will not do the above optimization by default.


I think we're actually in reasonable agreement that we do need to move beyond 'classic PRE' as I called it above.  If you have specific suggestions, positive or negative examples, or other thoughts to share, I'd very much value the input.


If you want this case to work, what you need is a real Load PRE implementation, not one based on simple memdep based load moving.

Trying to hack this one up with profile guided code duplication, etc, to get it to kinda sorta work seems to me to be a bad idea.
Not sure I really agree here, but I'm open to pushing as far as we can with static heuristics first.  I suspect that's likely to be more stable w.r.t. optimization effect and applicable to more cases. 

So, LLVM has been through three PRE implementations.
It had an implementation of e-path PRE, an implementation of GVNPRE (that only worked on scalar code, I believe) a long long time ago, and then the current one.

The current one is used because it's fast and catches a lot of cases.
The others were eliminated because they were essentially experimental research-quality implementations, and were slow :)

Because the current PRE is not value based, and it really has no idea about memory, you will run into limitations on loads pretty quickly (particularly loads indexed by things like induction variables).
It already tries to work around some of them by doing some actual value analysis of the load (and doing forwarding), but this only works on basic cases.

GVNPRE can be implemented and made a lot faster than it was (In GCC, it's actually one of the fastest high level optimization passes) with very careful engineering, and would take care of all of these cases.

I'm also fairly positive you could take any good anticipation-based PRE algorithm you can find, and I could make it value-based and still fast.

But in the end, pushing the current one seems like a bad idea to me because it wasn't designed for what you are trying to do - get good PRE results, and do good load PRE, and we know of better ways to do that.

All pushing it will do is push it out of the sweet spot it was designed for.

If you want to get into profile-guided placement or speculative PRE, it involves solving min-cut problems on very large graphs in most cases.

The only implementation I'm aware of that is relatively sane is http://webdocs.cs.ualberta.ca/~amaral/cascon/CDP05/slides/CDP05-horspool.pdf
and

(This does not require solving min-cut problems)



In the end though, you are doing the work, so you get to decide what you want to do :)


_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Re: Separating loop nests based on profile information?

Philip Reames-4
In reply to this post by Krzysztof Parzyszek
After re-reading the linked paper, I have to agree with Krzysztof. ISS
involves splitting range conditions on the induction variable. The
transform that I started this thread with works with any loop variant
conditional, not simply those based on the induction variable.  My
initial example was intended to convey this, but reading over it again,
I see why it might not have.

It's worth stating that index set splitting is a much better approach for
checks which *are* off the induction variable.  In fact, Sanjoy is
currently working on something similar for range check elimination.  
http://reviews.llvm.org/D6693

Philip

On 01/12/2015 10:55 AM, Krzysztof Parzyszek wrote:

> On 1/12/2015 12:04 AM, Daniel Berlin wrote:
>>
>> This is essentially index set splitting with the goal of it being an
>> enabling optimization for LICM.
>
> This is very different from index set splitting.  In ISS you know
> ahead of time into what ranges you can divide the values of the
> induction variable.  Here you don't.
>
> -Krzysztof
>


Re: Separating loop nests based on profile information?

Xinliang David Li-2
In reply to this post by Philip Reames-4


On Wed, Jan 7, 2015 at 5:19 PM, Philip Reames <[hidden email]> wrote:
I've been playing with approaches to getting better optimization of loops which contain infrequently executed slow paths.  I've gotten as far as throwing together a proof of concept implementation of a profile guided optimization to separate a single loop with multiple latches into a loop nest, but I want to get feedback from interested parties before investing much more effort. 

The primary case I'm concerned about looks something like this C example:
while (some_condition) {
  //do lots of work here
  if (!some_rare_condition) { <-- This is loop variant
    helper();
  }
}

The key point here is that the 'do lots of work' section contains things we could lift out of the loop, but our knowledge of memory gets completely destroyed by the 'helper' call which doesn't get inlined.  This ends up crippling LICM in particular, and GVN to a lesser degree. 

The approach I've been playing with would create this loop structure:
goto inner;
while (true) {
  outer:
  helper();
  inner:
  while (some_condition) {
    //do lots of work here
    if (!some_rare_condition) { <-- This is loop variant
      goto outer;
    }
  }
  break;
}

(Excuse the pseudo-C.  Hopefully the idea is clear.)

You can see my stalking horse implementation here:
http://reviews.llvm.org/D6872

This is not an actual patch per se; it's just intended to make the discussion more concrete and thus hopefully easier to follow. 

The basic idea of this patch is to use profiling information about the frequency of backedges to separate a loop with multiple latches into a loop nest with the rare backedge being the outer loop. We already use a set of static heuristics based on non-varying arguments to PHI nodes to do the same type of thing.

The motivation is that doing this pulls rarely executed code out of the innermost loop and tends to greatly simplify analysis and optimization of that loop.


This is a good thing to do from the code/block placement point of view -- to improve icache utilization. Code layout is one of the most important passes that can benefit from profile data. 

 

In particular, it can enable substantial LICM gains when there are clobbering calls down rare paths.


How can this enable more LICM? 
 

The original motivating case for this was the slow path of a safepoint poll, but I believe the possible benefit is more general as well.

Points for discussion:

  • Is using profile information for this purpose even a reasonable thing to do?

yes.
 
  • I chose to implement this without relying on the existing block frequency analysis. My reasoning was that a) this is a rarely taken case and adding an expensive analysis dependency probably wasn't worthwhile and b) that block frequency analysis was more expensive/precise than I really needed. Is this reasonable?

IMO no. Remember your pass should also work with real profile data (instrumentation or sample based) -- you should rely on existing profile infrastructure to provide what you need. (Ideally you should treat it as something that is always available for query.)
 
  • If so, is the notion of 'rareness' of a loop block something that's worth extracting out on its own and reusing? Are there other similar uses anyone can think of?

Please do not introduce new notions for profile information -- there are already many:)  You have block 'frequency' to represent intra-function relative hotness. Currently missing, but if we need to represent inter-procedural global hotness, something like bb execution 'count' may be needed. 

thanks,

David
 
  • Currently, I'm only supporting a fairly small set of controlling conditions. Are there important cases I'm not considering?
  • Since the rarest latch is often deep in a loop - with other "if (X) continue;" (i.e. latches) before it - this tends to create loops with multiple exiting blocks. Some of the existing passes might not deal with this well, is that a major concern? Suggestions for how to analyze and validate?
  • Currently, I've structured this as pulling off the rarest latch as an outer iteration. I could also pull off the most popular latch as an inner iteration. This might give different tradeoffs; thoughts?

Generally, any thoughts anyone has on the problem or approach are welcome. I'm not particularly attached to the particular approach laid out here, and if there's a more advantageous approach, all the better.







Re: Separating loop nests based on profile information?

Xinliang David Li-2
In reply to this post by Chandler Carruth-2


On Wed, Jan 7, 2015 at 5:33 PM, Chandler Carruth <[hidden email]> wrote:

On Wed, Jan 7, 2015 at 5:19 PM, Philip Reames <[hidden email]> wrote:
I've been playing with approaches to getting better optimization of loops which contain infrequently executed slow paths.  I've gotten as far as throwing together a proof of concept implementation of a profile guided optimization to separate a single loop with multiple latches into a loop nest, but I want to get feedback from interested parties before investing much more effort. 

The primary case I'm concerned about looks something like this C example:
while (some_condition) {
  //do lots of work here
  if (!some_rare_condition) { <-- This is loop variant
    helper();
  }
}

The key point here is that the 'do lots of work' section contains things we could lift out of the loop, but our knowledge of memory gets completely destroyed by the 'helper' call which doesn't get inlined.  This ends up crippling LICM in particular, and GVN to a lesser degree. 

The approach I've been playing with would create this loop structure:
goto inner;
while (true) {
  outer:
  helper();
  inner:
  while (some_condition) {
    //do lots of work here
    if (!some_rare_condition) { <-- This is loop variant
      goto outer;
    }
  }
  break;
}

(Excuse the pseudo-C.  Hopefully the idea is clear.)

Yep.

I've not thought about this a lot, but I have two initial questions that maybe you can answer:

How does this compare with classical approaches of loop peeling, partitioning, fission, or whatever you might call it? Is there any literature behind this approach or some literature it should be compared with? (I genuinely don't know this area that well, so I'm of little help here...)


It is not any of those loop transformations. It does not even change the control flow. It is more a code layout optimization -- move the cold trace in the loop out of the body.
 
 

Some of your points I have quick feedback on:

Points for discussion:

  • Is using profile information for this purpose even a reasonable thing to do?
Yes!
 
  • I chose to implement this without relying on the existing block frequency analysis. My reasoning was that a) this is a rarely taken case and adding an expensive analysis dependency probably wasn't worthwhile and b) that block frequency analysis was more expensive/precise than I really needed. Is this reasonable?
I think we should always use the analyses. Either BlockFrequency or BranchProbability. I think probably both in the common joint usage (frequency of the loop header combined with probability of the cold region).

yes.
 
  • If so, is the notion of 'rareness' of a loop block something that's worth extracting out on its own and reusing? Are there other similar uses anyone can think of?
  • Currently, I'm only supporting a fairly small set of controlling conditions. Are there important cases I'm not considering?
To both of these, I think the general combination to use is to identify the set of blocks dominated by a block which is in the loop body of a hot loop, and is cold relative to the other successors of its predecessor(s). These form cold "regions" as I think of them without requiring the complexity of the region analysis.
 

Static prediction should handle it  --  call heuristics or some combination with other heuristics (error handler recognition etc).


David
 
  • Since the rarest latch is often deep in a loop - with other "if (X) continue;" (i.e. latches) before it - this tends to create loops with multiple exiting blocks. Some of the existing passes might not deal with this well, is that a major concern? Suggestions for how to analyze and validate?
I'm somewhat concerned about this, but want to think more about the fundamental transformation.
 
  • Currently, I've structured this as pulling off the rarest latch as an outer iteration. I could also pull off the most popular latch as an inner iteration. This might give different tradeoffs; thoughts?

Generally, any thoughts anyone has on the problem or approach are welcome. I'm not particularly attached to the particular approach laid out here, and if there's a more advantageous approach, all the better.

Thanks for pushing on this! ;] Now I need to go and ponder a lot so I can reply more deeply on the actual transform.



Re: Separating loop nests based on profile information?

Xinliang David Li-2
In reply to this post by Philip Reames-4


On Thu, Jan 8, 2015 at 12:33 PM, Philip Reames <[hidden email]> wrote:

On 01/07/2015 05:33 PM, Chandler Carruth wrote:
How does this compare with classical approaches of loop peeling, partitioning, fission, or whatever you might call it?
I'm still developing a good sense for this, but let me lay out some observations.  Fair warning, this is going to be long and rambling.

Let's start with a toy example:
while(c) {
  x = this->x;
  y = this->y;
  if (x == y) {
    rare();
  }
}


With profile info, speculative PRE can be performed for memory accesses that are not downsafe:

temp_c = c;
if (temp_c) {
  x = this->x;
  y = this->y;
}
while (temp_c) {
  if (x == y) {
    rare();
    x = this->x;
    y = this->y;
  }
  temp_c = c;
}

If you can prove this->x etc. are safe to speculate across control flow (e.g., a dominating memory access with the same base and offset has been seen), it can be:

x = this->x;
y = this->y;
while (c) {
  if (x == y) {
    rare();
    x = this->x;
    y = this->y;
  }
}


 
If we could tell x and y were loop invariant, we could unswitch this loop.  However, the rare call clobbers our view of memory, so LICM fails, and thus unswitch fails.

We'd like to apply PRE here and push the reload into the loop preheader and the rare case.  This currently fails for two reasons: 1) We don't allow code duplication during PRE, and 2) the canonical loop-simplify form of having a single latch means that PRE doesn't actually see the rare block at all, it sees the preheader and the join point after the if block.

I think both of the previous are fixable:
1) This calls for profile guided duplication.  If we can trade one hot load for two cold ones, that's great.  Also, there might be static heuristics we could construct.  (e.g. if it's not available in the preheader and one of many latches...)
2) We need PRE to look back through merge blocks.  This should possibly even live in MemoryDependenceAnalysis.  Alternatively, we could create a "deconstruct loop simplify form" pass to run right before GVN to side step this issue.

Neither is entirely simple to implement, but both are doable. Probably.

In theory, we've now solved the same problem my loop nest transform does.  There's two snags: one minor(ish), one major.

The minor one is that many of my rare conditions involve volatile loads.  MemDepAnalysis currently completely gives up when faced with volatile loads.  This is mostly fixable, but given the structure of the code may be fairly invasive.  Oddly, LICM actually performs better in this case.

For the major one, let's tweak our example slightly:
while(c) {
  x = this->x;
  if (x == null) return;
  y = this->y;
  if (y == null) return;
  if (x == y) {
    rare();
  }
}

Let's assume that PRE worked as designed for 'x'.  We now need to perform a transform analogous to loop unswitch, but a bit different.  We need to move the conditional loop exits into each of the predecessors where the answer is not available.  I don't believe we have anything like this today, and it involves a non-trivial bit of code.  Even if we get that done, we're left with a pass ordering problem where we need to run PRE, then branch-PRE, then PRE, then branch-PRE, etc.  (In practice, these chains can be quite long due to null checks and range checks.)

If we'd split this into a loop nest structure, we'd still have the chain, but a) that's easier to control with a custom pass order since LICM and LoopUnswitch are far cheaper than GVN/PRE and b) having LICM do a bit of trivial loop unswitch for the terminator of the header appears quite tractable.


if (c) {
  x = this->x;
  if (!x) return;
  y = this->y;
  if (!y) return;
}
while (c) {
  if (x == y) {
    rare();
    x = this->x;
    if (!x) return;
    y = this->y;
    if (!y) return;
  }
}

The 'branch PRE' (hoisting, sinking, assertion prop and branch elimination) part is a little tricky to do though.

David

 

One other point worth mentioning is that LLVM does not currently have a loop peeling pass.  LoopRotate seems to get us most of the way there in common cases, but not always.  If we did peel the above loop once, we'd have the same merge block problem I mentioned previously (this time in the preheader *and* the latch), but the profitability becomes slightly more clear without the need for profile information or loop specific heuristics.

To be clear, I'm not opposed to fixing some of the issues above.  I think that's a great idea, I'm just trying to find my way through to practical performance improvements with minimal invested effort.  I probably will in fact fix some of these since they show up in a number of other cases as well.


Ok, so that's that.  Let's move on to why my proposed loop nest separation change is a bit scary.

While the loop nest does allow things to be lifted from the inner loop to the outer loop - without any of the complex PRE or PGO reasoning required above - it suffers from all of the limitations of the existing loop optimization passes.  In particular, LICM may not actually be able to do much if the inner loop is still complicated enough to defeat either its fairly simple alias analysis or fault analysis.  GVN/PRE is strictly more powerful - in theory if not always practice - here than what we have in LICM.

We've also potentially created a loop nest which is harder to analyse than the original loop with two latches.  Consider this example:
for (int i = 0; i < 20000;) {
  i++;
  if (i % 20 != 0) continue;
  a[i] = 0;
}
(This is not a real example of anything useful.  The important point is the early return is the common case.)

This loop clearly runs exactly 20000 times.  ScalarEvolution almost certainly gets this right. (I haven't checked.)

After separation, we'd have this:

int i = 0;
while (true) {
  while (i < 20000) {
    i++;
    if (i % 20 == 0) { // now a loop exit
      a[i] = 0;
      goto next;
    }
  }
  break;
  next:;
}

It's not clear to me that SE will always be able to tell that both the inner and outer loops run a maximum of 20000 times.  (In this particular case it might actually optimize the inner loop substantially and thus get a more precise answer, but let's ignore that.)  We could potentially lose information about the loop by doing this transformation.  Of course, this concern applies to the *existing* loop nest separation heuristics too.  I don't have any particular reason to believe one heuristic is worse than the other, but I also have no reason not to believe that either.

As a final point, the inner loop now has two exiting blocks not one.  I know at least some of our loop transforms give up on anything with more than one loop exit.  I'm less concerned by that since a) we should fix them, and b) the cases I've looked at don't look too hard.


And that's everything that occurs to me at the moment.  I'll probably have to append in further responses as I remember more details, I know I'm forgetting some pieces.

Philip





Re: Separating loop nests based on profile information?

Xinliang David Li-2
In reply to this post by Philip Reames-4


On Mon, Jan 12, 2015 at 10:08 AM, Philip Reames <[hidden email]> wrote:

On 01/11/2015 10:35 PM, Daniel Berlin wrote:

Looking at what LLVM does, the failing on the PRE side is that our PRE/GVN models are not strong enough to handle this. I'm not sure at all why we think anything else is necessary. 

By this i mean we don't actually perform real PRE for loads. It's a known deficiency and one that does not require any duplicate heuristics to solve.
Right now we perform some availability-based removal that happens to be willing to do an insertion here or there to try to move a load in some cases (it will only do a single insertion).  It's also not based on real availability, but on finding the nearest dependencies of loads and squinting at them funny (which is not even really the same as finding where a load stops being available).

It's also not real PRE, just some code that is willing to insert one load to move a load.
Er, I think we have a terminology problem here.  From what I can gather, our current implementation *is* an implementation of classic PRE.  It's more or less directly organized around the classic "available" and "anticipated" concepts from the literature.  The *implementation* has some limitations, but the conceptual framework is there*.  One of the key invariants in classic PRE is that you do not change program behaviour on any path.  As a result, moving (or at most duplicating) loads is pretty much all that happens.

* It took me a few hours of squinting at the code to recognize that weird backwards walk as being an implementation of "anticipated", but it is.  The "available" part comes from MemoryDependenceAnalysis and the forward dataflow algorithm based on those inputs.

If you want to start inserting loads on paths which didn't previously have them - for example the loop preheader in your example in the previous email - this is not classic PRE as I'm familiar with it.  It can be very profitable and worthwhile, but is not classic PRE.   I'm not sure what this variant is known as in the literature. 

(If I'm being overly pedantic here, or just flat out wrong, please say so.  I'm still exploring related material and am trying to categorize things in my own head.)

I think we're actually in reasonable agreement that we do need to move beyond 'classic PRE' as I called it above.  If you have specific suggestions, positive or negative examples, or other thoughts to share, I'd very much value the input. 

If you want this case to work, what you need is a real Load PRE implementation, not one based on simple memdep based load moving.

Trying to hack this one up with profile guided code duplication, etc, to get it to kinda sorta work seems to me to be a bad idea.
Not sure I really agree here, but I'm open to pushing as far as we can with static heuristics first.  I suspect that's likely to be more stable w.r.t. optimization effect and applicable to more cases. 


I agree with Danny.  Profile data is useful for code placement, and for profitability computation in speculative PRE. I don't see how your proposed code movement can be an enabler for more PRE.

David
  
It's certainly not requiring special code duplication heuristics, etc.

So either you are thinking of a case that is different from the above, or I am seriously confused :)




Re: Separating loop nests based on profile information?

Adam Nemet
In reply to this post by Philip Reames-4

On Jan 8, 2015, at 12:37 PM, Philip Reames <[hidden email]> wrote:


On 01/07/2015 06:42 PM, Adam Nemet wrote:


How does this compare with classical approaches of loop peeling, partitioning, fission, or whatever you might call it? Is there any literature behind this approach or some literature it should be compared with? (I genuinely don't know this area that well, so I'm of little help here…)

I think it’s pretty different from loop distribution/fission.  In loop distribution, in order to split the loop into multiple consecutive loops, you need to reason about the memory references, so having a function call makes that difficult/impossible.  Philip’s idea works around this exact issue.

I explored this space quite a bit recently because we’re working on a proposal to add loop distribution to LLVM.  I hope to send out the proposal this week.
Just to note, I'm going to be very interested in seeing this.  I have concerns about profitability, but having a mechanism is clearly a good thing.  :)
Are you also looking at loop fusion?  I've seen that come up in practice for a few cases where it would *really* help.  Generally, filter/reduce patterns implemented by hand.  Nowhere near the top of my priority list, but if you happened to be working on it, I'm happy to help review and brainstorm. 

Yes, we’re planning to look at fusion next.  Thanks for all your feedback so far!

As to your example, do you mean chain of operations functional-programming style?  E.g. with OO syntax:

list.map(func1).map(func2).filter(func3).reduce(func4), where these could be squashed into one loop.

Thanks,
Adam

So I see no issue with trying to handle loops with low-probablity function calls with this and fully self-contained loops with classical loop distribution.

Thanks,
Adam

Some of your points I have quick feedback on:

Points for discussion:

  • Is using profile information for this purpose even a reasonable thing to do?
Yes!
 
  • I chose to implement this without relying on the existing block frequency analysis. My reasoning was that a) this is a rarely taken case and adding an expensive analysis dependency probably wasn't worthwhile and b) that block frequency analysis was more expensive/precise than I really needed. Is this reasonable?
I think we should always use the analyses. Either BlockFrequency or BranchProbability. I think probably both in the common joint usage (frequency of the loop header combined with probability of the cold region).
  • If so, is the notion of 'rareness' of a loop block something that's worth extracting out on its own and reusing? Are there other similar uses anyone can think of?
  • Currently, I'm only supporting a fairly small set of controlling conditions. Are there important cases I'm not considering?
To both of these, I think the general combination to use is to identify the set of blocks dominated by a block which is in the loop body of a hot loop, and is cold relative to the other successors of its predecessor(s). These form cold "regions" as I think of them without requiring the complexity of the region analysis.
 
  • Since the rarest latch is often deep in a loop - with other "if (X) continue;" (i.e. latches) before it - this tends to create loops with multiple exiting blocks. Some of the existing passes might not deal with this well, is that a major concern? Suggestions for how to analysis and validate?
I'm somewhat concerned about this, but want to think more about the fundamental transformation.
 
  • Currently, I've structured this as pulling off the rarest latch as an outer iteration. I could also pull off the most popular latch as an inner iteration. This might give different tradeoffs; thoughts?

Generally, any thoughts anyone have on the problem or approach are welcome. I'm not particular attached to the particular approach laid out here and if there's a more advantageous approach, all the better.

Thanks for pushing on this! ;] Now I need to go and ponder a lot so i can reply more deeply on the actual transform.

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev




_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Re: Separating loop nests based on profile information?

Philip Reames-4

On 01/15/2015 09:59 AM, Adam Nemet wrote:

On Jan 8, 2015, at 12:37 PM, Philip Reames <[hidden email]> wrote:


On 01/07/2015 06:42 PM, Adam Nemet wrote:


How does this compare with classical approaches of loop peeling, partitioning, fission, or whatever you might call it? Is there any literature behind this approach or some literature it should be compared with? (I genuinely don't know this area that well, so I'm of little help here…)

I think it’s pretty different from loop distribution/fission.  In loop distribution, in order to split the loop into multiple consecutive loops, you need to reason about the memory references, so having a function call makes that difficult/impossible.  Philip’s idea works around this exact issue.

I explored this space quite a bit recently because we’re working on a proposal to add loop distribution to LLVM.  I hope to send out the proposal this week.
Just to note, I'm going to be very interested in seeing this.  I have concerns about profitability, but having a mechanism is clearly a good thing.  :)
Are you also looking at loop fusion?  I've seen that come up in practice for a few cases where it would *really* help.  Generally, filter/reduce patterns implemented by hand.  Nowhere near the top of my priority list, but if you happened to be working on it, I'm happy to help review and brainstorm. 

Yes, we’re planning to look at fusion next.  Thanks for all your feedback so far!

As to your example, do you mean chain of operations functional-programming style?  E.g. with OO syntax:

list.map(func1).map(func2).filter(func3).reduce(func4), where these could be squashed into one loop.
That's exactly what I was thinking of. 

Thanks,
Adam

So I see no issue with trying to handle loops with low-probability function calls with this and fully self-contained loops with classical loop distribution.

Thanks,
Adam

Some of your points I have quick feedback on:

Points for discussion:

  • Is using profile information for this purpose even a reasonable thing to do?
Yes!
 
  • I chose to implement this without relying on the existing block frequency analysis. My reasoning was that a) this is a rarely taken case and adding an expensive analysis dependency probably wasn't worthwhile and b) that block frequency analysis was more expensive/precise than I really needed. Is this reasonable?
I think we should always use the analyses. Either BlockFrequency or BranchProbability. I think probably both in the common joint usage (frequency of the loop header combined with probability of the cold region).
  • If so, is the notion of 'rareness' of a loop block something that's worth extracting out on its own and reusing? Are there other similar uses anyone can think of?
  • Currently, I'm only supporting a fairly small set of controlling conditions. Are there important cases I'm not considering?
To both of these, I think the general combination to use is to identify the set of blocks dominated by a block which is in the loop body of a hot loop, and is cold relative to the other successors of its predecessor(s). These form cold "regions" as I think of them without requiring the complexity of the region analysis.
 
  • Since the rarest latch is often deep in a loop - with other "if (X) continue;" (i.e. latches) before it - this tends to create loops with multiple exiting blocks. Some of the existing passes might not deal with this well; is that a major concern? Suggestions for how to analyze and validate?
I'm somewhat concerned about this, but want to think more about the fundamental transformation.
 
  • Currently, I've structured this as pulling off the rarest latch as an outer iteration. I could also pull off the most popular latch as an inner iteration. This might give different tradeoffs; thoughts?

Generally, any thoughts anyone has on the problem or approach are welcome. I'm not particularly attached to the particular approach laid out here, and if there's a more advantageous approach, all the better.

Thanks for pushing on this! ;] Now I need to go and ponder a lot so i can reply more deeply on the actual transform.

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev






Load PRE in loops with 'mostly' invariant loads

Philip Reames-4
In reply to this post by Philip Reames-4
(Splitting this out from "Re: [LLVMdev] Separating loop nests based on
profile information?" since it's now a fairly different topic.)

On 01/09/2015 11:39 AM, Philip Reames wrote:

> On 01/08/2015 12:33 PM, Philip Reames wrote:
>> 2) We need PRE to look back through merge blocks.  This should
>> possibly even live in MemoryDependenceAnalysis.  Alternatively, we
>> could create a "deconstruct loop simplify form" pass to run right
>> before GVN to side step this issue.
> Point 2 here may be inaccurate.  I've definitely seen problems here in
> the past, but when I sat down to write some test cases, it looks like
> the simple merge block for latch (with the load being considered in
> the loop header) works just fine.  I suspect my original example is
> actually blocked by something else; I'm just not sure what that is yet.
I went back and looked.  One - of possibly several - cases that were
giving me trouble was when we had a loop with multiple latches before
loop-simplify, loop simplify creates a unified latch, and we had not
peeled the first iteration of the loop when we get to PRE. This creates
a tree of merges that PRE doesn't deal with.

I took a stab at addressing this with a PRE patch.  It can be found
here: http://reviews.llvm.org/D7061.  Comments and discussion welcome.

As I should have expected, it turned out to be quite a bit more involved
than it first looked.  :)

Philip

Re: Separating loop nests based on profile information?

Philip Reames-4
In reply to this post by Xinliang David Li-2

On 01/12/2015 07:27 PM, Xinliang David Li wrote:


On Wed, Jan 7, 2015 at 5:19 PM, Philip Reames <[hidden email]> wrote:
The basic idea of this patch is to use profiling information about the frequency of backedges to separate a loop with multiple latches into a loop nest with the rare backedge being the outer loop. We already use a set of static heuristics based on non-varying arguments to PHI nodes to do the same type of thing.

The motivation is that doing this pulls rarely executed code out of the innermost loop and tends to greatly simplify analysis and optimization of that loop.


This is good thing to do from the code/block placement point of view -- to improve icache utilization. Code layout is one of the most important passes that can benefit from profile data.
We already try to do that using profile information at a later point.  I don't know how effective that is and how much room we might have to improve, but that really wasn't a motivation for this approach. 

 

In particular, it can enable substantial LICM gains when there are clobbering calls down rare paths.


How can this enable more LICM?
Given you now have a loop nest with a simpler inner loop, we can sometimes perform LICM on that inner loop.

Note that this is equivalent to running a smarter PRE on the original loop structure.
 
  • I chose to implement this without relying on the existing block frequency analysis. My reasoning was that a) this is a rarely taken case and adding an expensive analysis dependency probably wasn't worthwhile and b) that block frequency analysis was more expensive/precise than I really needed. Is this reasonable?

IMO no. Remember your pass should also work with real profile data (instrumentation or sample based) -- you should  rely on existing profile infrastructure to provide what you need. (Ideally you should treat it as something that is always available for query).
There's a distinction here between 'source of information' and 'source of heuristics'.  By accessing the metadata directly, I'm still using whatever profile information the user provided. 

I'm currently exploring other approaches to this problem, but if I do end up exploring this further, I'll likely rewrite to use one of the existing analyses. 

Philip


Re: Separating loop nests based on profile information?

Philip Reames-4
In reply to this post by Xinliang David Li-2

On 01/12/2015 07:39 PM, Xinliang David Li wrote:


On Wed, Jan 7, 2015 at 5:33 PM, Chandler Carruth <[hidden email]> wrote:


How does this compare with classical approaches of loop peeling, partitioning, fission, or whatever you might call it? Is there any literature behind this approach or some literature it should be compared with? (I genuinely don't know this area that well, so I'm of little help here...)


It is not any of those loop transformations. It does not even change the control flow. It is more a code layout optimization -- move the cold trace in the loop out of the body.
It's also a somewhat major change in the canonical form of the loop as seen by the optimizer.  That's probably the more important part in practice. 
 

  • If so, is the notion of 'rareness' of a loop block something that's worth extracting out on its own and reusing? Are there other similar uses anyone can think of?
  • Currently, I'm only supporting a fairly small set of controlling conditions. Are there important cases I'm not considering?
To both of these, I think the general combination to use is to identify the set of blocks dominated by a block which is in the loop body of a hot loop, and is cold relative to the other successors of its predecessor(s). These form cold "regions" as I think of them without requiring the complexity of the region analysis.
 

Static prediction should handle it  --  call heuristics or some combination with other heuristics (error handler recognition etc).

Sorry, I'm really not sure what you're trying to say here.  Could you clarify/expand? 


Re: Separating loop nests based on profile information?

Philip Reames-4
In reply to this post by Xinliang David Li-2

On 01/12/2015 08:28 PM, Xinliang David Li wrote:


On Thu, Jan 8, 2015 at 12:33 PM, Philip Reames <[hidden email]> wrote:

Let's start with a toy example:
while(c) {
  x = this->x;
  y = this->y;
  if (x == y) {
    rare();
  }
}


With profile info, speculative PRE can be performed for memory accesses that are not downsafe:
Could you define the term "downsafe"?  I think I know what you mean, but just to be clear. 

temp_c = c;
if (temp_c)
{
    x = this->x;
    y = this->y;
}
while (temp_c) {
   if (x == y)
     {
          rare();
          x = this->x;
          y = this->y;
      }
   temp_c = c;
}

If you can prove this->x etc. are safe to speculate over control flow (e.g., you've seen a dominating memory access with the same base and offset), it can be 

x = this->x;
y = this->y;
while (c) {
   if (x == y) {
      rare();
      x = this->x;
      y = this->y;
    }
 }
Yep.  In LLVM, this basically requires that this->FIELD be known dereferenceable.  I filed one simple bug for this here: http://llvm.org/bugs/show_bug.cgi?id=22266

I've also looked at a more general rewrite of our PRE algorithm for the case where a pointer is known dereferenceable, but I haven't figured out how to judge profitability accurately (yet).  The general approach was to take the cut provided by MDA, apply "obvious" improvements (i.e. local merge points), then insert loads in all the unavailable blocks if it looked profitable. 

The alternate approach did have the appeal of being "easy" in cases where our current approach (work backwards from load) is "hard". 

One challenge is in making sure the two algorithms together generate a stable final form.  :)

That last part is where I stalled.  I'm trying to see what else I can do with simpler things before returning to that approach. 

Nick Lewycky also pointed out that PHITranslation is problematic in the current code.  Oddly, I've never seen that in practice.  We clearly have different workloads, but I haven't figured out which of the differences are the important ones yet. 




 
If we'd split this into a loop nest structure, we'd still have the chain, but a) that's easier to control with a custom pass order since LICM and LoopUnswitch are far cheaper than GVN/PRE and b) having LICM do a bit of trivial loop unswitch for the terminator of the header appears quite tractable.


if (c)
{
   x = this->x;
   if (!x) return;
   y = this->y;
   if (!y) return;
}
while (c) {
  if (x == y) {
      rare();
      if (!x) return;
      if (!y) return;
   }
}

The 'branch PRE' (hoisting, sinking, assertion prop and branch elimination) part is a little tricky to do though.
I think "a little tricky" might be an understatement here. 

At least to start with, I'd probably try to handle simple cases.  Even just doing trivial loop unswitch in the loop with LoadPRE would unlock a lot.  (Right now, I'm just running LICM and Unswitch in a loop.  It's not that expensive, gets a lot of cases, but doesn't get everything.)

Philip


Re: Separating loop nests based on profile information?

Philip Reames-4
In reply to this post by Krzysztof Parzyszek
This is correct.  JumpThreading would pretty much undo this
transformation.  Oddly, SimplifyCFG does not, even though it probably
could (and possibly should).

Philip

On 01/12/2015 11:40 AM, Krzysztof Parzyszek wrote:

> Upon closer inspection, the loop will not become irreducible.  
> However, the result will be equivalent to reloading the hoisted values
> after calling "helper".  In fact, that's what may happen if some CFG
> optimization cleans this up.
>
> -Krzysztof
>
>
> On 1/12/2015 1:00 PM, Krzysztof Parzyszek wrote:
>> On 1/7/2015 7:19 PM, Philip Reames wrote:
>>>
>>> The approach I've been playing with would create this loop structure:
>>> goto inner
>>> while (true) {
>>>    outer:
>>>    helper();
>>>    inner:
>>>    while( some_condition )
>>>      //do lots of work here
>>>      if (!some_rare_condition) { <-- This is loop variant
>>>        goto outer;
>>>      }
>>>    }
>>> }
>>>
>>
>> Anything that changes loop structure into something like this, is a
>> fundamentally bad idea.  This would make the outer loop irreducible,
>> which should be avoided.
>>
>> A lot better approach would be to privatize the invariant objects, and
>> write them back to memory before/after the "helper" is invoked:
>>
>>      T1 = p->x;
>>      T2 = P->y;
>>      while( some_condition )
>>        //do lots of work here, use T1 and T2
>>        if (some_rare_condition) { <-- This is loop variant
>>          p->x = T1;
>>          p->y = T2;
>>          helper();
>>          T1 = p->x;
>>          T2 = p->y;
>>        }
>>      }
>>      p->x = T1;
>>      p->y = T2;
>>
>>
>> -Krzysztof
>>
>
>


Re: Separating loop nests based on profile information?

Xinliang David Li-2
In reply to this post by Philip Reames-4


On Mon, Jan 19, 2015 at 5:41 PM, Philip Reames <[hidden email]> wrote:

On 01/12/2015 07:27 PM, Xinliang David Li wrote:


On Wed, Jan 7, 2015 at 5:19 PM, Philip Reames <[hidden email]> wrote:
The basic idea of this patch is to use profiling information about the frequency of backedges to separate a loop with multiple latches into a loop nest with the rare backedge being the outer loop. We already use a set of static heuristics based on non-varying arguments to PHI nodes to do the same type of thing.

The motivation is that doing this pulls rarely executed code out of the innermost loop and tends to greatly simplify analysis and optimization of that loop.


This is good thing to do from the code/block placement point of view -- to improve icache utilization. Code layout is one of the most important passes that can benefit from profile data.
We already try to do that using profile information at a later point.  I don't know how effective that is and how much room we might have to improve, but that really wasn't a motivation for this approach. 

 

In particular, it can enable substantial LICM gains when there are clobbering calls down rare paths.


How can this enable more LICM?
Given you now have a loop nest with a simpler inner loop, we can sometimes perform LICM on that inner loop.

Note that this is equivalent to running a smarter PRE on the original loop structure.


yes -- it works around the weakness of the existing PRE implementation.
  


 
  • I chose to implement this without relying on the existing block frequency analysis. My reasoning was that a) this is a rarely taken case and adding an expensive analysis dependency probably wasn't worthwhile and b) that block frequency analysis was more expensive/precise than I really needed. Is this reasonable?

IMO no. Remember your pass should also work with real profile data (instrumentation or sample based) -- you should  rely on existing profile infrastructure to provide what you need. (Ideally you should treat it as something that is always available for query).
There's a distinction here between 'source of information' and 'source of heuristics'.  By accessing the metadata directly, I'm still using whatever profile information the user provided. 


By accessing metadata directly, you basically bypass the static branch prediction and only make use of real profile data or data from user annotation. 
 
I'm currently exploring other approaches to this problem, but if I do end up exploring this further, I'll likely rewrite to use one of the existing analyses. 

David
 

Philip



Re: Separating loop nests based on profile information?

Xinliang David Li-2
In reply to this post by Philip Reames-4


On Mon, Jan 19, 2015 at 5:43 PM, Philip Reames <[hidden email]> wrote:

On 01/12/2015 07:39 PM, Xinliang David Li wrote:


On Wed, Jan 7, 2015 at 5:33 PM, Chandler Carruth <[hidden email]> wrote:


How does this compare with classical approaches of loop peeling, partitioning, fission, or whatever you might call it? Is there any literature behind this approach or some literature it should be compared with? (I genuinely don't know this area that well, so I'm of little help here...)


It is not any of those loop transformations. It does not even change the control flow. It is more a code layout optimization -- move the cold trace in the loop out of the body.
It's also a somewhat major change in the canonical form of the loop as seen by the optimizer.  That's probably the more important part in practice. 

right. 
 

  • If so, is the notion of 'rareness' of a loop block something that's worth extracting out on its own and reusing? Are there other similar uses anyone can think of?
  • Currently, I'm only supporting a fairly small set of controlling conditions. Are there important cases I'm not considering?
To both of these, I think the general combination to use is to identify the set of blocks dominated by a block which is in the loop body of a hot loop, and is cold relative to the other successors of its predecessor(s). These form cold "regions" as I think of them without requiring the complexity of the region analysis.
 

Static prediction should handle it  --  call heuristics or some combination with other heuristics (error handler recognition etc).

Sorry, I'm really not sure what you're trying to say here.  Could you clarify/expand? 

It is more of a comment about identifying rare blocks -- there are cases where static heuristics can produce branch prediction with high confidence. 

David



Re: Separating loop nests based on profile information?

Xinliang David Li-2
In reply to this post by Philip Reames-4


On Mon, Jan 19, 2015 at 5:55 PM, Philip Reames <[hidden email]> wrote:

On 01/12/2015 08:28 PM, Xinliang David Li wrote:


On Thu, Jan 8, 2015 at 12:33 PM, Philip Reames <[hidden email]> wrote:

Let's start with a toy example:
while(c) {
  x = this->x;
  y = this->y;
  if (x == y) {
    rare();
  }
}


With profile info, speculative PRE can be performed for memory accesses that are not downsafe:
Could you define the term "downsafe"?  I think I know what you mean, but just to be clear. 


temp_c = c;
if (temp_c)
{
    x = this->x;
    y = this->y;
}
while (temp_c) {
   if (x == y)
     {
          rare();
          x = this->x;
          y = this->y;
      }
   temp_c = c;
}

If you can prove this->x etc. are safe to speculate over control flow (e.g., you've seen a dominating memory access with the same base and offset), it can be 

x = this->x;
y = this->y;
while (c) {
   if (x == y) {
      rare();
      x = this->x;
      y = this->y;
    }
 }
Yep.  In LLVM, this basically requires that this->FIELD be known dereferenceable.  I filed one simple bug for this here: http://llvm.org/bugs/show_bug.cgi?id=22266

What you proposed in the bug is basically a profitability heuristic for speculative PRE.
 


I've also looked at a more general rewrite of our PRE algorithm for the case where a pointer is known dereferenceable, but I haven't figured out how to judge profitability accurately (yet).  The general approach was to take the cut provided by MDA, apply "obvious" improvements (i.e. local merge points), then insert loads in all the unavailable blocks if it looked profitable. 

See also paper "Register promotion by sparse partial redundancy elimination of loads and stores"  for a simple way of evaluating speculation savings.
 

The alternate approach did have the appeal of being "easy" in cases where our current approach (work backwards from load) is "hard". 

One challenge is in making sure the two algorithms together generate a stable final form.  :)

That last part is where I stalled.  I'm trying to see what else I can do with simpler things before returning to that approach. 

Nick Lewycky also pointed out that PHITranslation is problematic in the current code.  Oddly, I've never seen that in practice.  We clearly have different workloads, but I haven't figured out which of the differences are the important ones yet. 




 
If we'd split this into a loop nest structure, we'd still have the chain, but a) that's easier to control with a custom pass order since LICM and LoopUnswitch are far cheaper than GVN/PRE and b) having LICM do a bit of trivial loop unswitch for the terminator of the header appears quite tractable.


if (c)
{
   x = this->x;
   if (!x) return;
   y = this->y;
   if (!y) return;
}
while (c) {
  if (x == y) {
      rare();
      if (!x) return;
      if (!y) return;
   }
}

The 'branch PRE' (hoisting, sinking, assertion prop and branch elimination) part is a little tricky to do though.
I think "a little tricky" might be an understatement here. 

agreed.
 

At least to start with, I'd probably try to handle simple cases.  Even just doing trivial loop unswitch in the loop with LoadPRE would unlock a lot.  (Right now, I'm just running LICM and Unswitch in a loop.  It's not that expensive, gets a lot of cases, but doesn't get everything.)

Have fun with it :)

David 


Philip



Re: Separating loop nests based on profile information?

Philip Reames-4
In reply to this post by Daniel Berlin
On 01/12/2015 01:11 PM, Daniel Berlin wrote:

On Mon, Jan 12, 2015 at 1:08 PM, Philip Reames <[hidden email]> wrote:

On 01/11/2015 10:35 PM, Daniel Berlin wrote:
Looking at what LLVM does, the failing on the PRE side is that our PRE/GVN models are not strong enough to handle this. I'm not sure at all why we think anything else is necessary. 

By this i mean we don't actually perform real PRE for loads. It's a known deficiency and one that does not require any duplicate heuristics to solve.
Right now we perform some availability-based removal that happens to be willing to do an insertion here or there to try to move a load in some cases (it will only do a single insertion).  It's also not based on real availability, but on finding the nearest dependencies of loads and squinting at them funny (which is not even really the same as finding where a load stops being available).

It's also not real PRE, just some code that is willing to insert one load to move a load.
Er, I think we have a terminology problem here.  From what I can gather, our current implementation *is* an implementation of classic PRE.

Scalar yes, load PRE, no.
Can you define specifically what you mean by "load PRE"?  To me, load PRE is simply PRE on loads.  It sounds like you're saying something slightly different. 

What load PRE does is not the same as availability.
It finds the nearest dependencies (be they load/load, store/load, or clobber) for a load walking backwards.
That is not availability.

Availability is a forward problem, not a backwards one :)

It tries to transform the data it gets from memdep into something like availability info, but you *will not get the same results* in some cases.
I would argue that the current implementation is calculating availability.  It's simply doing so in a conservative manner and thus gives up some quality in the results achieved.  The notion of availability (i.e. "there's a load available along all paths that reach this basic block") is the same.

Just to make sure we're on the same page, the reason why the backwards calculation is potentially less useful than the forward one is that we unconditionally stop when encountering a def.  There might have been an earlier def along that path and thus possibly a cheaper location to put a load.  Is that the only case?  Or is there something else I haven't seen yet?

* It took me a few hours of squinting at the code to recognize that weird backwards walk as being an implementation of "anticipated", but it is.  The "available" part comes from MemoryDependenceAnalysis and the forward dataflow algorithm based on those inputs.

MemoryDependenceAnalysis is a backwards walker; it does not compute availability of a load.
It computes the nearest dependencies.  We then try to take this and turn it into the result of a forward availability problem.
(See above comment)
 

If you want to start inserting loads on paths which didn't previously have them - for example the loop preheader in your example in the previous email - this is not classic PRE as I'm familiar with it.  It can be very profitable and worthwhile, but is not classic PRE.   I'm not sure what this variant is known as in the literature. 

This is definitely a standard SSA-based PRE algorithm. Every single one I am familiar with will move code out of loops like this, be it scalars or loads, regardless of the fact that doing so does, in fact, increase the number of possible computations by one if the loop does not execute. 
I'd written a response on the technicalities here, then realized it really didn't matter.  From a practical perspective, we definitely want our PRE implementation to pull loads out of loops when legal, even if that does introduce a load along a path it didn't exist previously.  (Note, there is that pesky legality issue.) 

I've got a patch up for discussion right now in exactly this vein: http://reviews.llvm.org/D7061

If you want this case to work, what you need is a real Load PRE implementation, not one based on simple memdep based load moving.

Trying to hack this one up with profile guided code duplication, etc, to get it to kinda sorta work seems to me to be a bad idea.
Not sure I really agree here, but I'm open to pushing as far as we can with static heuristics first.  I suspect that's likely to be more stable w.r.t. optimization effect and applicable to more cases. 

So, LLVM has been through three PRE implementations.
It had an implementation of e-path PRE, an implementation of GVNPRE (which I believe only worked on scalar code) a long, long time ago, and then the current one.

The current one is used because it's fast and catches a lot of cases.
The others were eliminated because they were essentially experimental research-quality implementations, and were slow :)
After looking more at both the current implementation and the literature, I've largely accepted we need a more general PRE algorithm.  Unfortunately, I really don't have the background to implement such a thing today.  I can gain it, but that's an investment of time I'm not sure I can justify right now.  I'm slowly ramping up, but it's taking a while. 

Because the current PRE is not value based, and it really has no idea about memory, you will run into limitations on loads pretty quickly (particularly loads indexed by things like induction variables).
It already tries to work around some of them by doing some actual value analysis of the load (and doing forwarding), but this only works on basic cases.
Ok, again we've run into a terminology problem.  I don't understand what you mean by "is not value based" and "has no idea about memory". 

Given that we're both in the Mountain View area, would you be interested in just meeting for lunch or a beer?  Trying to go back and forth in writing is going to be far slower than talking in person. 

GVNPRE can be implemented and made a lot faster than it was (in GCC, it's actually one of the fastest high-level optimization passes) with very careful engineering, and would take care of all of these cases.

I'm also fairly positive you could take any good anticipation based PRE algorithm you can find, and i could make it value based and still fast.

But in the end, pushing the current one seems like a bad idea to me because it wasn't designed for what you are trying to do - get good PRE results, and do good load PRE, and we know of better ways to do that.

All pushing it will do is push it out of the sweet spot it was designed for.
(As I said above, I've largely convinced myself you're right.  I do want to extend the current algorithm in a few, hopefully narrow, ways, but in the mid to long run, we'll need something better.)

If you want to get into profile-guided placement or speculative PRE, it involves solving min-cut problems on very large graphs in most cases.
Yeah, I'd come to this conclusion myself. 

Now, purely for the sake of argument, what's wrong with that?  Given you can use max flow to solve min-cut in O(V*E^2), that doesn't seem too bad.  I know we generally try to avoid extra linear growth, but it seems like you could a) reduce the graph substantially without much work, and b) cap the region you look at.  

Here's one possible approach.  Don't take this too seriously, I'm just thinking through the implications.

Start by computing forward availability for the location in question.  (In particular, I'm assuming availability starts at the earliest point on a path a load would be legal, not the first place it occurs.  This isn't(?) quite the standard definition.)

Pose a weighted min-cut problem on the subset of the CFG identified as being fully available.  The source vertices are the nodes with no available predecessors, the sink vertices are the *uses* of the current loads, and the weights are profile weights plus some factor to represent code size.
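As a sanity check on the framing, here is a toy Edmonds-Karp max-flow (my own sketch, with the O(V*E^2) bound mentioned above); by the max-flow/min-cut theorem, the flow value equals the capacity of the min cut, and the saturated cut edges would be the candidate load insertion points. Edge capacities stand in for the profile-weight-plus-size cost.

```python
from collections import defaultdict, deque

def max_flow(capacity, source, sink):
    """Edmonds-Karp: repeatedly push flow along a BFS-shortest augmenting
    path in the residual graph.  capacity[u][v] is the edge weight."""
    # Symmetric adjacency so BFS can traverse residual (reverse) edges.
    adj = defaultdict(set)
    for u in list(capacity):
        for v in list(capacity[u]):
            adj[u].add(v)
            adj[v].add(u)
    flow = defaultdict(lambda: defaultdict(int))
    total = 0
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and capacity[u][v] - flow[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total           # no augmenting path: flow is maximal
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(capacity[u][v] - flow[u][v] for u, v in path)
        for u, v in path:
            flow[u][v] += bottleneck
            flow[v][u] -= bottleneck
        total += bottleneck

# Tiny diamond: the source feeds a hot edge (weight 100) and a cold edge
# (weight 1).  The min cut (1 + 10 = 11) avoids cutting the hot edge,
# i.e. placement prefers the cheap edges, which is the whole point.
cap = defaultdict(lambda: defaultdict(int))
cap["s"]["hot"] = 100
cap["s"]["cold"] = 1
cap["hot"]["t"] = 10
cap["cold"]["t"] = 10
assert max_flow(cap, "s", "t") == 11
```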

Before running mincut, summarize any loop in the original CFG which is fully available (and thus completely in the refined graph) with a single node.  We statically know that placing the load in the loop's preheader is at least as good as any other placement in the loop. 

Before running mincut, reduce weights to the minimum of all weights in dominating and "obviously" post-dominating nodes.  This is just to reduce the amount of work path finding might do in a subgraph.  It's not clear this would actually be worthwhile.  (If we had a postdominator tree, we could remove the "obviously" part of that.)

To cap the region, we could do something like only include the subgraph contained by the parent of the innermost loop which contains the loads from the location we're interested in.  We'd then have to run this iteratively (which is potentially faster due to loop summarization above), but we should reach the same final result...

(The starting assumption in the above is that we already have both dom tree information and loop information.  We can exploit that in the max flow problem.)

There might also be room for a "cheap PRE" and an "expensive PRE" here.  Particularly if the cheap PRE got things out of loops and the expensive PRE exploited the loop summarization trick...  Or vice versa possibly; getting loads out of loops is where it's worth spending time, so maybe we do an expensive PRE only on loops and the cheap version outside of inner loops? 

Again, this feels like a topic where a) I need to read a lot more papers, and b) it would be good to talk in person. 
I haven't gotten through this yet, but I'll add it to my reading list.
In the end though, you are doing the work, so you get to decide what you want to do :)
Yes, but I need to get it accepted by reviewers too.  :)

Philip


_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev