Separating loop nests based on profile information?


Separating loop nests based on profile information?

Philip Reames-4
I've been playing with approaches to getting better optimization of loops which contain infrequently executed slow paths.  I've gotten as far as throwing together a proof of concept implementation of a profile guided optimization to separate a single loop with multiple latches into a loop nest, but I want to get feedback from interested parties before investing much more effort. 

The primary case I'm concerned about looks something like this C example:
while (some_condition) {
  // do lots of work here
  if (!some_rare_condition) { // this condition is loop-variant
    helper();
  }
}

The key point here is that the 'do lots of work' section contains things we could lift out of the loop, but our knowledge of memory gets completely destroyed by the 'helper' call, which doesn't get inlined.  This ends up crippling LICM in particular, and GVN to a lesser degree. 

The approach I've been playing with would create this loop structure:
goto inner;
while (true) {
  outer:
  helper();
  inner:
  while (some_condition) {
    // do lots of work here
    if (!some_rare_condition) { // this condition is loop-variant
      goto outer;
    }
  }
  break; // inner loop exited normally; leave the nest
}

(Excuse the pseudo-C.  Hopefully the idea is clear.)

You can see my stalking horse implementation here:
http://reviews.llvm.org/D6872

This is not an actual patch per se; it's just intended to make the discussion more concrete and thus hopefully easier to follow. 

The basic idea of this patch is to use profiling information about the frequency of backedges to separate a loop with multiple latches into a loop nest, with the rare backedge becoming the outer loop. We already use a set of static heuristics based on non-varying arguments to PHI nodes to do the same type of thing.

The motivation is that doing this pulls rarely executed code out of the innermost loop and tends to greatly simplify analysis and optimization of that loop. In particular, it can enable substantial LICM gains when there are clobbering calls down rare paths. The original motivating case for this was the slow path of a safepoint poll, but I believe the possible benefit is more general as well.
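
To make the LICM payoff concrete, here is a small runnable toy of the separated nest (a hand-written sketch; s_x, the conditions, and the constants are illustrative stand-ins, not anything from the patch).  The load of s_x stays in a register for the whole inner loop because helper(), the only clobbering call, now sits on the rare outer path:

#include <stdio.h>

static int s_x = 7;                     /* stands in for this->x */
static void helper(void) { s_x = 8; }   /* the rare, clobbering call */

int main(void) {
  int i = 0, acc = 0;
  int x = s_x;          /* hoisted by hand, as LICM could now do */
  goto inner;
  while (1) {
outer:
    helper();
    x = s_x;            /* reload only after the clobbering call */
inner:
    while (i < 100) {   /* some_condition */
      acc += x;         /* "lots of work" using the hoisted value */
      i++;
      if (i % 30 == 0)  /* stands in for the rare condition */
        goto outer;
    }
    break;
  }
  printf("acc = %d\n", acc);
  return 0;
}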

Points for discussion:

  • Is using profile information for this purpose even a reasonable thing to do?
  • I chose to implement this without relying on the existing block frequency analysis. My reasoning was that a) this is a rarely taken case and adding an expensive analysis dependency probably wasn't worthwhile and b) that block frequency analysis was more expensive/precise than I really needed. Is this reasonable?
  • If so, is the notion of 'rareness' of a loop block something that's worth extracting out on its own and reusing? Are there other similar uses anyone can think of?
  • Currently, I'm only supporting a fairly small set of controlling conditions. Are there important cases I'm not considering?
  • Since the rarest latch is often deep in a loop - with other "if (X) continue;" (i.e. latches) before it - this tends to create loops with multiple exiting blocks. Some of the existing passes might not deal with this well; is that a major concern? Suggestions for how to analyze and validate?
  • Currently, I've structured this as pulling off the rarest latch as an outer iteration. I could also pull off the most popular latch as an inner iteration. This might give different tradeoffs; thoughts?

Generally, any thoughts anyone has on the problem or approach are welcome. I'm not particularly attached to the particular approach laid out here, and if there's a more advantageous approach, all the better.







Re: Separating loop nests based on profile information?

Chandler Carruth-2

On Wed, Jan 7, 2015 at 5:19 PM, Philip Reames <[hidden email]> wrote:
(Excuse the pseudo-C.  Hopefully the idea is clear.)

Yep.

I've not thought about this a lot, but I have two initial questions that maybe you can answer:

How does this compare with classical approaches of loop peeling, partitioning, fission, or whatever you might call it? Is there any literature behind this approach or some literature it should be compared with? (I genuinely don't know this area that well, so I'm of little help here...)
 

Some of your points I have quick feedback on:

Points for discussion:

  • Is using profile information for this purpose even a reasonable thing to do?
Yes!
 
  • I chose to implement this without relying on the existing block frequency analysis. My reasoning was that a) this is a rarely taken case and adding an expensive analysis dependency probably wasn't worthwhile and b) that block frequency analysis was more expensive/precise than I really needed. Is this reasonable?
I think we should always use the analyses. Either BlockFrequency or BranchProbability. I think probably both in the common joint usage (frequency of the loop header combined with probability of the cold region).
  • If so, is the notion of 'rareness' of a loop block something that's worth extracting out on its own and reusing? Are there other similar uses anyone can think of?
  • Currently, I'm only supporting a fairly small set of controlling conditions. Are there important cases I'm not considering?
To both of these, I think the general combination to use is to identify the set of blocks dominated by a block which is in the loop body of a hot loop, and is cold relative to the other successors of its predecessor(s). These form cold "regions" as I think of them without requiring the complexity of the region analysis.
 
  • Since the rarest latch is often deep in a loop - with other "if (X) continue;" (i.e. latches) before it - this tends to create loops with multiple exiting blocks. Some of the existing passes might not deal with this well; is that a major concern? Suggestions for how to analyze and validate?
I'm somewhat concerned about this, but want to think more about the fundamental transformation.
 
  • Currently, I've structured this as pulling off the rarest latch as an outer iteration. I could also pull off the most popular latch as an inner iteration. This might give different tradeoffs; thoughts?

Generally, any thoughts anyone has on the problem or approach are welcome. I'm not particularly attached to the particular approach laid out here, and if there's a more advantageous approach, all the better.

Thanks for pushing on this! ;] Now I need to go and ponder a lot so I can reply more deeply on the actual transform.



Re: Separating loop nests based on profile information?

Adam Nemet

On Jan 7, 2015, at 5:33 PM, Chandler Carruth <[hidden email]> wrote:


How does this compare with classical approaches of loop peeling, partitioning, fission, or whatever you might call it? Is there any literature behind this approach or some literature it should be compared with? (I genuinely don't know this area that well, so I'm of little help here…)

I think it's pretty different from loop distribution/fission.  In loop distribution, in order to split the loop into multiple consecutive loops, you need to reason about the memory references, so having a function call makes that difficult/impossible.  Philip's idea works around this exact issue.
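
(As a refresher, a minimal sketch of what classical loop distribution does; the names are illustrative, and the restrict qualifiers stand in for the no-alias facts the compiler would otherwise have to prove:)

#include <stddef.h>

/* Splitting is legal only because nothing carries a dependence from
 * the second statement back to the first; an opaque call such as
 * helper() in the loop would block exactly this reasoning. */
void distributed(double * restrict a, double * restrict b, size_t n) {
  /* Original fused loop:
   *   for (i = 0; i < n; i++) { a[i] += 1.0; b[i] = 2.0 * a[i]; }
   */
  for (size_t i = 0; i < n; i++)
    a[i] += 1.0;
  for (size_t i = 0; i < n; i++)
    b[i] = 2.0 * a[i];
}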

I explored this space quite a bit recently because we’re working on a proposal to add loop distribution to LLVM.  I hope to send out the proposal this week.

So I see no issue with handling loops that contain low-probability function calls with this, and fully self-contained loops with classical loop distribution.

Thanks,
Adam



Re: Separating loop nests based on profile information?

Philip Reames-4
In reply to this post by Chandler Carruth-2

On 01/07/2015 05:33 PM, Chandler Carruth wrote:


I've not thought about this a lot, but I have two initial questions that maybe you can answer:

How does this compare with classical approaches of loop peeling, partitioning, fission, or whatever you might call it? Is there any literature behind this approach or some literature it should be compared with? (I genuinely don't know this area that well, so I'm of little help here...)
I'm not aware of any literature that covers the specific transform I'm suggesting here.  This is probably a lack of awareness on my part, not a lack of literature though.  While I have some basic background in the area, I can't claim to have made an extensive study of loop optimization techniques.  *Particularly* profile guided ones. 

w.r.t. the tradeoffs in practice, I'm going to answer this in a separate response just to keep this one short.  I've been playing with both approaches and there appears to be appeal in both.  I'm currently leaning towards a mixture of both approaches. 

 

  • I chose to implement this without relying on the existing block frequency analysis. My reasoning was that a) this is a rarely taken case and adding an expensive analysis dependency probably wasn't worthwhile and b) that block frequency analysis was more expensive/precise than I really needed. Is this reasonable?
I think we should always use the analyses. Either BlockFrequency or BranchProbability. I think probably both in the common joint usage (frequency of the loop header combined with probability of the cold region).
Let me ask a basic question: what do BlockFrequency and BranchProbability compute and roughly what mental cost model should I have?  Once I have a BlockFrequency, what does it mean?  Some of this is documented in BlockFrequencyImpl.h, but the high level bits are mostly missing.  Particularly for BranchProbability. 

I chose to write a custom analysis in large part because I didn't understand the existing analyses, and my (possibly completely wrong) perception is that they're overly precise/expensive for what I need.  I wanted to focus on writing the transform since that's the point I actually wanted to discuss.  :)

  • If so, is the notion of 'rareness' of a loop block something that's worth extracting out on its own and reusing? Are there other similar uses anyone can think of?
  • Currently, I'm only supporting a fairly small set of controlling conditions. Are there important cases I'm not considering?
To both of these, I think the general combination to use is to identify the set of blocks dominated by a block which is in the loop body of a hot loop, and is cold relative to the other successors of its predecessor(s). These form cold "regions" as I think of them without requiring the complexity of the region analysis.
I get why you find this a useful mental model, but the transform is far harder to express this way in the couple of cases I've looked at.  It's much easier to start with a latch, reason about its hotness, and then let the generic loop code construct the cold "region".

Then again, if you were to ask for the set of latches which were dominated by blocks in the cold-with-hot-predecessors list, that might be interesting.  But again, I'd really like to start from the latches since that's what controls the legality/profitability of the transform.  Hm, we could walk back along the dom tree from the latch looking for such a cwhp block... This might be a better way to think about the profitability of the transform, even starting from the latch.  Is that what you were trying to get at to start with?  Or were you looking for something more general?

 
Thanks for pushing on this! ;] Now I need to go and ponder a lot so I can reply more deeply on the actual transform.
You're welcome.  :)

This might be an area where it's profitable to talk in person; are you going to be at the social tonight?

Philip



Re: Separating loop nests based on profile information?

Chandler Carruth-2

On Thu, Jan 8, 2015 at 11:37 AM, Philip Reames <[hidden email]> wrote:
This might be an area where it's profitable to talk in person; are you going to be at the social tonight?

Just to reply to this most pertinent detail: yes. =D 


Re: Separating loop nests based on profile information?

Philip Reames-4
In reply to this post by Chandler Carruth-2

On 01/07/2015 05:33 PM, Chandler Carruth wrote:
> How does this compare with classical approaches of loop peeling,
> partitioning, fission, or whatever you might call it?
I'm still developing a good sense for this, but let me lay out some
observations.  Fair warning, this is going to be long and rambling.

Let's start with a toy example:
while(c) {
   x = this->x;
   y = this->y;
   if (x == y) {
     rare();
   }
}

If we could tell x and y were loop invariant, we could unswitch this
loop.  However, the rare call clobbers our view of memory, so LICM
fails, and thus unswitch fails.

We'd like to apply PRE here and push the reload into the loop preheader
and the rare case.  This currently fails for two reasons: 1) We don't
allow code duplication during PRE, and 2) the canonical loop-simplify
form of having a single latch means that PRE doesn't actually see the
rare block at all; it sees the preheader and the join point after the if
block.

I think both of the previous are fixable:
1) This calls for profile guided duplication.  If we can trade one hot
load for two cold ones, that's great.  Also, there might be static
heuristics we could construct.  (e.g. if it's not available in the
preheader and one of many latches...)
2) We need PRE to look back through merge blocks.  This should possibly
even live in MemoryDependenceAnalysis.  Alternatively, we could create a
"deconstruct loop simplify form" pass to run right before GVN to side
step this issue.

Neither is entirely simple to implement, but both are doable. Probably.
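
For concreteness, here is roughly that end state applied by hand (an
illustrative sketch, not actual compiler output; the loop is rewritten
as a counted loop for simplicity):

struct T { int x, y; };
extern void rare(void);

int pre_by_hand(struct T *t, int n) {
  int acc = 0;
  int x = t->x;              /* load inserted in the preheader */
  int y = t->y;
  for (int i = 0; i < n; i++) {
    acc += x + y;            /* hot path uses the available values */
    if (x == y) {
      rare();                /* clobbers memory... */
      x = t->x;              /* ...so reload on the rare path only */
      y = t->y;
    }
  }
  return acc;
}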

In theory, we've now solved the same problem my loop nest transform
does.  There are two snags: one minor(ish), one major.

The minor one is that many of my rare conditions involve volatile
loads.  MemDepAnalysis currently completely gives up when faced with
volatile loads.  This is mostly fixable, but given the structure of the
code may be fairly invasive.  Oddly, LICM actually performs better in
this case.

For the major one, let's tweak our example slightly:
while(c) {
   x = this->x;
   if (x == NULL) return;
   y = this->y;
   if (y == NULL) return;
   if (x == y) {
     rare();
   }
}

Let's assume that PRE worked as designed for 'x'.  We now need to
perform a transform analogous to loop unswitch, but a bit different.  We
need to move the conditional loop exits into each of the predecessors
where the answer is not available.  I don't believe we have anything like
this today, and it involves a non-trivial bit of code.  Even if we get
that done, we're left with a pass ordering problem where we need to run
PRE, then branch-PRE, then PRE, then branch-PRE, etc..  (In practice,
these chains can be quite long due to null checks and range checks.)

If we'd split this into a loop nest structure, we'd still have the
chain, but a) that's easier to control with a custom pass order since
LICM and LoopUnswitch are far cheaper than GVN/PRE and b) having LICM do
a bit of trivial loop unswitch for the terminator of the header appears
quite tractable.

One other point worth mentioning is that LLVM does not currently have a
loop peeling pass.  LoopRotate seems to get us most of the way there in
common cases, but not always.  If we did peel the above loop once, we'd
have the same merge block problem I mentioned previously (this time in
the preheader *and* the latch), but the profitability becomes slightly
more clear without the need for profile information or loop specific
heuristics.
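
For illustration, here is the simpler toy loop from above peeled once
by hand (names again illustrative):

struct T { int x, y; };
extern void rare(void);
extern int cond(void);

void peeled(struct T *t) {
  if (cond()) {              /* peeled first iteration */
    int x = t->x;
    int y = t->y;
    if (x == y) rare();
    while (cond()) {         /* remaining iterations */
      x = t->x;              /* partially redundant with the peeled load */
      y = t->y;
      if (x == y) rare();
    }
  }
}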

To be clear, I'm not opposed to fixing some of the issues above.  I
think that's a great idea, I'm just trying to find my way through to
practical performance improvements with minimal invested effort.  I
probably will in fact fix some of these since they show up in a number
of other cases as well.


Ok, so that's that.  Let's move on to why my proposed loop nest
separation change is a bit scary.

While the loop nest does allow things to be lifted from the inner loop
to the outer loop - without any of the complex PRE or PGO reasoning
required above - it suffers from all of the limitations of the existing
loop optimization passes.  In particular, LICM may not actually be able
to do much if the inner loop is still complicated enough to defeat
either its fairly simple alias analysis or fault analysis.  GVN/PRE is
strictly more powerful - in theory if not always practice - here than
what we have in LICM.

We've also potentially created a loop nest which is harder to analyse
than the original loop with two latches.  Consider this example:
for(int i = 0; i < 20000; ) {
    i++;
    if(i % 20 != 0) continue;
    a[i] = 0;
}
(This is not a real example of anything useful.  The important point is
the early return is the common case.)

This loop clearly runs exactly 20000 times.  ScalarEvolution almost
certainly gets this right. (I haven't checked.)

After separation, we'd have this:

int i = 0;
while(true) {
   while(i < 20000) {
      i++;
      if(i % 20 == 0) { // now a loop exit
        a[i] = 0;
        goto next;
      }
   }
   break;
   next: ;
}

It's not clear to me that SE will always be able to tell that both the
inner and outer loops run a maximum of 20000 times.  (In this particular
case it might actually optimize the inner loop substantially and thus
get a more precise answer, but let's ignore that.)  We could potentially
lose information about the loop by doing this transformation.  Of
course, this concern applies to the *existing* loop nest separation
heuristics too.  I don't have any particular reason to believe one
heuristic is worse than the other, but I also have no reason not to
believe that either.

As a final point, the inner loop now has two exiting blocks, not one.  I
know at least some of our loop transforms give up on anything with more
than one loop exit.  I'm less concerned by that since a) we should fix
them, and b) the cases I've looked at don't look too hard.


And that's everything that occurs to me at the moment.  I'll probably
have to append in further responses as I remember more details; I know
I'm forgetting some pieces.

Philip




Re: Separating loop nests based on profile information?

Philip Reames-4
In reply to this post by Adam Nemet

On 01/07/2015 06:42 PM, Adam Nemet wrote:


I think it's pretty different from loop distribution/fission.  In loop distribution, in order to split the loop into multiple consecutive loops, you need to reason about the memory references, so having a function call makes that difficult/impossible.  Philip's idea works around this exact issue.

I explored this space quite a bit recently because we’re working on a proposal to add loop distribution to LLVM.  I hope to send out the proposal this week.
Just to note, I'm going to be very interested in seeing this.  I have concerns about profitability, but having a mechanism is clearly a good thing.  :)

Are you also looking at loop fusion?  I've seen that come up in practice for a few cases where it would *really* help.  Generally, filter/reduce patterns implemented by hand.  Nowhere near the top of my priority list, but if you happened to be working on it, I'm happy to help review and brainstorm. 


Re: Separating loop nests based on profile information?

Philip Reames-4
In reply to this post by Philip Reames-4

On 01/08/2015 12:33 PM, Philip Reames wrote:

> We'd like to apply PRE here and push the reload into the loop
> preheader and the rare case.  This currently fails for two reasons: 1)
> We don't allow code duplication during PRE, and 2) the canonical
> loop-simplify form of having a single latch means that PRE doesn't
> actually see the rare block at all, it sees the preheader and the join
> point after the if block.
>
> I think both of the previous are fixable:
> 1) This calls for profile guided duplication.  If we can trade one hot
> load for two cold ones, that's great.  Also, there might be static
> heuristics we could construct.  (e.g. if it's not available in the
> preheader and one of many latches...)
> 2) We need PRE to look back through merge blocks.  This should
> possibly even live in MemoryDependenceAnalysis.  Alternatively, we
> could create a "deconstruct loop simplify form" pass to run right
> before GVN to side step this issue.
Point 2 here may be inaccurate.  I've definitely seen problems here in
the past, but when I sat down to write some test cases, it looks like
the simple merge block for the latch (with the load being considered in the
loop header) works just fine.  I suspect my original example is actually
blocked by something else; I'm just not sure what that is yet.

One case where I can get PRE to fail is if there's a load in one of two
paths through the loop, in the preheader, but not in the header.  Here's
the IR:
; Function Attrs: readonly
declare void @hold(i32) #0

declare void @clobber()

define i32 @test1(i1 %cnd, i32* %p) {
entry:
   %v2 = load i32* %p
   call void @hold(i32 %v2)
   br label %header

header:                                           ; preds = %merge, %entry
   br i1 %cnd, label %bb1, label %bb2

bb1:                                              ; preds = %header
   %v1 = load i32* %p
   call void @hold(i32 %v1)
   br label %merge

bb2:                                              ; preds = %header
   call void @clobber()
   br label %merge

merge:                                            ; preds = %bb2, %bb1
   br label %header
}

attributes #0 = { readonly }

I haven't looked to see what the issue here is yet.




Re: Separating loop nests based on profile information?

Duncan P. N. Exon Smith
In reply to this post by Philip Reames-4

> On 2015-Jan-08, at 11:37, Philip Reames <[hidden email]> wrote:
>
> Let me ask a basic question: what do BlockFrequency and BranchProbability compute and roughly what mental cost model should I have?  Once I have a BlockFrequency, what does it mean?  Some of this is documented in BlockFrequencyImpl.h, but the high level bits are mostly missing.  Particularly for BranchProbability.  

*Branch weight*: An integer attached to an outgoing edge from a basic
block that represents the relative likelihood that it will be taken
over other outgoing edges.

*Branch probability*: Assuming a basic block is executed, what's the
probability it will terminate via a given outgoing edge?  This is the
branch probability for that branch.  Equivalent to the branch weight
divided by the sum of the branch weights for the basic block.

If we don't have profile information, branch weights (and thus,
probabilities) are computed using heuristics in -branch-prob.

*Block frequency*: Assume a function is executed many times -- call this
large number the block frequency of the entry block.  Given that, how
many times is each basic block executed?  This is the block frequency of
each block.  Frequencies are calculated from branch probabilities in
-block-freq.  You can compare relative "hotness" of two basic blocks by
comparing their block frequencies.  (A block frequency means nothing in
isolation -- you need to compare it to a reference block frequency, like
the frequency of a loop header, or of the entry block, or of something
else, for it to have any meaning.)
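
A toy calculation may make these terms concrete (invented numbers, and
plain C rather than the LLVM APIs):

#include <stdio.h>

int main(void) {
  /* Say the loop's controlling block has weights 1999:1 for staying
   * in the loop vs. taking the rare path to helper(). */
  double w_hot = 1999.0, w_rare = 1.0;

  /* Branch probability = weight / sum of this block's weights. */
  double p_hot  = w_hot  / (w_hot + w_rare);   /* 0.9995 */
  double p_rare = w_rare / (w_hot + w_rare);   /* 0.0005 */

  /* Block frequency is relative: with an entry frequency of 1.0 and a
   * loop header running 2000 times per entry, the rare block runs
   * header_freq * p_rare times, i.e. about once per function entry. */
  double header_freq = 2000.0;
  printf("p_hot=%g p_rare=%g rare_freq=%g\n",
         p_hot, p_rare, header_freq * p_rare);
  return 0;
}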

*Block bias* (PR20316 [1]): Is this basic block more or less likely to
be executed than its "peers"?  The current proposal calculates a
"reference" block frequency by assuming all branch weights are 1 (and,
thus, that all branch probabilities are even), and defines the block
bias to be the ratio of the block frequency to its "reference" block
frequency.  It may have some bitrot, but wouldn't be hard for me to
clean up and commit (if it's even a useful metric).

After I wrote this out, I remembered that I already documented this [2].
(If you think the descriptions I just gave here are more useful than
what's in tree, please improve the in-tree version!)

[1]: http://llvm.org/bugs/show_bug.cgi?id=20316
[2]: http://llvm.org/docs/BlockFrequencyTerminology.html



Re: Separating loop nests based on profile information?

Daniel Berlin
In reply to this post by Philip Reames-4


On Thu, Jan 8, 2015 at 2:37 PM, Philip Reames <[hidden email]> wrote:

I'm not aware of any literature that covers the specific transform I'm suggesting here.  This is probably a lack of awareness on my part, not a lack of literature though. 


This is essentially index set splitting with the goal of it being an enabling optimization for LICM.

(At least, as I've seen it implemented in other compilers)
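
For reference, a minimal sketch of index set splitting (illustrative;
f and g stand in for arbitrary work):

#include <stddef.h>

/* The iteration space is split where the predicate on the induction
 * variable flips, so each sub-loop carries no branch at all. */
void split(size_t n, size_t m, void (*f)(size_t), void (*g)(size_t)) {
  /* Original: for (i = 0; i < n; i++) { if (i < m) f(i); else g(i); } */
  size_t mid = (m < n) ? m : n;
  size_t i;
  for (i = 0; i < mid; i++)
    f(i);                    /* i < m always holds here */
  for (; i < n; i++)
    g(i);                    /* i >= m always holds here */
}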



Re: Separating loop nests based on profile information?

Daniel Berlin
In reply to this post by Philip Reames-4


On Thu, Jan 8, 2015 at 3:33 PM, Philip Reames <[hidden email]> wrote:

On 01/07/2015 05:33 PM, Chandler Carruth wrote:
How does this compare with classical approaches of loop peeling, partitioning, fission, or whatever you might call it?
I'm still developing a good sense for this, but let me lay out some observations.  Fair warning, this is going to be long and rambling.

Let's start with a toy example:
while(c) {
  x = this->x;
  y = this->y;
  if (x == y) {
    rare();
  }
}


 
If we could tell x and y were loop invariant, we could unswitch this loop.  However, the rare call clobbers our view of memory, so LICM fails, and thus unswitch fails.

 
We'd like to apply PRE here and push the reload into the loop preheader and the rare case.  This currently fails for two reasons: 1) We don't allow code duplication during PRE,

?????
If we don't, we aren't doing real PRE. So I'm not entirely sure what you mean here.

 
and 2) the canonical loop-simplify form of having a single latch means that PRE doesn't actually see the rare block at all, it sees the preheader and the join point after the if block.

 

I think both of the previous are fixable:


GCC's PRE already does the above.
It doesn't do profile guided duplication.
We aren't doing anything special with these blocks.

Here is the code I used to test:

struct a {
  int x;
  int y;
};
extern void rare();
int mai(int c, struct a *this)
{
  int d = 0;
  while (c) {
    int x = this->x;
    int y = this->y;
    d += x + y;
    if (x == y) {
      rare();
    }
  }
  return d;
}


It will do exactly what you expect; it is transformed into:

struct a {
  int x;
  int y;
};
extern void rare();
int mai(int c, struct a *this)
{
  int d = 0;
  int pretemp1 = this->x;
  int pretemp2 = this->y;

  while (c) {
    /* phi() is pseudo-notation for the SSA merge of the two
       reaching definitions */
    pretemp1phi = phi(rare block pretemp1, preheader pretemp1);
    pretemp2phi = phi(rare block pretemp2, preheader pretemp2);

    d += pretemp1phi + pretemp2phi;
    if (pretemp1phi == pretemp2phi) {
      rare();
      pretemp1 = this->x;
      pretemp2 = this->y;
    }
  }
  return d;
}
I don't see why profile guided duplication is necessary here. This is a basic load PRE case.  It is handled by the first version of GVN-based Load PRE I wrote for GCC.  It is always a win.

Looking at what LLVM does, the failing on the PRE side is that our PRE/GVN models are not strong enough to handle this. I'm not sure at all why we think anything else is necessary.  It's certainly not requiring special code duplication heuristics, etc.

So either you are thinking of a case that is different from the above, or I am seriously confused :)


Re: Separating loop nests based on profile information?

Daniel Berlin

Looking at what LLVM does, the failing on the PRE side is that our PRE/GVN models are not strong enough to handle this. I'm not sure at all why we think anything else is necessary. 

By this I mean we don't actually perform real PRE for loads. It's a known deficiency and one that does not require any duplication heuristics to solve.
Right now we perform some availability-based removal that happens to be willing to do an insertion here or there to try to move a load in some cases (it will only do a single insertion).  It's also not based on real availability, but on finding the nearest dependencies of loads and squinting at them (which is not even really the same as finding where a load stops being available).

It's also not real PRE, just some code that is willing to insert one load to move a load.

If you want this case to work, what you need is a real Load PRE implementation, not one based on simple memdep based load moving.

Trying to hack this one up with profile guided code duplication, etc, to get it to kinda sorta work seems to me to be a bad idea.
  



Re: Separating loop nests based on profile information?

Philip Reames-4
In reply to this post by Daniel Berlin

On 01/11/2015 10:17 PM, Daniel Berlin wrote:


We'd like to apply PRE here and push the reload into the loop preheader and the rare case.  This currently fails for two reasons: 1) We don't allow code duplication during PRE,

?????
If we don't, we aren't doing real PRE. So I'm not entirely sure what you mean here.
As I think you comment elsewhere in your response, we currently do not duplicate loads; we only move them.  The current structure is based around finding a legal point which doesn't introduce a load along any path where one didn't exist previously.  If we have a path which has two copies of the load, we try to find a place to move one of them such that all paths still have the available load and we've removed the extra load along the one path. 

(Note: The above explanation is rough and does not parallel how the code is organized.)



 
I don't see why profile guided duplication is necessary here. This is a basic load PRE case.  It is handled by the first version of GVN-based Load PRE I wrote for GCC.  It is always a win.
Well, to be pedantic, it is not always profitable.  If this loop runs exactly one iteration, this is both a code size pessimization and possibly a runtime execution pessimization.  While in this particular case, you probably don't care, I'd worry that an algorithm which gets this might also have other pessimization cases which are more worrying. 

However, I think I've actually come to the same underlying conclusion.  This is a case that our PRE algorithm should handle without needing to resort to profiling data. 

I have to admit that I am unfamiliar with the GCC implementation.  Could you outline the algorithm and its important concepts?  (links are fine too)


Looking at what LLVM does, the failing on the PRE side is that our PRE/GVN models are not strong enough to handle this. I'm not sure at all why we think anything else is necessary.  It's certainly not requiring special code duplication heuristics, etc.

So either you are thinking of a case that is different from the above, or I am seriously confused :)
Well, I think we've gotten to the same point here.  An improved PRE implementation should handle most of the cases I've actually encountered.  Having said that, I'm not yet willing to say that the profile guided loop splitting isn't *also* worth exploring.  From a practical perspective, a lot of our existing optimizations are structured around loops, and letting those kick in without falling back to (expensive) GVN might be worthwhile.  This is as much a pass ordering problem as anything else. 

p.s. I've started to post patches which improve the results given by our current PRE to the llvm-commits list.  So far, it's just simple stuff, but that's the direction I'm currently working in.  I'm going to defer looking into the loop nest splitting until I've pushed PRE as far as I easily can. 

Philip



Re: Separating loop nests based on profile information?

Philip Reames-4
In reply to this post by Daniel Berlin

On 01/11/2015 10:35 PM, Daniel Berlin wrote:

Looking at what LLVM does, the failing on the PRE side is that our PRE/GVN models are not strong enough to handle this. I'm not sure at all why we think anything else is necessary. 

By this I mean we don't actually perform real PRE for loads. It's a known deficiency and one that does not require any duplication heuristics to solve.
Right now we perform some availability-based removal that happens to be willing to do an insertion here or there to try to move a load in some cases (it will only do a single insertion).  It's also not based on real availability, but on finding the nearest dependencies of loads and squinting at them (which is not even really the same as finding where a load stops being available).

It's also not real PRE, just some code that is willing to insert one load to move a load.
Er, I think we have a terminology problem here.  From what I can gather, our current implementation *is* an implementation of classic PRE.  It's more or less directly organized around the classic "available" and "anticipated" concepts from the literature.  The *implementation* has some limitations, but the conceptual framework is there*.  One of the key invariants in classic PRE is that you do not change program behaviour on any path.  As a result, moving (or at most duplicating) loads is pretty much all that happens.

* It took me a few hours of squinting at the code to recognize that weird backwards walk as being an implementation of "anticipated", but it is.  The "available" part comes from MemoryDependenceAnalysis and the forward dataflow algorithm based on those inputs.
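
A tiny example of the two properties (illustrative C; the names are
made up):

#include <stdio.h>

/* *q is loaded on the if-path and again below the join, so the later
 * load is partially redundant: it is "anticipated" at the end of the
 * else-block (executed on every path leaving it) but not "available"
 * there (not computed on every path reaching it).  Classic PRE inserts
 * a load in the else-block and then reuses it below. */
int f(int p, int *q) {
  int t = 0;
  if (p) {
    t = *q;      /* *q is available leaving this block */
  } else {
    /* PRE insertion point: anticipated but not available */
  }
  int u = *q;    /* partially redundant load */
  return t + u;
}

int main(void) {
  int x = 42;
  printf("%d\n", f(1, &x));
  return 0;
}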

If you want to start inserting loads on paths which didn't previously have them - for example the loop preheader in your example in the previous email - this is not classic PRE as I'm familiar with it.  It can be very profitable and worthwhile, but is not classic PRE.   I'm not sure what this variant is known as in the literature. 

(If I'm being overly pedantic here, or just flat out wrong, please say so.  I'm still exploring related material and am trying to categorize things in my own head.)

I think we're actually in reasonable agreement that we need to move beyond 'classic PRE', as I called it above.  If you have specific suggestions, positive or negative examples, or other thoughts to share, I'd very much value the input. 

If you want this case to work, what you need is a real Load PRE implementation, not one based on simple memdep based load moving.

Trying to hack this one up with profile guided code duplication, etc, to get it to kinda sorta work seems to me to be a bad idea.
Not sure I really agree here, but I'm open to pushing as far as we can with static heuristics first.  I suspect that's likely to be more stable w.r.t. optimization effect and applicable to more cases. 
  


Re: Separating loop nests based on profile information?

Mehdi Amini-2

On Jan 12, 2015, at 10:08 AM, Philip Reames <[hidden email]> wrote:


> [...]
> * It took me a few hours of squinting at the code to recognize that weird backwards walk as being an implementation of "anticipated", but it is.

Did you add some comments? It may save a few hours of squinting to someone else in the future :)

Mehdi



The "available" part comes from MemoryDependenceAnalysis and the forward dataflow algorithm based on those inputs.

If you want to start inserting loads on paths which didn't previously have them - for example the loop preheader in your example in the previous email - this is not classic PRE as I'm familiar with it.  It can be very profitable and worthwhile, but is not classic PRE.   I'm not sure what this variant is known as in the literature. 

(If I'm being overly pedantic here, or just flat out wrong, please say so.  I'm still exploring related material and am trying to categorize things in my own head.)

I think we actually in reasonable agreement that we do need to move beyond 'classic PRE' as I called it above.  If you have specific suggestions, positive or negative examples, or other thoughts to share, I'd very much value the input. 

If you want this case to work, what you need is a real Load PRE implementation, not one based on simple memdep based load moving.

Trying to hack this one up with profile guided code duplication, etc, to get it to kinda sorta work seems to me to be a bad idea.
Not sure I really agree here, but I'm open to pushing as far as we can with static heuristics first.  I suspect that's likely to be more stable w.r.t. optimization effect and applicable to more cases. 
  
It's certainly not requiring special code duplication heuristics, etc.

So either you are thinking of a case that is different from the above, or I am seriously confused :)


_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev


_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: Separating loop nests based on profile information?

Philip Reames-4

On 01/12/2015 10:35 AM, Mehdi Amini wrote:

> Did you add some comments? It may save a few hours of squinting to someone else in the future :)
Not yet, but yes, I plan to.

Having said that, part of my trouble was that I didn't have the background to understand what I was seeing at first.  I had to go do some literature search and then things made (slightly) more sense. 


Re: Separating loop nests based on profile information?

Krzysztof Parzyszek
In reply to this post by Daniel Berlin
On 1/12/2015 12:04 AM, Daniel Berlin wrote:
>
> This is essentially index set splitting with the goal of it being an
> enabling optimization for LICM.

This is very different from index set splitting.  In ISS you know ahead
of time into what ranges you can divide the values of the induction
variable.  Here you don't.
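
For reference, a toy example of what ISS does (hand-written):

/* Before: the branch depends only on the induction variable. */
void before(int *a, int n, int m) {
  for (int i = 0; i < n; i++) {
    if (i < m)
      a[i] += 1;
    else
      a[i] += 2;
  }
}

/* After ISS: the index set [0,n) is split at a bound that is
   computable before the loop runs, and the branch disappears. */
void after(int *a, int n, int m) {
  int split = m < 0 ? 0 : (m < n ? m : n);
  for (int i = 0; i < split; i++)
    a[i] += 1;
  for (int i = split; i < n; i++)
    a[i] += 2;
}

The rare-latch condition has no such precomputable bound, which is why this is a different transformation.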

-Krzysztof

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

Re: Separating loop nests based on profile information?

Krzysztof Parzyszek
In reply to this post by Philip Reames-4
On 1/7/2015 7:19 PM, Philip Reames wrote:

>
> The approach I've been playing with would create this loop structure:
> goto inner
> while (true) {
>    outer:
>    helper();
>    inner:
>    while( some_condition )
>      //do lots of work here
>      if (!some_rare_condition) { <-- This is loop variant
>        goto outer;
>      }
>    }
> }
>

Anything that changes loop structure into something like this is a
fundamentally bad idea.  It would make the outer loop irreducible,
which should be avoided.

A much better approach would be to privatize the invariant objects and
write them back to memory before/after the "helper" is invoked:

     T1 = p->x;
     T2 = p->y;
     while( some_condition ) {
       //do lots of work here, use T1 and T2
       if (some_rare_condition) { <-- This is loop variant
         p->x = T1;
         p->y = T2;
         helper();
         T1 = p->x;
         T2 = p->y;
       }
     }
     p->x = T1;
     p->y = T2;


-Krzysztof


Re: Separating loop nests based on profile information?

Krzysztof Parzyszek
Upon closer inspection, the loop will not become irreducible.  However,
the result will be equivalent to reloading the hoisted values after
calling "helper".  In fact, that's what may happen if some CFG
optimization cleans this up.
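
That is, after LICM runs on the split nest and a CFG cleanup merges the
loops back together, I'd expect something equivalent to (pseudo-C, same
naming as before):

     T1 = p->x;
     T2 = p->y;
     while( some_condition ) {
       //do lots of work here, use T1 and T2
       if (some_rare_condition) {
         helper();
         T1 = p->x;   // reload; helper may have clobbered them
         T2 = p->y;
       }
     }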

-Krzysztof


On 1/12/2015 1:00 PM, Krzysztof Parzyszek wrote:

> [...]
> Anything that changes loop structure into something like this is a
> fundamentally bad idea.  It would make the outer loop irreducible,
> which should be avoided.
> [...]



Re: Separating loop nests based on profile information?

Daniel Berlin
In reply to this post by Philip Reames-4


On Mon, Jan 12, 2015 at 12:48 PM, Philip Reames <[hidden email]> wrote:

On 01/11/2015 10:17 PM, Daniel Berlin wrote:


On Thu, Jan 8, 2015 at 3:33 PM, Philip Reames <[hidden email]> wrote:

On 01/07/2015 05:33 PM, Chandler Carruth wrote:
How does this compare with classical approaches of loop peeling, partitioning, fission, or whatever you might call it?
I'm still developing a good sense for this, but let me lay out some observations.  Fair warning, this is going to be long and rambling.

Let's start with a toy example:
while(c) {
  x = this->x;
  y = this->y;
  if (x == y) {
    rare();
  }
}


 
If we could tell x and y were loop invariant, we could unswitch this loop.  However, the rare call clobbers our view of memory, so LICM fails, and thus unswitch fails.
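
For reference, the unswitched form I'd hope for looks roughly like this (pseudo-C; it assumes we could prove this->x and this->y invariant, which today we can't):

x = this->x;
y = this->y;
if (x == y) {
  while (c) {
    // rest of the loop body; every iteration takes the rare path
    rare();
  }
} else {
  while (c) {
    // rest of the loop body; no test, no call at all
  }
}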

 
We'd like to apply PRE here and push the reload into the loop preheader and the rare case.  This currently fails for two reasons: 1) We don't allow code duplication during PRE,

?????
If we don't, we aren't doing real PRE. So I'm not entirely sure what you mean here.
As I think you comment elsewhere in your response, we currently do not duplicate loads; we only move them.  The current structure is based around finding a legal point which doesn't introduce a load along any path where one didn't exist previously.  If we have a path which has two copies of the load, we try to find a place to move one of them such that all paths still have the available load and we've removed the extra load along the one path. 

(Note: The above explanation is rough and does not parallel how the code is organized.)




 
and 2) the canonical loop-simplify form of having a single latch means that PRE doesn't actually see the rare block at all; it sees the preheader and the join point after the if block.
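
Roughly, the shape PRE actually gets to look at after loop-simplify is (pseudo-C with labels, hand-drawn):

preheader:
  ;
header:
  if (!c) goto exit;
  x = this->x;
  y = this->y;
  if (x == y)
    rare();          // the rare block
latch:               // single latch: both paths join here first
  goto header;
exit:
  ;

All PRE reasons about is the preheader and that join; the rare block is no longer a latch of its own.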

 

I think both of the previous are fixable:


GCC's PRE already does the above.
It doesn't do profile guided duplication.
We aren't doing anything special with these blocks.

Here is the code I used to test:

struct a {
  int x;
  int y;
};
extern void rare();
int mai(int c, struct a *this)
{
  int d = 0;
  while (c) {
    int x = this->x;
    int y = this->y;
    d += x + y;
    if (x == y) {
      rare();
    }
  }
  return d;
}


It will do exactly what you expect; it is transformed into:

struct a {
  int x;
  int y;
};
extern void rare();
int mai(int c, struct a *this)
{
  int d = 0;
  int pretemp1 = this->x;
  int pretemp2 = this->y;

  while (c) {
    pretemp1phi = phi(rare-block pretemp1, preheader pretemp1);
    pretemp2phi = phi(rare-block pretemp2, preheader pretemp2);

    d += pretemp1phi + pretemp2phi;
    if (pretemp1phi == pretemp2phi) {   // was "x == y"; the loads are gone
      rare();
      pretemp1 = this->x;
      pretemp2 = this->y;
    }
  }
  return d;
}
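
(For anyone who wants to reproduce this: the result should be visible in the tree dumps with something like gcc -O2 -fdump-tree-pre-details; the exact dump flag spelling may vary by GCC version.)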
I don't see why profile guided duplication is necessary here. This is a basic load PRE case.  It is handled by the first version of GVN-based Load PRE I wrote for GCC.  It is always a win.
Well, to be pedantic, it is not always profitable.  If this loop runs exactly one iteration, this is both a code size pessimization and possibly a runtime execution pessimization. 

Sure, but it is a lifetime optimal PRE case :)
 
While in this particular case you probably don't care, I'd worry that an algorithm which gets this might also have other, more worrying pessimization cases.

Then you shouldn't use any basic PRE at all.
It will always make these decisions.  Profile guided PRE implementations are more complex.

 

However, I think I've actually come to the same underlying conclusion.  This is a case that our PRE algorithm should handle without needing to resort to profiling data. 

I have to admit that I am unfamiliar with the GCC implementation.  Could you outline the algorithm and its important concepts?  (links are fine too)

It uses an SCC-based value numbering implementation (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.1723), builds value expressions, then performs GVNPRE on those value expressions (https://www.cs.purdue.edu/homes/hosking/papers/cc04.pdf).



Looking at what LLVM does, the failure on the PRE side is that our PRE/GVN models are not strong enough to handle this. I'm not sure at all why we think anything else is necessary.  It's certainly not requiring special code duplication heuristics, etc.

So either you are thinking of a case that is different from the above, or I am seriously confused :)
Well, I think we've gotten to the same point here.  An improved PRE implementation should handle most of the cases I've actually encountered.  Having said that, I'm not yet willing to say that the profile guided loop splitting isn't *also* worth exploring.  From a practical perspective, a lot of our existing optimizations are structured on loops, and letting those kick in without falling back to (expensive) GVN might be worthwhile.  This is as much a pass ordering problem as anything else.

p.s. I've started to post patches which improve the results given by our current PRE to the llvm-commits list.  So far, it's just simple stuff, but that's the direction I'm currently working in.  I'm going to defer looking into the loop nest splitting until I've pushed PRE as far as I easily can.

I truly think you would be better off coming up with a real PRE implementation than trying to push the current one, but it's your time :)
The current PRE is an iterative PRE with a lot of limitations.
 

Philip


