Enabling the vectorizer for -Os


Nadav Rotem
Hi,

I would like to start a discussion about enabling the loop vectorizer by default for -Os. The loop vectorizer can accelerate many workloads and enabling it for -Os and -O2 has obvious performance benefits. At the same time the loop vectorizer can increase the code size for two reasons. First, to vectorize some loops we have to keep the original loop around in order to handle the last few iterations. Second, on x86 and possibly other targets, the encoding of vector instructions takes more space.

The loop vectorizer is already aware of the ‘optsize’ attribute and it does not vectorize loops which require that we keep the scalar tail. It also does not unroll loops when optimizing for size. It is not obvious but there are many cases in which this conservative kind of vectorization is profitable.  The loop vectorizer does not try to estimate the encoding size of instructions and this is one reason for code growth.
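To make the "scalar tail" concrete, here is a rough Python model of what a vectorized loop does and of the optsize rule described above. It is only a sketch and not LLVM code: the vectorization factor `VF`, the function names, and the convention that `trip_count=None` means "unknown at compile time" are all assumptions made for illustration.

```python
VF = 4  # assumed vectorization factor (lanes per vector operation)


def add_arrays_vectorized(a, b):
    """Model of a vectorized `out[i] = a[i] + b[i]` loop with a scalar tail."""
    n = len(a)
    out = [0] * n
    main_end = n - (n % VF)  # iterations covered by whole vectors

    # Vector body: each step stands in for one VF-wide vector instruction.
    for i in range(0, main_end, VF):
        out[i:i + VF] = [x + y for x, y in zip(a[i:i + VF], b[i:i + VF])]

    # Scalar tail: the original loop, kept for the last n % VF iterations.
    for i in range(main_end, n):
        out[i] = a[i] + b[i]
    return out


def needs_scalar_tail(trip_count):
    """A tail is needed unless the trip count is a known multiple of VF."""
    return trip_count is None or trip_count % VF != 0


def should_vectorize(trip_count, opt_for_size):
    """Under optsize, skip loops that would force us to keep the scalar tail."""
    if opt_for_size and needs_scalar_tail(trip_count):
        return False
    return True
```

In this model a loop with a known trip count of 16 is still vectorized when optimizing for size, while a loop with 10 iterations, or an unknown count, is left scalar because both copies of the loop would have to be emitted.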

I measured the effects of vectorization on performance and binary size using -Os. I measured the performance on a Sandy Bridge and compiled our test suite using -mavx -f(no-)vectorize -Os.  As you can see in the attached data there are many workloads that benefit from vectorization.  Not as much as vectorizing with -O3, but still a good number of programs.  At the same time the code growth is minimal.  Most workloads are unaffected and the total code growth for the entire test suite is 0.89%.  Almost all of the code growth comes from the TSVC test suite which contains a large number of large vectorizable loops.  I did not measure the compile time in this batch but I expect to see an increase in compile time in vectorizable loops because of the time we spend in codegen.

I am interested in hearing more opinions and discussing more measurements by other people.

Nadav


_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Attachments: VectorizationOsSize.pdf (90K), VectorizationOsPerf.pdf (80K)

Re: Enabling the vectorizer for -Os

Renato Golin-2
On 5 June 2013 04:26, Nadav Rotem <[hidden email]> wrote:
> I would like to start a discussion about enabling the loop vectorizer by default for -Os. The loop vectorizer can accelerate many workloads and enabling it for -Os and -O2 has obvious performance benefits.

Hi Nadav,

As it stands, O3 is very similar to O2, with a few more aggressive optimizations running, including the vectorizers. I think this is a good rationale: at O3, I expect the compiler to throw all it's got at the problem. O2 is somewhat more conservative, and people normally use it when they want more stability of the code and results (regarding FP, undefined behaviour, etc). I also use it for finding bugs in the compiler that are introduced by O3, and making the two levels more similar won't help that either. I have yet to see a good reason to enable the vectorizer by default at O2.

Code size is a different matter, though. I agree that vectorized code can be as small (if not smaller) than scalar code and much more efficient, so there is a clear win to make it on by default under those circumstances. But there are catches that we need to make sure are well understood before we do so.

 
> First, to vectorize some loops we have to keep the original loop around in order to handle the last few iterations.

Or when the runtime condition under which the loop can be vectorized does not hold, in which case you have to run the original.


> Second, on x86 and possibly other targets, the encoding of vector instructions takes more space.

This may be a problem, and maybe the solution is to build a "SizeCostTable" and do the same as we did for the CostTable. Most of the targets would just return 1, but some should override and guess. 

However, on ARM, NEON and VFP instructions are 32 bits wide (either one word or two half-words), but Thumb instructions can be 16-bit or 32-bit. So you have to model not just how big the vector instructions will be but also how big the scalar instructions would be, and not all Thumb instructions are the same size, which makes matters much harder.

In that sense, possibly the SizeCostTable would have to default to 2 (half-words) for most targets, and *also* manipulate scalar code, not just vector, in a special way.
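As a rough illustration of the idea, a "SizeCostTable" could be sketched as below. Everything here is hypothetical: the table layout, the function names, and the sizes (counted in half-words, with 2 as the default) are placeholders for illustration and not real encodings or an existing LLVM interface.

```python
# Hypothetical encoding sizes in half-words, keyed by (opcode, is_vector).
# Targets without an entry fall back to DEFAULT_SIZE for every instruction.
DEFAULT_SIZE = 2

SIZE_COST_TABLE = {
    "generic": {},
    "thumb2": {
        # Placeholder values: narrow scalar encodings, wide vector encodings.
        ("add", False): 1,
        ("load", False): 1,
        ("store", False): 1,
        ("add", True): 2,
        ("load", True): 2,
        ("store", True): 2,
    },
}


def instruction_size(target, opcode, is_vector):
    """Look up the size of one instruction, falling back to the default."""
    return SIZE_COST_TABLE.get(target, {}).get((opcode, is_vector), DEFAULT_SIZE)


def loop_body_size(target, opcodes, is_vector):
    """Sum the sizes of all instructions in a loop body."""
    return sum(instruction_size(target, op, is_vector) for op in opcodes)


def vectorizing_grows_body(target, opcodes):
    """True if the vector form of the body is larger than the scalar form."""
    scalar = loop_body_size(target, opcodes, is_vector=False)
    vector = loop_body_size(target, opcodes, is_vector=True)
    return vector > scalar
```

With these placeholder numbers a `load, load, add, store` body doubles in size on the Thumb-like target and is unchanged on the generic one, which is the asymmetry the table would have to capture for scalar as well as vector code.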


> I measured the effects of vectorization on performance and binary size using -Os. I measured the performance on a Sandybridge and compiled our test suite using -mavx -f(no-)vectorize -Os.  As you can see in the attached data there are many workloads that benefit from vectorization.  Not as much as vectorizing with -O3, but still a good number of programs.  At the same time the code growth is minimal.

Would be good to get performance improvements *and* size increase side-by-side in Perf.

Also, our test-suite is famous for being very noisy, so I'd run it at least 20x for each configuration and compare the averages (keeping an eye on the std. dev.) to check whether the results are meaningful.

Again, would be good to have that kind of analysis in Perf, and only warn if the increase/decrease is statistically meaningful.
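One simple way to do that kind of check is sketched below, using only the Python standard library. It computes Welch's t statistic for two sets of run times and flags a change only when the difference is large relative to the noise. The function names and the cut-off of 2.0 are assumptions for illustration; a real setup would more likely use a proper significance test.

```python
import math
import statistics


def welch_t(baseline, candidate):
    """Welch's t statistic for two independent samples of run times."""
    mean_b = statistics.mean(baseline)
    mean_c = statistics.mean(candidate)
    var_b = statistics.variance(baseline)
    var_c = statistics.variance(candidate)
    std_err = math.sqrt(var_b / len(baseline) + var_c / len(candidate))
    if std_err == 0:
        # No noise at all: any difference in the means is infinitely "significant".
        if mean_b == mean_c:
            return 0.0
        return math.copysign(math.inf, mean_c - mean_b)
    return (mean_c - mean_b) / std_err


def is_meaningful_change(baseline, candidate, threshold=2.0):
    """Warn only when the means differ by more than `threshold` standard errors."""
    return abs(welch_t(baseline, candidate)) > threshold
```

A negative statistic means the candidate runs are faster on average. Each sample needs at least two runs, since the variance is undefined otherwise.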


> Most workloads are unaffected and the total code growth for the entire test suite is 0.89%.  Almost all of the code growth comes from the TSVC test suite which contains a large number of large vectorizable loops.  I did not measure the compile time in this batch but I expect to see an increase in compile time in vectorizable loops because of the time we spend in codegen.

I was expecting small growth because of how conservative our vectorizer is. Less than 1% is acceptable, in my view. For ultimate code size, users should use -Oz, which should never have any vectorizer enabled by default anyway.

A few considerations on embedded systems:

* A 66% increase in size on an embedded system is not cheap. But LLVM hasn't been focusing on that use case so far, and we still have -Oz which does a pretty good job at compressing code (compared to -O3), so even if we do have existing embedded users shaving off bytes, the change in their build system would be minimal.
* Most embedded chips have no vector units, at most single-precision FP units or the like, so vectorization isn't going to be a hit for those architectures anyway.

So, in a nutshell, I agree that -Os could have the vectorizer enabled by default, but I'm yet to see a good reason to do that on -O2.

cheers,
--renato


Re: Enabling the vectorizer for -Os

David Tweed

On 5 June 2013 04:26, Nadav Rotem <[hidden email]> wrote:

I would like to start a discussion about enabling the loop vectorizer by default for -Os. The loop vectorizer can accelerate many workloads and enabling it for -Os and -O2 has obvious performance benefits.

 

Hi Nadav,

 

| As it stands, O2 is very similar to O3 with a few, more aggressive, optimizations running, including the vectorizers. I think this is a good rationale, at O3, I expect the compiler to throw all it's got at the problem. O2 is somewhat more conservative, and people normally use it when they want more stability of the code and results (regarding FP, undefined behaviour, etc). I also use it for finding bugs on the compiler that are introduced by O3, and making them more similar won't help that either. I'm yet to see a good reason to enable the vectorizer by default into O2.

 

Just to note that I think a lot of people used to the switches from gcc may be coming in with different "historical expectations". At least recently (at least the past 5 years), O2 has in practice been "optimizations that are straightforward enough that they do achieve speed-ups" while O3 tends to be "more aggressive optimizations which potentially could cause speed-ups, but where the compiler doesn't understand the context/trade-offs well enough, so they often don't result in a speed-up". (I've very rarely had O3 optimization, rather than some program-specific subset of the options, achieve any non-noise-level speed-up over O2 with gcc/g++.) I know it's been said that llvm/clang should aim for "validated" O2/O3 settings that actually do result in better performance, but then I imagine so did gcc... From what I've been seeing I haven't been seeing any instability of code or results from using the vectorizer. (Mind you, I deliberately try to write code to avoid letting chips with "80-bit intermediate floating point values" use them precisely because it can make things more vulnerable to minor compilation changes.)

 

Under that view, if the LLVM vectorizer was well enough understood I would think it would be good to include at O2. However, I suspect that the effects from having effectively two versions of each loop around are probably conflicting enough that it's a better decision to make O3 be the level at which it is blanket enabled.

 

Cheers,

Dave



Re: Enabling the vectorizer for -Os

Renato Golin-2
In reply to this post by Renato Golin-2
On 5 June 2013 11:59, David Tweed <[hidden email]> wrote:

> (I've very rarely had O3 optimization, rather than some program specific subset of the options, achieve any non-noise-level speed-up over O2  with gcc/g++.)


Hi David,

You surely remember this:


"We find that, while -O2 has a significant impact relative to -O1, the performance impact of -O3 over -O2 optimizations is indistinguishable from random noise."

> Under that view, if the LLVM vectorizer was well enough understood I would think it would be good to include at O2. However, I suspect that the effects from having effectively two versions of each loop around are probably conflicting enough that it's a better decision to make O3 be the level at which it is blanket enabled.


My view of O3 is that it *only* regards how aggressively you want to optimize your code. Some special cases are proven to run faster on O3, mostly benchmark improvements that feed compiler engineers, and on those grounds, O3 can be noticeable if you're allowed to be more aggressive than usual. This is why I mentioned FP-safety, undefined behaviour, vectorization, etc.

I don't expect O3 results to be faster than O2 results on average, but in specific cases where you know that the potential disaster is acceptable, it should be fine to use O3. Most people, though, use O3 (or O9!) in the expectation that this will always be better. It not being worse than O2 doesn't help, either. ;)

I don't think it's *wrong* to put auto-vec on O2, I just think it's not its place to be, that's all. The potential to change results is there.

cheers,
--renato


Re: Enabling the vectorizer for -Os

David Tweed

On 5 June 2013 11:59, David Tweed <[hidden email]> wrote:

> (I've very rarely had O3 optimization, rather than some program specific subset of the options, achieve any non-noise-level speed-up over O2  with gcc/g++.)

[snip]

> "We find that, while -O2 has a significant impact relative to -O1, the performance impact of -O3 over -O2 optimizations is indistinguishable from random noise."

 

That's something I remember well, but there's an obvious question lurking in there: is this because the transformations that apply at O3, while they count as "aggressive", never actually produce faster code, or are they things which are capable of optimizing when used in the right places _but we don't do well at deciding where that is_? I don't have any actual evidence, but I'm inclined towards thinking it's more likely to be the second (and occasionally, having looked at gcc assembly, it can be seen to have done things like loop unrolling in the places least likely to be profitable). So to simplify a lot, the difference between O2 and O3 (at least on gcc) might well be the difference between "guaranteed wins only" and "add some transforms whose optimization effects we don't predict well". At least from some mailing lists I've read, other people share that view of the optimization flags in practice, rather than one based on aggressiveness or stability. Maybe they shouldn't have this "interpretation" in LLVM/clang; I'm just pointing out what some people might expect from previous experience.

Under that view, if the LLVM vectorizer was well enough understood I would think it would be good to include at O2. However, I suspect that the effects from having effectively two versions of each loop around are probably conflicting enough that it's a better decision to make O3 be the level at which it is blanket enabled.

 

> My view of O3 is that it *only* regards how aggressive you want to optimize your code. Some special cases are proven to run faster on O3, mostly benchmarks improvements that feed compiler engineers, and on those grounds, O3 can be noticeable if you're allowed to be more aggressive than usual. This is why I mentioned FP-safety, undefined behaviour, vectorization, etc.

 

 

Again, I can see this as a logical position, I've just never actually encountered differences in FP-safety or undefined behaviour between O2 and O3. Likewise I haven't really seen any instability or undefined behaviour from the vectorizer. (Sorry if I'm sounding a bit pedantic; I've been doing a lot of performance testing/exploration recently so I've been knee deep in the difference between "I'm sure it must be the case that..." expectations and what experimentation reveals is actually happening.)

 

> I don't expect O3 results to be faster than O2 results on average, but on specific cases where you know that the potential disaster is acceptable, should be fine to assume O3. Most people, though, use O3 (or O9!) in the expectancy that this will be always better. It not being worse than O2 doesn't help, either. ;)

 

Again, my experience is that I haven't seen any "semantic" disasters from O3, just that it mostly doesn't help much, sometimes speeds execution up relative to O2, sometimes slows execution down relative to O2 and certainly increases compile time. It sounds like you've had a wilder ride than me and seen more cases where O3 has actually changed observable behaviour.

 

> I don't think it's *wrong* to put auto-vec on O2, I just think it's not its place to be, that's all. The potential to change results are there.

 

This is what I'd like to know about: what specific potential to change results have you seen in the vectorizer?

 

Cheers,

Dave



Re: Enabling the vectorizer for -Os

Renato Golin-2
In reply to this post by Renato Golin-2
On 5 June 2013 13:32, David Tweed <[hidden email]> wrote:

> This is what I'd like to know about: what specific potential to change results have you seen in the vectorizer?


No changes, just conceptual. AFAIK, the differences between the passes on O2 and O3 are minimal (looking at the code where this is chosen) and they don't seem to be particularly amazing to warrant their special place in On land.

If the argument for having auto-vec on O2 is that O3 makes no difference, then why have O3 in the first place? Why not make O3 an alias to O2 and solve all problems?

cheers,
--renato


Re: Enabling the vectorizer for -Os

David Tweed

On 5 June 2013 13:32, David Tweed <[hidden email]> wrote:

This is what I'd like to know about: what specific potential to change results have you seen in the vectorizer?

 

> No changes, just conceptual. AFAIK, the difference between the passes on O2 and O3 are minimal (looking at the code where this is chosen) and they don't seem to be particularly amazing to warrant their special place in On land.

> If the argument for having auto-vec on O2 is that O3 makes no difference, then why have O3 in the first place? Why not make O3 an alias to O2 and solve all problems?

 

I think I'm managing to express myself unclearly again :( For me the practical definition of "O2" is "do transformations which are pretty much guaranteed to actually be optimizations" rather than "do all optimizations which don't carry a risk of disaster". In which case the argument for or against vectorizing at O2 is whether it's "pretty much guaranteed to actually be an optimization" or not, rather than whether it's an aggressive optimization or not. I would say the argument for auto-vec on O2 isn't that O3 makes no difference; it's whether the intrinsic properties of the auto-vec pass fit the criteria which one uses for enabling passes at O2. I think you were suggesting that "aggressive" transforms don't belong in O2 and auto-vec is "aggressive", while I tend to think of simplicity/performance-reliability as the criteria for O2, and it's unclear if auto-vec fits that.

 

Cheers,

Dave



Re: Enabling the vectorizer for -Os

Reid Kleckner-2
In reply to this post by Nadav Rotem
On Tue, Jun 4, 2013 at 11:26 PM, Nadav Rotem <[hidden email]> wrote:
> The loop vectorizer is already aware of the ‘optsize’ attribute and it does not vectorize loops which require that we keep the scalar tail. It also does not unroll loops when optimizing for size. It is not obvious but there are many cases in which this conservative kind of vectorization is profitable.  The loop vectorizer does not try to estimate the encoding size of instructions and this is one reason for code growth.

Neat, I like this conservative approach to vectorization.  It seems like if it's good enough for -Os it should be good enough for -O2.  I thought the main objections against vectorization at -O2 centered around code bloat and regressions of hot but short loops.  If these heuristics address those concerns and compile time doesn't suffer too much, it seems reasonable to enable at -O2.

My poorly informed 2 cents. 


Re: Enabling the vectorizer for -Os

Nadav Rotem
In reply to this post by Renato Golin-2
Hi, 

Thanks for the feedback.  I think that we agree that vectorization on -Os can benefit many programs. Regarding -O2 vs -O3, maybe we should set a higher cost threshold for O2 to increase the likelihood of improving the performance?  We have very few regressions on -O3 as is and with better cost models I believe that we can bring them close to zero, so I am not sure if it can help that much.   Renato, I prefer not to estimate the encoding size of instructions. We know that vector instructions take more space to encode. Will knowing the exact number help us in making a better decision? I don’t think so. On modern processors when running vectorizable loops, the code size of the vector instructions is almost never the bottleneck.
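A minimal sketch of what a per-level cost threshold could look like is shown below. It is a toy model, not the vectorizer's actual cost model: the threshold values, the level names used as keys, and the function names are all assumptions made for illustration.

```python
# Hypothetical minimum estimated speedup required at each optimization level.
MIN_SPEEDUP = {"O2": 1.5, "O3": 1.0}


def estimated_speedup(scalar_cost, vector_cost, vf):
    """Cost of `vf` scalar iterations divided by the cost of one vector iteration."""
    return (scalar_cost * vf) / vector_cost


def should_vectorize_at(level, scalar_cost, vector_cost, vf):
    """Vectorize only if the estimated speedup clears the level's threshold."""
    return estimated_speedup(scalar_cost, vector_cost, vf) > MIN_SPEEDUP[level]
```

With these made-up numbers, a loop whose estimated speedup is 1.25x would be vectorized at O3 but not at O2, while a loop estimated at 2.5x would be vectorized at both levels.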

Thanks,
Nadav




Re: Enabling the vectorizer for -Os

Jeffrey Yasskin-2
On Wed, Jun 5, 2013 at 5:51 PM, Nadav Rotem <[hidden email]> wrote:

> Hi,
>
> Thanks for the feedback.  I think that we agree that vectorization on -Os
> can benefit many programs. Regarding -O2 vs -O3, maybe we should set a
> higher cost threshold for O2 to increase the likelihood of improving the
> performance ?  We have very few regressions on -O3 as is and with better
> cost models I believe that we can bring them close to zero, so I am not sure
> if it can help that much.   Renato, I prefer not to estimate the encoding
> size of instructions. We know that vector instructions take more space to
> encode. Will knowing the exact number help us in making a better decision ?
> I don’t think so. On modern processors when running vectorizable loops, the
> code size of the vector instructions is almost never the bottleneck.

You're talking about -Os, where the user has explicitly asked the
compiler to optimize the code size. Saying that the code size isn't a
speed bottleneck seems to miss the point.



Re: Enabling the vectorizer for -Os

Owen Anderson-2

On Jun 5, 2013, at 7:31 PM, Jeffrey Yasskin <[hidden email]> wrote:

> On Wed, Jun 5, 2013 at 5:51 PM, Nadav Rotem <[hidden email]> wrote:
>> Hi,
>>
>> Thanks for the feedback.  I think that we agree that vectorization on -Os
>> can benefit many programs. Regarding -O2 vs -O3, maybe we should set a
>> higher cost threshold for O2 to increase the likelihood of improving the
>> performance ?  We have very few regressions on -O3 as is and with better
>> cost models I believe that we can bring them close to zero, so I am not sure
>> if it can help that much.   Renato, I prefer not to estimate the encoding
>> size of instructions. We know that vector instructions take more space to
>> encode. Will knowing the exact number help us in making a better decision ?
>> I don’t think so. On modern processors when running vectorizable loops, the
>> code size of the vector instructions is almost never the bottleneck.
>
> You're talking about -Os, where the user has explicitly asked the
> compiler to optimize the code size. Saying that the code size isn't a
> speed bottleneck seems to miss the point.

I'm not sure that's a fair characterization.  In Xcode, for example, -Os is the default setting.
My understanding is that -Os is intended to be optimized-without-sacrificing-code-size.  -Oz is where we've been explicitly mandated to prefer code size to all else.

--Owen

Re: Enabling the vectorizer for -Os

David Tweed
In reply to this post by Jeffrey Yasskin-2
On Wed, Jun 5, 2013 at 5:51 PM, Nadav Rotem <[hidden email]> wrote:
> Hi,
>
> Thanks for the feedback.  I think that we agree that vectorization on -Os
> can benefit many programs. Regarding -O2 vs -O3, maybe we should set a
> higher cost threshold for O2 to increase the likelihood of improving the
> performance ?  We have very few regressions on -O3 as is and with better
> cost models I believe that we can bring them close to zero, so I am not sure
> if it can help that much.   Renato, I prefer not to estimate the encoding
> size of instructions. We know that vector instructions take more space to
> encode. Will knowing the exact number help us in making a better decision?
> I don't think so. On modern processors when running vectorizable loops, the
> code size of the vector instructions is almost never the bottleneck.

| You're talking about -Os, where the user has explicitly asked the
| compiler to optimize the code size. Saying that the code size isn't a
| speed bottleneck seems to miss the point.

Just to check: reading Nadav's original paragraph, he appears to be talking
about O2 at this point, where the user (in my understanding) only cares
about size indirectly in terms of if it affects performance. Now having said
that I don't actually have a feeling for whether vectorizable code size
affects performance noticeably or not. My suspicion is that in C-family like
languages there's so much other faffing around instructions that any change
is probably lost in the noise. However, for LLVM IR generated directly it
might be noticeable; I really don't know.

But if it's a "performance reliable" optimization (as it seems to be) then I
think there's a good case for putting vectorization into the O2 opts.

Cheers,
Dave

Re: Enabling the vectorizer for -Os

Chandler Carruth-2
In reply to this post by Nadav Rotem
On Wed, Jun 5, 2013 at 5:51 PM, Nadav Rotem <[hidden email]> wrote:
Hi, 

> Thanks for the feedback.  I think that we agree that vectorization on -Os can benefit many programs.

FWIW, I don't yet agree.

Your tables show many programs growing in code size by over 20%. While there are associated performance improvements, it isn't clear that this is a good tradeoff. Historically, optimizations which optimize as a direct result of growing code size have *not* been an acceptable tradeoff in -Os.
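The 0.89% total quoted earlier in the thread and the per-program growth of over 20% are not in conflict, because a suite-wide total is dominated by the largest binaries. The sketch below shows the arithmetic; the program names and byte counts are made up purely for illustration and are not taken from the attached tables.

```python
def growth_pct(before, after):
    """Percentage growth from `before` bytes to `after` bytes."""
    return 100.0 * (after - before) / before


def summarize(sizes):
    """`sizes` maps program name -> (bytes before, bytes after).

    Returns the per-program growth and the growth of the summed sizes.
    """
    per_program = {name: growth_pct(b, a) for name, (b, a) in sizes.items()}
    total_before = sum(b for b, _ in sizes.values())
    total_after = sum(a for _, a in sizes.values())
    return per_program, growth_pct(total_before, total_after)


# Made-up sizes, chosen only to show the effect.
example = {
    "small_kernel": (1_000, 1_250),       # grows by a quarter
    "large_app": (1_000_000, 1_000_000),  # unchanged
}
per_program, total = summarize(example)
```

Here one small program grows by a quarter while the total moves by a few hundredths of a percent, which is why looking at both the per-program and the aggregate numbers matters when judging the tradeoff.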

From Owen's email, a characterization I agree with:
"My understanding is that -Os is intended to be optimized-without-sacrificing-code-size."

The way I would phrase the difference between -Os and -Oz is similar: with -Os we don't *grow* the code size significantly even if it gives significant performance gains, whereas with -Oz we *shrink* the code size even if it means significant performance loss.

Neither of these concepts for -Os would seem to argue for running the vectorizer given the numbers you posted.
 
> Regarding -O2 vs -O3, maybe we should set a higher cost threshold for O2 to increase the likelihood of improving the performance ?  We have very few regressions on -O3 as is and with better cost models I believe that we can bring them close to zero, so I am not sure if it can help that much.   Renato, I prefer not to estimate the encoding size of instructions. We know that vector instructions take more space to encode. Will knowing the exact number help us in making a better decision ? I don’t think so. On modern processors when running vectorizable loops, the code size of the vector instructions is almost never the bottleneck.

That has specifically not been my experience when dealing with significantly larger and more complex application benchmarks.

The tradeoffs you show in your numbers for -Os are actually exactly what I would expect for -O2: a willingness to grow code size (and compilation time) in order to get performance improvements. A quick eye-balling of the two tables seemed to show most of the size growth had associated performance growth. This, to me, is a good early indicator that the mode the vectorizer is running in for your -Os numbers is what we should look at enabling for -O2.

That said, I would like to see benchmarks from a more diverse set of applications than the nightly test suite. ;] I don't have a lot of faith in it being representative. I'm willing to contribute some that I care about (given enough time to collect the data), but I'd really like for other folks with larger codebases and applications to measure code size and performance artifacts as well.

In order to do this, and ensure we are all measuring the same thing, I think it would be useful to have in Clang flag sets that correspond to the various modes you are proposing. I think they are:

1) -Os + minimal-vectorize (no unrolling, etc)
2) -O2 + minimal-vectorize
3) -O2 + -fvectorize (I think? maybe you have a more specific flag here?)

Does that make sense to you and others?
-Chandler

PS: As a side note, I would personally really like to revisit my proposal to write down what we mean for each optimization level as precisely as we can (acknowledging that this is not very precise; it will always be a judgement call). I think it would help these discussions stay on track.

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Re: Enabling the vectorizer for -Os

David Tweed

| PS: As a side note, I would personally really like to revisit my proposal to write down what we mean for each optimization level as precisely as we can (acknowledging that this is not very precise; it will always be a judgement call). I think it would help these discussions stay on track.

I think that would be good; I had the points from the previous email discussion you initiated in mind when interpreting O2 vs O3. The one big thing that I think isn't borne in mind quite enough is that we need not only a fairly clear idea of what ought to be at each level, but also, when assigning things, to run "tests" (even throwaway, informal ones) to verify that actual real-world behaviour matches the intuition. For example, I don't think the "intent" of gcc's levels is actually that different from what you proposed; the issue is that the flags actually assigned don't result in behaviour that matches the intent. (I still never cease to be amazed when I write a simple "pseudo-benchmark" to generate some numbers to calibrate a performance issue, run it on an actual machine, and the actual performance is the reverse of what I'd expect.)

Cheers,

Dave



Re: Enabling the vectorizer for -Os

Kristof Beyls
In reply to this post by Chandler Carruth-2

+1, having (even a vague) agreed definition of what the different optimization levels mean would be good.

 

From: [hidden email] [mailto:[hidden email]] On Behalf Of Chandler Carruth
Sent: 06 June 2013 10:07
To: Nadav Rotem
Cc: Dev
Subject: Re: [LLVMdev] Enabling the vectorizer for -Os

 

PS: As a side note, I would personally really like to revisit my proposal to write down what we mean for each optimization level as precisely as we can (acknowledging that this is not very precise; it will always be a judgement call). I think it would help these discussions stay on track.



Re: Enabling the vectorizer for -Os

Renato Golin-2
In reply to this post by Chandler Carruth-2
On 6 June 2013 10:57, Kristof Beyls <[hidden email]> wrote:

+1, having (even a vague) agreed definition of what the different optimization levels mean would be good.


+1 too. Most of the discussion on this thread is related to the uncertainties in that definition.

cheers,
--renato


Re: Enabling the vectorizer for -Os

JF Bastien
In reply to this post by Nadav Rotem
Will knowing the exact number help us in making a better decision? I don’t think so. On modern processors when running vectorizable loops, the code size of the vector instructions is almost never the bottleneck.

I'd make a slightly different point: being able to estimate the number of UOPs will make a big difference if it allows you to fit your loop in the loop stream detector.

So I'd agree that estimating x86 encoded code size doesn't matter that much for performance (though I$ pressure is a big issue for many codebases, I assume you're talking about tight vectorizable kernels), but estimating UOPs does matter a great deal.
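
As a rough sketch of the kind of check I mean (not an existing LLVM cost-model API): given an estimated UOP count per iteration and an unroll factor, test whether the resulting loop body still fits in the loop buffer. The function name is hypothetical, and `capacity` is deliberately a parameter, since the real figure depends on the microarchitecture and would have to come from the vendor's optimization manual.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: returns true if a loop body of `uops_per_iter`
 * micro-ops, unrolled `unroll` times, fits in a loop buffer that holds
 * `capacity` micro-ops. The multiplication is done in a wider type so
 * large inputs cannot wrap around. */
bool fits_in_loop_buffer(unsigned uops_per_iter, unsigned unroll,
                         unsigned capacity) {
    assert(unroll > 0);
    return (unsigned long long)uops_per_iter * unroll <= capacity;
}
```

A cost model could use a check like this to cap the unroll factor, so that a loop which fits when unrolled four times is not pushed out of the buffer by unrolling five times.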


Re: Enabling the vectorizer for -Os

Nadav Rotem
In reply to this post by Chandler Carruth-2
Hi Chandler, 

FWIW, I don't yet agree.

Your tables show many programs growing in code size by over 20%. While there are associated performance improvements, it isn't clear that this is a good tradeoff. Historically, optimizations which optimize as a direct result of growing code size have *not* been an acceptable tradeoff in -Os.

I am glad that you mentioned it. There are only three benchmarks that grew by over 1%, and only one that grew by over 20%: the TSVC workloads. TSVC is an HPC benchmark and it is irrelevant for the -Os/-O2 discussion. If you ignore TSVC you will notice that the code growth due to vectorization is 0.01%.

From Owen's email, a characterization I agree with:
"My understanding is that -Os is intended to be optimized-without-sacrificing-code-size."

The way I would phrase the difference between -Os and -Oz is similar: with -Os we don't *grow* the code size significantly even if it gives significant performance gains, whereas with -Oz we *shrink* the code size even if it means significant performance loss.

Neither of these concepts for -Os would seem to argue for running the vectorizer given the numbers you posted.

0.01% code growth for everything except TSVC sounds pretty good to me.  I would be willing to accept 0.01% code growth to gain 2% on gzip and 9% on RC4. 


Regarding -O2 vs -O3, maybe we should set a higher cost threshold for O2 to increase the likelihood of improving the performance? We have very few regressions on -O3 as is and with better cost models I believe that we can bring them close to zero, so I am not sure if it can help that much. Renato, I prefer not to estimate the encoding size of instructions. We know that vector instructions take more space to encode. Will knowing the exact number help us in making a better decision? I don’t think so. On modern processors when running vectorizable loops, the code size of the vector instructions is almost never the bottleneck.

That has specifically not been my experience when dealing with significantly larger and more complex application benchmarks.

I am constantly benchmarking the compiler and I am aware of a small number of regressions on -O3 using the vectorizer.   If you have a different experience then please share your numbers.   


The tradeoffs you show in your numbers for -Os are actually exactly what I would expect for -O2: a willingness to grow code size (and compilation time) in order to get performance improvements. A quick eye-balling of the two tables seemed to show most of the size growth had associated performance growth. This, to me, is a good early indicator that the mode the vectorizer is running in for your -Os numbers is what we should look at enabling for -O2.

That said, I would like to see benchmarks from a more diverse set of applications than the nightly test suite. ;] I don't have a lot of faith in it being representative. I'm willing to contribute some that I care about (given enough time to collect the data), but I'd really like for other folks with larger codebases and applications to measure code size and performance artifacts as well.

I am looking forward to seeing your contributions to the nightly test suite.  I would also like to see other people benchmark their applications. 

Thanks,
Nadav



Re: Enabling the vectorizer for -Os

Renato Golin-2
On 6 June 2013 17:08, Nadav Rotem <[hidden email]> wrote:
0.01% code growth for everything except TSVC sounds pretty good to me.  I would be willing to accept 0.01% code growth to gain 2% on gzip and 9% on RC4. 

The test-suite is not a good representation of the general case, but I don't think we can reason based on unknown results.

Chandler, can you enable the vectorizer on the examples you gave and produce a simple size x performance increase?

cheers,
--renato





Re: Enabling the vectorizer for -Os

Priyendra Deshwal
In reply to this post by Nadav Rotem
These look like really awesome results :)

I am using clang/LLVM to JIT some code and intuitively our workloads should benefit a lot from vectorization. Is there a way to apply this optimization to JIT-generated code?

Regards,
-- Priyendra



On Tue, Jun 4, 2013 at 8:26 PM, Nadav Rotem <[hidden email]> wrote:
Hi,

I would like to start a discussion about enabling the loop vectorizer by default for -Os. The loop vectorizer can accelerate many workloads and enabling it for -Os and -O2 has obvious performance benefits. At the same time the loop vectorizer can increase the code size because of two reasons. First, to vectorize some loops we have to keep the original loop around in order to handle the last few iterations.  Second, on x86 and possibly other targets, the encoding of vector instructions takes more space.

The loop vectorizer is already aware of the ‘optsize’ attribute and it does not vectorize loops which require that we keep the scalar tail. It also does not unroll loops when optimizing for size. It is not obvious but there are many cases in which this conservative kind of vectorization is profitable.  The loop vectorizer does not try to estimate the encoding size of instructions and this is one reason for code growth.

I measured the effects of vectorization on performance and binary size using -Os. I measured the performance on a Sandybridge and compiled our test suite using -mavx -f(no)-vectorize -Os.  As you can see in the attached data there are many workloads that benefit from vectorization.  Not as much as vectorizing with -O3, but still a good number of programs.  At the same time the code growth is minimal.  Most workloads are unaffected and the total code growth for the entire test suite is 0.89%.  Almost all of the code growth comes from the TSVC test suite which contains a large number of large vectorizable loops.  I did not measure the compile time in this batch but I expect to see an increase in compile time in vectorizable loops because of the time we spend in codegen.

I am interested in hearing more opinions and discussing more measurements by other people.

Nadav

