[llvm-dev] PGO is ineffective for Rust - but why?

[llvm-dev] PGO is ineffective for Rust - but why?

Michael Woerister via llvm-dev
Hi everyone,

As part of my work for Mozilla's Low Level Tools team I've
implemented PGO in the Rust compiler. The feature has been
available since Rust 1.37 [1]. However, so far we have not
seen any actual performance gains from enabling PGO for
Rust code. Performance even seems to drop 1-3% with PGO
enabled. I wonder why that is and I'm hoping that someone
here might have experience debugging PGO effectiveness.


PGO in the Rust compiler
------------------------

The Rust compiler uses IR-level instrumentation (the
equivalent of Clang's `-fprofile-generate`/`-fprofile-use`).
This has worked pretty well and even enables doing PGO for
mixed Rust/C++ codebases when also using Clang.
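
For readers who haven't used the feature yet, the overall workflow
looks roughly like this (a sketch; see [1] and the rustc documentation
for the exact flags, and treat paths and file names as placeholders):

    # build an instrumented binary
    rustc -O -Cprofile-generate=/tmp/pgo-data main.rs
    # run it on representative workloads to fill the counters
    ./main
    # merge the raw profiles with LLVM's llvm-profdata tool
    llvm-profdata merge -o merged.profdata /tmp/pgo-data/*.profraw
    # rebuild, feeding the merged profile back into the compiler
    rustc -O -Cprofile-use=merged.profdata main.rs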

The Rust compiler has regression tests that make sure that:

- instrumentation shows up in LLVM IR for the `generate` phase,
  and that

- profiling data is actually used during the `use` phase, i.e.
  that cold functions get marked with `cold` and hot functions
  get marked with `inline`.

I also verified manually that `branch_weights` are being set
in IR. So, from my perspective, the PGO implementation does
what it is supposed to do.
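
For context, the metadata in question looks something like this in the
IR (an illustrative fragment only; the labels and counts are made up):

    br i1 %cond, label %hot_path, label %cold_path, !prof !0
    ...
    !0 = !{!"branch_weights", i32 1900, i32 100}   ; taken vs. not taken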

However, as already mentioned, in all benchmarks I've seen so
far, performance stays the same at best and often even suffers
slightly. This is surprising because for C++ code, Clang's
version of IR-level instrumentation and PGO brings significant
gains (up to 5-10% in the Firefox benchmarks I've seen).

One thing we noticed early on is that disabling the
pre-inlining pass (`-disable-preinline`) seems to consistently
improve the situation for Rust code. Doing that, we sometimes
see performance wins of almost 1% over not using PGO. This is
again very different from C++, where disabling this pass causes
dramatic performance losses in the Firefox benchmarks. And a 1%
improvement is still well below expectations, I think.

So my questions to you are:

- Has anybody here observed something similar while
  working on or with PGO?

- Are there certain known characteristics of LLVM IR code
  that inhibit PGO's effectiveness and that IR produced by
  `rustc` might exhibit?

- Does anybody know of a good source that describes how to
  effectively debug a problem like this?

- Does anybody know of a small example program in C/C++
  that is known to profit from PGO and that could be
  re-implemented in Rust for comparison?

Thanks a lot for reading! Any help is appreciated.

-Michael

[1] https://blog.rust-lang.org/2019/08/15/Rust-1.37.0.html#profile-guided-optimization

Re: [llvm-dev] PGO is ineffective for Rust - but why?

Teresa Johnson via llvm-dev
I just have a couple of suggestions off the top of my head:
- Have you tried using the new pass manager (-fexperimental-new-pass-manager)? That has access to additional analysis info during inlining and is able to make more precise PGO-based inline decisions.
- Have you tried collecting profile data with and without PGO to see if you can compare where cycles are being spent? That's my usual way of debugging performance differences related to inlining or profile changes. (A sketch of one way to do this follows after this list.)
- Just a comment that it is odd you are getting better performance without the pre-inlining, which typically helps because you get better context-sensitive profile info. Maybe sanity-check that pre-inlining is kicking in for both the profile-gen and profile-use passes?
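
One possible way to do such a comparison on Linux, assuming perf is available (binary names are placeholders):

    perf record -o nopgo.data ./benchmark        # non-PGO build
    perf record -o pgo.data ./benchmark-pgo      # PGO build
    perf diff nopgo.data pgo.data                # compare hot symbols between the two runs
    perf report -i pgo.data                      # drill into a single profile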

Teresa


Re: [llvm-dev] PGO is ineffective for Rust - but why?

Teresa Johnson via llvm-dev


On Thu, Sep 12, 2019 at 8:18 AM Teresa Johnson <[hidden email]> wrote:
I just have a couple suggestions off the top of my head:
- have you tried using the new pass manager (-fexperimental-new-pass-manager)? That has access to additional analysis info during inlining and is able to make more precise PGO based inline decisions.
 
(Although note that the above shouldn't make the difference between no performance gain and a typical PGO performance boost.)

Another thing I just thought of: are you using -ffunction-sections and -fdata-sections? These allow for PGO-based function layout in the linker (assuming you are using lld or gold).
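
For the Clang/C++ side, the combination described above would look roughly like this (a sketch; the profile and file names are placeholders, and -fuse-ld=gold works the same way):

    clang++ -O2 -fprofile-use=merged.profdata \
            -ffunction-sections -fdata-sections \
            -fuse-ld=lld -o app main.cpp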


Re: [llvm-dev] PGO is ineffective for Rust - but why?

Xinliang David Li via llvm-dev
In reply to this post by Michael Woerister via llvm-dev
A couple of things to look at:

1) Do you see any profile mismatch warnings?
2) Use the following options to dump text output of branch probabilities with source information, and sanity-check whether they look reasonable:

     -Rpass=pgo-instrumentation -mllvm -pgo-emit-branch-prob

3) Does Rust code have lots of indirect calls? Use the option -Rpass=pgo-icall-prom to see if any indirect call promotions are happening.

4) Collect perf stat data about taken branches (see the sketch after this list). With PGO, the count should be much smaller; otherwise, the block layout is not using any profile data.

5) Use llvm-profdata to dump the profile. What does it look like?

   a) llvm-profdata show --detailed-summary    ...
   b) llvm-profdata show --topn=100 ...
   c) llvm-profdata show --all-functions --ic-targets  ...
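
   For point 4, a minimal way to gather such numbers with Linux perf might be the following (event names vary by CPU; the program name is a placeholder):

     perf stat -e instructions,branches,branch-misses ./test_program
     # on Intel CPUs, a taken-branches event such as
     # br_inst_retired.near_taken can be used instead, if `perf list`
     # shows it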


David






Re: [llvm-dev] PGO is ineffective for Rust - but why?

Philip Reames via llvm-dev
In reply to this post by Michael Woerister via llvm-dev

On 9/12/19 2:18 AM, Michael Woerister via llvm-dev wrote:

> I also verified manually that `branch_weights` are being set
> in IR. So, from my perspective, the PGO implementation does
> what it is supposed to do.

One thing missing here is profile guided devirtualization. That's super
significant for Java; it might be highly relevant for Rust as well.

However, I'd still expect to see *some* positive delta with what you've
got, so don't start here.  Your immediate problem is likely something else.

> - Are there certain known characteristics of LLVM IR code
>    that inhibit PGO's effectiveness and that IR produced by
>    `rustc` might exhibit?
Have you checked to make sure *all* of your branches have weights?
Including the ones which don't directly correspond to Rust
conditionals? If you left off branch weights from range checks or
something (i.e. something with a ton of occurrences), that might be
confusing the heuristics enough to explain your results.


Re: [llvm-dev] PGO is ineffective for Rust - but why?

Michael Woerister via llvm-dev
In reply to this post by Teresa Johnson via llvm-dev
Thank you all a lot, Teresa, David, and Philip!

This is giving me quite a to-do list of things to check and try out. I'll report back here when I have some findings.


Re: [llvm-dev] PGO is ineffective for Rust - but why?

Michael Woerister via llvm-dev
In reply to this post by Teresa Johnson via llvm-dev
So one interesting observation has already come out of this: I
confirmed that `rustc` indeed uses `-ffunction-sections` and
`-fdata-sections` on all platforms except for macOS. When trying out
different linkers for a small test case [1], however, I found that
there were rather large differences in execution time:

ld (no PGO) = 172 ms
ld (PGO) = 196 ms

gold (no PGO) = 182 ms
gold (PGO) = 141 ms

lld (no PGO) = 193 ms
lld (PGO) = 171 ms

So `gold` and `lld` both profit from PGO quite a bit, while
`ld`-linked programs are slower with PGO. I then noticed that
branch weights were missing from most branches in the `ld` build,
while the counts for the other linkers are correct. All of this
suggests to me that something goes wrong when `ld` tries to link
in the profiling runtime.

I'll be investigating further.

[1] https://github.com/michaelwoerister/rust-pgo-test-programs/tree/master/branch_weights
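
(For anyone reproducing this: one way to switch linkers in such an
experiment, assuming the default setup where rustc drives the link
through the system C compiler, is to pass -fuse-ld via -C link-arg.
This is a sketch, not necessarily how the measurements above were
taken.)

    rustc -O -Clink-arg=-fuse-ld=gold main.rs   # link with gold
    rustc -O -Clink-arg=-fuse-ld=lld main.rs    # link with lld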



Re: [llvm-dev] PGO is ineffective for Rust - but why?

Teresa Johnson via llvm-dev
Interesting. By ld do you mean GNU ld? I know GNU ld does "work" with LLVM's gold plugin, but it's an untested combination and not recommended. I wouldn't be surprised if there were some issues around it not passing necessary info to the gold plugin.

Teresa


Re: [llvm-dev] PGO is ineffective for Rust - but why?

Xinliang David Li via llvm-dev
In reply to this post by Michael Woerister via llvm-dev
Can you clarify whether the performance difference is caused by using different linkers for the instrumentation build? If that is the case, try dumping the sections of the resulting binary and comparing the __llvm_prf_** sections. Also check the arguments passed to the linker: it should include -u__llvm_profile_runtime to force the profile runtime to be linked in.

David


Re: [llvm-dev] PGO is ineffective for Rust - but why?

Michael Woerister via llvm-dev
In reply to this post by Teresa Johnson via llvm-dev

> By ld do you mean GNU ld?

Yes, GNU ld version 2.31.1 on Fedora 30.

> I know GNU ld does "work" with LLVM's gold plugin, but it's an untested combination and not recommended.

That's good to know! However, in this case no linker plugin is involved. All of LLVM is executed within the Rust compiler and the linker only ever gets to see regular object files.


Re: [llvm-dev] PGO is ineffective for Rust - but why?

Michael Woerister via llvm-dev
In reply to this post by Xinliang David Li via llvm-dev
> Can you clarify if performance difference is caused by using different linkers at instrumentation build?

Yes, good observation! Whether the bug occurs depends only on the
linker used for creating the instrumented binary. The linker used
during the "use" phase makes no difference.

> If that is the case, try dump the sections of the resulting binary and compare __llvm_prf_** sections.

For the final instrumented executable, it looks like the
`__llvm_prf_data` section is 480 bytes large when using GNU ld, while
it is 528 bytes for gold and lld. The size difference (48 bytes)
incidentally is exactly the size of the `__llvm_prf_data` section in
the object file containing the code that is later missing branch
weights. It looks like the GNU linker loses the `__llvm_prf_data`
section from that object file?

> Also check the arguments passed to the linker. It should have -u__llvm_profile_runtime to force the profile runtime to be linked in.

`-u__llvm_profile_runtime` is properly passed to the linker,
regardless of which linker it is.
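
One way to inspect those sections with binutils (the binary and object
file names are placeholders):

    readelf -S -W instrumented_binary | grep llvm_prf
    size -A -d suspect_object.o | grep llvm_prf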


Re: [llvm-dev] PGO is ineffective for Rust - but why?

Michael Woerister via llvm-dev
Interestingly, a C version of the same test program [1] compiled with
Clang 8 does not have any problems with GNU ld: The `__llvm_prf_data`
section is the same size for all three linkers. It must be something
specific to the Rust compiler that's going wrong here.

[1] https://github.com/michaelwoerister/rust-pgo-test-programs/tree/master/cpp_branch_weights

On Tue, Sep 17, 2019 at 3:26 PM Michael Woerister
<[hidden email]> wrote:

>
> > Can you clarify if performance difference is caused by using different linkers at instrumentation build?
>
> Yes, good observation! Whether the bug occurs depends only on the
> linker used for creating the instrumented binary. The linker used
> during the "use" phase makes no difference.
>
> > If that is the case, try dump the sections of the resulting binary and compare __llvm_prf_** sections.
>
> For the final instrumented executable, it looks like the
> `__llvm_prf_data` section is 480 bytes large when using GNU ld, while
> it is 528 bytes for gold and lld. The size difference (48 bytes)
> incidentally is exactly the size of the `__llvm_prf_data` section in
> the object file containing the code that is later missing branch
> weights. It looks like the GNU linker loses the `__llvm_prf_data`
> section from that object file?
>
> > Also check the arguments passed to the linker. It should have -u__llvm_profile_runtime to force the profile runtime to be linked in.
>
> `-u__llvm_profile_runtime` is properly passed to the linker,
> regardless of which linker it is.

Re: [llvm-dev] PGO is ineffective for Rust - but why?

Jeremy Morse via llvm-dev
In reply to this post by Jeremy Morse via llvm-dev


On Tue, Sep 17, 2019 at 6:25 AM Michael Woerister <[hidden email]> wrote:

> > By ld do you mean GNU ld?
>
> Yes, GNU ld version 2.31.1 on Fedora 30.
>
> > I know GNU ld does "work" with LLVM's gold plugin, but it's an untested combination and not recommended.
>
> That's good to know! However, in this case no linker plugin is involved. All of LLVM is executed within the Rust compiler and the linker only ever gets to see regular object files.

Ugh, I was confusing your PGO issue with an LTO issue - there is no plugin involved in non-LTO! And GNU ld should be fine with regular obj files produced by LLVM. Sorry for the confusion!

It sounds like David had the right intuition on what might be going on; I'll let him follow up with you on that, as he has a better understanding of the instrumentation side.

Teresa



Re: [llvm-dev] PGO is ineffective for Rust - but why?

Jeremy Morse via llvm-dev
In reply to this post by Jeremy Morse via llvm-dev
You can check for differences in the input arguments and object files passed to the linker.

Regarding GNU ld, it is possible that it triggers another bug related to the start section and garbage collection. A previous bug is here: https://bugs.llvm.org/show_bug.cgi?id=25286


Re: [llvm-dev] PGO is ineffective for Rust - but why?

Jeremy Morse via llvm-dev
In reply to this post by Jeremy Morse via llvm-dev


On Thu, Sep 12, 2019 at 2:18 AM Michael Woerister via llvm-dev <[hidden email]> wrote:
> I also verified manually that `branch_weights` are being set
> in IR. So, from my perspective, the PGO implementation does
> what it is supposed to do.

Are the 'function_entry_count' and the 'ProfileSummary' metadata included in the IR? I think some PGO passes may expect them to trigger.



Re: [llvm-dev] PGO is ineffective for Rust - but why?

Jeremy Morse via llvm-dev
> Are the 'function_entry_count' and the 'ProfileSummary' metadata included in the IR? I think some PGO passes may expect them to trigger.

Yes, they are present in the IR generated during the `profile-use`
phase. For my test case you can take a look at the IR generated here:

generate-phase:
https://github.com/michaelwoerister/rust-pgo-test-programs/blob/master/branch_weights/outputs/opt_lib_gen.lld.ll
use-phase: https://github.com/michaelwoerister/rust-pgo-test-programs/blob/master/branch_weights/outputs/opt_lib_use.lld.ll
source code: https://github.com/michaelwoerister/rust-pgo-test-programs/blob/master/branch_weights/opt_lib.rs

However, when using the right linker and thus not running into the GNU
ld bug mentioned earlier, I'm seeing proper speedups with PGO for this
test case. For other (larger) test cases I still don't see speedups, so
I'll need to take a closer look at those.


Re: [llvm-dev] PGO is ineffective for Rust - but why?

Jeremy Morse via llvm-dev
In reply to this post by Jeremy Morse via llvm-dev
Thank you, Teresa!

Down the road we definitely will want to combine PGO with ThinLTO.




Re: [llvm-dev] PGO is ineffective for Rust - but why?

Jeremy Morse via llvm-dev
In reply to this post by Jeremy Morse via llvm-dev
To give a little update here:

- I've been further investigating and found an issue [1] with the
Cargo build tool that most Rust projects use. This issue prevents all
projects using Cargo from properly using PGO because it causes symbol
names to differ between the generate and the use phase. With this
issue fixed, the number of "No profile data available for
function" warnings goes down from 92369 to 1167 for the Firefox
codebase.

- I also found that the potential GNU ld bug mentioned above
apparently does not affect Firefox. The number of "No profile data
available for function" warnings is exactly the same for GNU ld and
LLD. I don't know yet where the remaining 1167 warnings come from
though.

- Unfortunately, even with all of the above fixes applied, my
medium-sized benchmark still performs worse with PGO than without it.
For my tiny example [2] PGO reduces the number of branch misses by
more than 50% (the general shape of such a branch is sketched right
after this list). For the medium-sized benchmark, however, the PGO
version has slightly *more* branch misses. This seems to indicate
that there is still something wrong.
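
As an illustration of the kind of branch involved, here is a minimal,
hypothetical sketch (not the actual test program from [2]): the error
arm is never taken during the instrumented training run, so the
`profile-use` build should attach `branch_weights` that mark it as
cold and keep the hot arm well predicted.

    // Minimal, hypothetical sketch: one hot match arm and one error arm
    // that the training input never exercises.
    fn parse_len(input: &str) -> Result<usize, std::num::ParseIntError> {
        input.trim().parse::<usize>()
    }

    fn main() {
        let inputs = ["1", "2", "3"];
        let mut total = 0usize;
        for i in 0..1_000_000 {
            match parse_len(inputs[i % inputs.len()]) {
                // Hot arm: taken on every iteration of the training run.
                Ok(n) => total += n,
                // Cold arm: never reached with this training input, so it
                // should get a near-zero branch weight in the `use` phase.
                Err(e) => eprintln!("bad input: {}", e),
            }
        }
        println!("{}", total);
    }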

I will further investigate.

[1] https://github.com/rust-lang/cargo/issues/7416
[2] https://github.com/michaelwoerister/rust-pgo-test-programs/tree/master/branch_weights/



Re: [llvm-dev] PGO is ineffective for Rust - but why?

Jeremy Morse via llvm-dev
For anyone interested, I have a final update on this topic: I've come
to the conclusion that, with the previously mentioned Cargo issue [1]
fixed, profile-guided optimization now works as expected with Rust. I
have a number of reasons to think so:

- I did some semi-automated investigation of benchmarks that did not
show much of a speedup and was not able to find any missing branch
weights or function call counts. The concrete branch weights that are
easy to predict (error paths in code that does not error during
instrumentation runs) also looked correct to me. I subsequently added
regression tests to the Rust compiler which make sure that branch
weights are correct in a number of basic cases.

- I also investigated indirect call promotion and it seems that
idiomatic Rust code just contains very few indirect calls. I added
regression tests that make sure that indirect call promotion is
correctly performed for the two most common cases: calling through a
function pointer and making a dynamically dispatched method call
(both shapes are sketched after this list).

- Someone brought forth the hypothesis that Rust's much coarser
compilation unit granularity might (partly) explain the difference in
PGO effectiveness compared to C/C++ [2] -- and indeed my experiments
seem to back this hypothesis up. When compiling Rust code for maximum
performance, one usually lets the compiler generate a single object
file per crate, which is equivalent to having a single object file per
static library in C/C++. With this setup, PGO was only able to achieve
an average 0.3% performance improvement in my benchmarks. However,
increasing the number of object files to (roughly) one per source file
led to an average performance improvement of 1.2%, that is, PGO made 4
times as much of a difference. Reducing ThinLTO's import-instr-limit
to 10 magnified the effect even more, making the PGO version about 4%
faster than the non-PGO version, which is well within the range of
improvement that one can expect from PGO. Interestingly, this last
configuration with the stricter import limit was the most performant
one, being also ~3% faster than the single compilation unit setup both
with and without PGO.
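
For reference, the two indirect-call shapes mentioned above look
roughly like the following (a minimal, hypothetical sketch, not the
actual regression tests). With profile data, LLVM's indirect call
promotion can speculatively turn the dominant target of such calls
into a direct call.

    // Minimal, hypothetical sketch of the two indirect-call shapes:
    // a call through a function pointer and a dynamically dispatched
    // (trait object) method call.
    trait Shape {
        fn area(&self) -> f64;
    }

    struct Square(f64);

    impl Shape for Square {
        fn area(&self) -> f64 {
            self.0 * self.0
        }
    }

    fn double(x: f64) -> f64 {
        2.0 * x
    }

    // Indirect call through a function pointer.
    fn apply(f: fn(f64) -> f64, x: f64) -> f64 {
        f(x)
    }

    // Indirect calls through a vtable (dynamic dispatch on `dyn Shape`).
    fn total_area(shapes: &[Box<dyn Shape>]) -> f64 {
        shapes.iter().map(|s| s.area()).sum()
    }

    fn main() {
        let shapes: Vec<Box<dyn Shape>> =
            vec![Box::new(Square(2.0)), Box::new(Square(3.0))];
        println!("{}", apply(double, total_area(&shapes)));
    }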

In conclusion, (1) there is no evidence that the implementation is
broken and (2) there are a number of cases and configurations that
demonstrate that PGO *can* make as much of a difference as can be
expected from it.

[1] https://github.com/rust-lang/cargo/issues/7416
[2] https://internals.rust-lang.org/t/profile-guided-optimization-how-well-does-it-work-for-you/11108/11

On Tue, Sep 24, 2019 at 5:15 PM Michael Woerister
<[hidden email]> wrote:

>
> To give a little update here:
>
> - I've been further investigating and found an issue [1] with the
> Cargo build tool that most Rust projects use. This issue prevents all
> projects using Cargo from properly using PGO because it causes symbol
> names to be different between the generate and the use phase. With
> this issue fixed the number of "No profile data available for
> function" warnings goes down from 92369 to 1167 for the Firefox
> codebase.
>
> - I also found that the potential GNU ld bug mentioned above
> apparently does not affect Firefox. The number of "No profile data
> available for function" warnings is exactly the same for GNU ld and
> LLD. I don't know yet where the remaining 1167 warnings come from
> though.
>
> - Unfortunately, even with all of the above fixes applied, my medium
> sized benchmark still performs worse with PGO than without it. For my
> tiny example [2] PGO reduces the number of branch misses by more than
> 50%. For the medium sized benchmark, however, the PGO version has
> slightly *more* branch misses. This seems to indicate that there is
> still something wrong.
>
> I will further investigate.
>
> [1] https://github.com/rust-lang/cargo/issues/7416
> [2] https://github.com/michaelwoerister/rust-pgo-test-programs/tree/master/branch_weights/
>
>
> On Tue, Sep 17, 2019 at 6:16 PM Xinliang David Li <[hidden email]> wrote:
> >
> > You can check the difference of input args and object files to the linker.
> >
> > Regarding gnu ld, it is possible that it triggers another bug relating to start section and garbage collection. A previous bug is here: https://bugs.llvm.org/show_bug.cgi?id=25286
> >
> > On Tue, Sep 17, 2019 at 8:39 AM Michael Woerister <[hidden email]> wrote:
> >>
> >> Interestingly, a C version of the same test program [1] compiled with
> >> Clang 8 does not have any problems with GNU ld: The `__llvm_prf_data`
> >> section is the same size for all three linkers. It must be something
> >> specific to the Rust compiler that's going wrong here.
> >>
> >> [1] https://github.com/michaelwoerister/rust-pgo-test-programs/tree/master/cpp_branch_weights
> >>
> >> On Tue, Sep 17, 2019 at 3:26 PM Michael Woerister
> >> <[hidden email]> wrote:
> >> >
> >> > > Can you clarify if performance difference is caused by using different linkers at instrumentation build?
> >> >
> >> > Yes, good observation! Whether the bug occurs depends only on the
> >> > linker used for creating the instrumented binary. The linker used
> >> > during the "use" phase makes no difference.
> >> >
> >> > > If that is the case, try dump the sections of the resulting binary and compare __llvm_prf_** sections.
> >> >
> >> > For the final instrumented executable, it looks like the
> >> > `__llvm_prf_data` section is 480 bytes large when using GNU ld, while
> >> > it is 528 bytes for gold and lld. The size difference (48 bytes)
> >> > incidentally is exactly the size of the `__llvm_prf_data` section in
> >> > the object file containing the code that is later missing branch
> >> > weights. It looks like the GNU linker loses the `__llvm_prf_data`
> >> > section from that object file?
> >> >
> >> > > Also check the arguments passed to the linker. It should have -u__llvm_profile_runtime to force the profile runtime to be linked in.
> >> >
> >> > `-u__llvm_profile_runtime` is properly passed to the linker,
> >> > regardless of which linker it is.
> >> >
> >> > On Mon, Sep 16, 2019 at 7:40 PM Xinliang David Li <[hidden email]> wrote:
> >> > >
> >> > > Can you clarify if performance difference is caused by using different linkers at instrumentation build?  If that is the case, try dump the sections of the resulting binary and compare __llvm_prf_** sections. Also check the arguments passed to the linker. It should have -u__llvm_profile_runtime   to force the profile runtime to be linked in.
> >> > >
> >> > > David
> >> > >
> >> > > On Mon, Sep 16, 2019 at 8:42 AM Michael Woerister via llvm-dev <[hidden email]> wrote:
> >> > >>
> >> > >> So one interesting observation has already come out of this: I
> >> > >> confirmed that `rustc` indeed uses `-ffunction-sections` and
> >> > >> `-fdata-sections` on all platforms except for macOS. When trying out
> >> > >> different linkers for a small test case [1], however, I found that
> >> > >> there were rather large differences in execution time:
> >> > >>
> >> > >> ld (no PGO) = 172 ms
> >> > >> ld (PGO) = 196 ms
> >> > >>
> >> > >> gold (no PGO) = 182 ms
> >> > >> gold (PGO) = 141 ms
> >> > >>
> >> > >> lld (no PGO) = 193 ms
> >> > >> lld (PGO) = 171 ms
> >> > >>
> >> > >> So `gold` and `lld` both profit from PGO quite a bit, while `ld`
> >> > >> linked programs are slower with PGO. I then noticed that branch
> >> > >> weights for `ld` were missing from most branches, while the counts for
> >> > >> the other linkers are correct. All of this suggests to me that
> >> > >> something goes wrong when `ld` tries to link in the profiling runtime.
> >> > >>
> >> > >> I'll be investigating further.
> >> > >>
> >> > >> [1] https://github.com/michaelwoerister/rust-pgo-test-programs/tree/master/branch_weights
> >> > >>
> >> > >>
> >> > >> On Thu, Sep 12, 2019 at 6:31 PM Teresa Johnson <[hidden email]> wrote:
> >> > >> >
> >> > >> >
> >> > >> >
> >> > >> > On Thu, Sep 12, 2019 at 8:18 AM Teresa Johnson <[hidden email]> wrote:
> >> > >> >>
> >> > >> >> I just have a couple suggestions off the top of my head:
> >> > >> >> - have you tried using the new pass manager (-fexperimental-new-pass-manager)? That has access to additional analysis info during inlining and is able to make more precise PGO based inline decisions.
> >> > >> >
> >> > >> >
> >> > >> > (although note the above shouldn't make the difference between no performance and a typical PGO performance boost)
> >> > >> >
> >> > >> > Another thing I just thought of - are you using -ffunction-sections and -fdata-sections? These will allow for PGO based function layout in the linker (assuming you are using lld or gold).
> >> > >> >
> >> > >> >> - have you tried collecting profile data with and without PGO to see if you can compare where cycles are being spent? That's my usual way of debugging performance differences related to inlining or profile changes.
> >> > >> >> - just a comment that it is odd you are getting better performance without the pre-inlining - which typically helps because you get better context-sensitive profile info. Maybe sanity check that the pre inlining is kicking in for both the profile gen and use passes?
> >> > >> >>
> >> > >> >> Teresa
> >> > >> >>
_______________________________________________
LLVM Developers mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] PGO is ineffective for Rust - but why?

Jeremy Morse via llvm-dev
Thanks for the update. One comment below. Teresa

On Mon, Dec 2, 2019 at 1:15 AM Michael Woerister via llvm-dev <[hidden email]> wrote:
For anyone interested, I have a final update on this topic: I've come
to the conclusion that, with the previously mentioned Cargo issue [1]
fixed, profile-guided optimization now works as expected with Rust. I
have a number of reasons to think so:

- I did some semi-automated investigation of benchmarks that did not
show much of a speedup and was not able to find any missing branch
weights or function call counts. The concrete branch weights that are
easy to predict (error paths in code that does not error during
instrumentation runs) also looked correct to me. I subsequently added
regression tests to the Rust compiler which make sure that branch
weights are correct in a number of basic cases.
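
To give an idea of the kind of case I mean, here is a simplified
sketch (hypothetical code written for this mail, not one of the
actual regression tests; `parse_positive` is a made-up name). If the
instrumentation run only ever sees well-formed inputs, the `Err` arm
should end up with a near-zero profile count and correspondingly
skewed `branch_weights` in the `use` phase:

fn parse_positive(s: &str) -> Result<u32, String> {
    match s.parse::<u32>() {
        Ok(n) if n > 0 => Ok(n),
        // Cold during the instrumentation run: inputs are always valid.
        _ => Err(format!("invalid input: {:?}", s)),
    }
}

fn main() {
    let mut sum = 0u64;
    // Training workload: only well-formed inputs, so the error branch
    // keeps an (almost) zero count in the profile.
    for i in 1..1_000_000u32 {
        sum += parse_positive(&i.to_string()).unwrap() as u64;
    }
    println!("{}", sum);
}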

- I also investigated indirect call promotion and it seems that
idiomatic Rust code just contains very few indirect calls. I added
regression tests that make sure that indirect call promotion is
correctly performed for the two most common cases: calling through a
function pointer and making a dynamically dispatched method call.
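
For reference, the two call shapes look roughly like the following
(a hand-written illustration, not the actual test case from the
compiler's test suite; `Shape`, `Square` and `double` are made-up
names). With profile data, both call sites are candidates for
promotion to a guarded direct call:

trait Shape {
    fn area(&self) -> f64;
}

struct Square(f64);

impl Shape for Square {
    fn area(&self) -> f64 {
        self.0 * self.0
    }
}

fn double(x: f64) -> f64 {
    x * 2.0
}

fn main() {
    let mut acc = 0.0;

    // 1. Indirect call through a function pointer.
    let f: fn(f64) -> f64 = double;
    for i in 0..1_000_000 {
        acc += f(i as f64);
    }

    // 2. Dynamically dispatched method call through a trait object.
    let shape: &dyn Shape = &Square(3.0);
    for _ in 0..1_000_000 {
        acc += shape.area();
    }

    println!("{}", acc);
}

In this toy program each call site only ever has a single target, so
the value profile trivially identifies the dominant callee; the
interesting question for real code is how often that happens at all.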

- Someone put forward the hypothesis that Rust's much coarser
compilation unit granularity might (partly) explain the difference in
PGO effectiveness compared to C/C++ [2] -- and indeed my experiments
seem to back this hypothesis up. When compiling Rust code for maximum
performance, one usually lets the compiler generate a single object
file per crate, which is equivalent to having a single object file per
static library in C/C++. With this setup, PGO was only able to achieve
an average 0.3% performance improvement in my benchmarks. However,
increasing the number of object files to (roughly) one per source file
led to an average performance improvement of 1.2%, that is, PGO made 4
times as much of a difference. Reducing ThinLTO's `import-instr-limit`
to 10 magnified the effect even more, making the PGO version about 4%
faster than the non-PGO version, which is well within the range of
improvement that one can expect from PGO. Interestingly, this last
configuration with the stricter import limit was the most performant
one, being also ~3% faster than the single compilation unit setup both
with and without PGO.

This is quite interesting. It suggests that with either a single large compilation unit, or when ThinLTO effectively creates one via lots of importing, there is over-inlining of code that is presumably not as hot, which hurts overall performance. E.g., since the inliner is bottom-up, inlining cold or lukewarm code might prevent more important inlines further up the call chain, because the function becomes too large. With the split compilation units and more conservative importing, it is presumably importing, and therefore inlining, the hotter call edges more effectively. I know David has been looking at this type of situation in the inliner.
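
To make that concrete, here is a toy illustration (hypothetical code,
not taken from any of the benchmarks discussed in this thread; all
names are made up). If the large, rarely taken `big_lukewarm` gets
inlined into `mid` first, `mid` itself can grow past the inline
threshold, and the genuinely hot `hot_loop` -> `mid` edge may then be
rejected even though that is the edge the profile says matters:

fn big_lukewarm(mut x: u64) -> u64 {
    // Deliberately bulky body, standing in for lukewarm library code.
    for i in 0..64u64 {
        x = x.rotate_left(7) ^ x.wrapping_mul(0x9E37_79B9_7F4A_7C15).wrapping_add(i);
    }
    x
}

fn mid(x: u64) -> u64 {
    if x % 1024 == 0 {
        big_lukewarm(x) // taken on ~0.1% of iterations
    } else {
        x.wrapping_add(1) // the hot path the profile should favour
    }
}

fn hot_loop(n: u64) -> u64 {
    let mut acc = 0u64;
    for i in 0..n {
        acc = acc.wrapping_add(mid(i));
    }
    acc
}

fn main() {
    println!("{}", hot_loop(10_000_000));
}

With more, smaller codegen units and a stricter import limit,
`big_lukewarm` is presumably less likely to be imported and inlined
into `mid` in the first place, leaving room for the hot edge.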


In conclusion, (1) there is no evidence that the implementation is
broken and (2) there are a number of cases and configurations that
demonstrate that PGO *can* make as much of a difference as can be
expected from it.

[1] https://github.com/rust-lang/cargo/issues/7416
[2] https://internals.rust-lang.org/t/profile-guided-optimization-how-well-does-it-work-for-you/11108/11

On Tue, Sep 24, 2019 at 5:15 PM Michael Woerister
<[hidden email]> wrote:
>
> To give a little update here:
>
> - I've been further investigating and found an issue [1] with the
> Cargo build tool that most Rust projects use. This issue prevents all
> projects using Cargo from properly using PGO because it causes symbol
> names to be different between the generate and the use phase. With
> this issue fixed the number of "No profile data available for
> function" warnings goes down from 92369 to 1167 for the Firefox
> codebase.
>
> - I also found that the potential GNU ld bug mentioned above
> apparently does not affect Firefox. The number of "No profile data
> available for function" warnings is exactly the same for GNU ld and
> LLD. I don't know yet where the remaining 1167 warnings come from
> though.
>
> - Unfortunately, even with all of the above fixes applied, my medium
> sized benchmark still performs worse with PGO than without it. For my
> tiny example [2] PGO reduces the number of branch misses by more than
> 50%. For the medium sized benchmark, however, the PGO version has
> slightly *more* branch misses. This seems to indicate that there is
> still something wrong.
>
> I will further investigate.
>
> [1] https://github.com/rust-lang/cargo/issues/7416
> [2] https://github.com/michaelwoerister/rust-pgo-test-programs/tree/master/branch_weights/
>
>
> On Tue, Sep 17, 2019 at 6:16 PM Xinliang David Li <[hidden email]> wrote:
> >
> > You can check the difference of input args and object files to the linker.
> >
> > Regarding gnu ld, it is possible that it triggers another bug relating to start section and garbage collection. A previous bug is here: https://bugs.llvm.org/show_bug.cgi?id=25286
> >
> > On Tue, Sep 17, 2019 at 8:39 AM Michael Woerister <[hidden email]> wrote:
> >>
> >> Interestingly, a C version of the same test program [1] compiled with
> >> Clang 8 does not have any problems with GNU ld: The `__llvm_prf_data`
> >> section is the same size for all three linkers. It must be something
> >> specific to the Rust compiler that's going wrong here.
> >>
> >> [1] https://github.com/michaelwoerister/rust-pgo-test-programs/tree/master/cpp_branch_weights
> >>


--
Teresa Johnson | Software Engineer | [hidden email] |

_______________________________________________
LLVM Developers mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev