[llvm-dev] Making Clang/LLVM faster using code layout optimizations

[llvm-dev] Making Clang/LLVM faster using code layout optimizations


We’d like to share how Clang could be made up to 15% faster using code layout optimizations implemented in BOLT – our open-source LLVM-based post-link optimizer. These improvements are achieved on top of PGO and LTO. We thought the results and their practical value would be interesting to the LLVM community. A detailed step-by-step guide to optimizing Clang is available at:

https://github.com/facebookincubator/BOLT/blob/master/docs/OptimizingClang.md

While we think that there is room for improvement in LLVM’s code layout algorithm, we strongly believe that, to achieve the maximum benefits for large applications, a post-link optimizer is needed. We would like to know what people think about including BOLT in the LLVM family of tools.

More details about BOLT are available in our arXiv.org paper at: https://arxiv.org/pdf/1807.06735.pdf

Maksim & Rafael


_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Re: [llvm-dev] Making Clang/LLVM faster using code layout optimizations

Hi,


  We, the compiler optimization team at Google, have been working for
the past year on further improving the performance of PGO binaries.
We have also evaluated Facebook’s BOLT in this context and we would
like to share our findings and thoughts.

TL;DR: Facebook’s BOLT, a post-link optimizer (PLO), is able to
improve the performance of our large, well-optimized (PGO + ThinLTO)
benchmark binaries, but it suffers from scalability issues. We believe
that a PLO assisted by the linker would solve most of these issues,
and we are working on some proposals and prototypes towards this.


Summary of our findings:

* While PGO gives huge performance speedups, there are many
instances where profile information is either inaccurate or absent due
to the various transformations applied. While the profile counts are
precise at the moment they are read, the compiler cannot maintain
their precision through some transformations.

* For example, context-sensitive profile information is lost because the
counters for collecting profiles are created before the regular
inliner is invoked.

* Optimization passes such as branch lowering, jump threading, and the
loop transformation passes introduce additional control flow; the
profiles for these new branches are estimated and can be very inaccurate,
leading to suboptimal basic block layout.

* Early this year, before we started evaluating BOLT, we worked on
addressing the problem of loss of profile precision from PGO.

* We found that an iterative PGO solution, where the second
iteration uses context-sensitive profiles collected after optimization,
improves our key benchmarks by 2% (more details below).

* We recently evaluated BOLT on the same benchmarks. BOLT works on
profiles of the final PGO-optimized binary, which capture all the
context-sensitive and flow-sensitive information and hence are not
prone to the above issues.

* We found that BOLT gives us similar performance boosts, mainly due
to basic block reordering. BOLT is additionally able to improve the
same benchmarks by 0.5% through function splitting.
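For readers who haven’t seen it, function splitting separates the never- or rarely-executed blocks of a function from its hot path, so the hot code packs more densely into the instruction cache. A minimal sketch of the idea (this is not BOLT’s implementation; the block names, counts, and threshold below are all made up for illustration):

```python
def split_function(blocks, counts, threshold=0):
    """Split a function's basic blocks into hot and cold parts.

    blocks:    basic-block names in their current layout order
    counts:    profiled execution count for each block
    threshold: blocks at or below this count are considered cold
    """
    hot = [b for b in blocks if counts[b] > threshold]
    cold = [b for b in blocks if counts[b] <= threshold]
    return hot, cold

# Hypothetical profile: the error path never runs in production.
blocks = ["entry", "check", "error_path", "loop", "exit"]
counts = {"entry": 1000, "check": 1000, "error_path": 0,
          "loop": 98000, "exit": 1000}
hot, cold = split_function(blocks, counts)
# hot  -> ["entry", "check", "loop", "exit"]  (stays in the hot region)
# cold -> ["error_path"]                      (moved to a cold region)
```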

* However, BOLT suffers from scalability issues: ~45 GB of memory and
more than an hour to process a 700 MB binary. It also needs to update
debug info and requires rigorous testing, as it is a stand-alone tool.

* We believe that a PLO that relies on the linker to do a large
part of the work is the right direction forward.

* The PLO would focus on disassembly and analysis, and would use the
linker’s assistance at appropriate points to read the binary and write
the modified binary.

* We have some ideas in this space where the PLO can operate with the
assistance of the linker.  Towards this, we are working on a proposal
and prototype for basic block sections.


Iterative PGO experiment:

* Early this year, before we started evaluating BOLT, we worked on
addressing the problem of loss of profile precision from PGO.

* Instrumentation-based PGO is known to produce context-insensitive
profiles. After the inline transformation, profiles are scaled
uniformly based on the call-site count. Early inlining partially
mitigates the problem, but it is not enough.
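A toy numeric illustration of this scaling problem (the numbers and names are hypothetical, not from our benchmarks): suppose a branch in a callee f() is taken 90% of the time when f() is called from A, but only 10% when called from B.

```python
# The instrumentation counters live in f(), so both calling contexts are
# merged into a single context-insensitive profile before inlining.
calls_from_A, calls_from_B = 1000, 1000
taken_from_A = 900   # branch taken 90% of the time in context A
taken_from_B = 100   # branch taken 10% of the time in context B

merged_taken = taken_from_A + taken_from_B   # 1000
merged_total = calls_from_A + calls_from_B   # 2000
merged_prob = merged_taken / merged_total    # 0.5

# After f() is inlined into A and B, its counts are scaled uniformly by
# each call-site count, so BOTH inlined copies see a 50/50 branch:
prob_in_A = (merged_prob * calls_from_A) / calls_from_A  # 0.5, truth is 0.9
prob_in_B = (merged_prob * calls_from_B) / calls_from_B  # 0.5, truth is 0.1
```

Both inlined copies get laid out as if the branch were unpredictable, even though each one is strongly biased.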

* The idea here is to add another round of PGO instrumentation after
the inline transformation. This profile is used only by the
optimizations that run after inlining, such as basic block layout.

* The key is that the second-pass profile is precise in terms of
context sensitivity and does not affect the inlining decisions.
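A toy model of the scheme (illustrative only, not the actual implementation; the counter names are made up): because the round-2 counters are created after inlining, each inlined copy of a callee gets its own counters, recovering the per-context branch bias that round 1 merges away.

```python
# Hypothetical round-2 profile: one counter set per inlined copy of f().
round2 = {
    "A.f_inlined.branch": {"taken": 900, "total": 1000},  # 90% taken in A
    "B.f_inlined.branch": {"taken": 100, "total": 1000},  # 10% taken in B
}

def branch_prob(site):
    """Precise, context-sensitive branch probability from round 2."""
    c = round2[site]
    return c["taken"] / c["total"]

# Post-inline optimizations such as block layout can now use precise
# probabilities, while inlining decisions themselves were already fixed
# by the round-1 profile.
print(branch_prob("A.f_inlined.branch"))  # 0.9
print(branch_prob("B.f_inlined.branch"))  # 0.1
```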

* We compared BOLT with this approach and found that with basic block
reordering alone, this approach does as well as BOLT, if not slightly
better.

* We then used BOLT to further optimize this binary and obtained an
additional speedup of 0.5%, due to function splitting, which is not
available in the compiler today. Notably, BOLT’s basic block
reordering did not find any additional opportunities for
performance improvement on this iteratively optimized PGO binary.

* Note that BOLT has other optimizations too, but the performance
boosts we obtained came only from BOLT’s basic block reordering
(the cache+ algorithm) and function splitting.
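To convey the flavor of profile-guided block reordering, here is a greedy chain-merging sketch in the spirit of the classic Pettis-Hansen layout (BOLT’s cache+ algorithm is considerably more sophisticated; this toy version only shows the core idea of turning hot edges into fall-throughs):

```python
def layout(blocks, edges):
    """Greedy block layout: repeatedly make the hottest branch edge a
    fall-through by gluing its source chain to its destination chain.

    blocks: block names, entry block first
    edges:  {(src, dst): taken_count} from the profile
    """
    chains = {b: [b] for b in blocks}  # every block starts as its own chain
    for (src, dst), _ in sorted(edges.items(), key=lambda e: -e[1]):
        a, b = chains[src], chains[dst]
        # merge only if src ends one chain and dst starts a different one
        if a is not b and a[-1] == src and b[0] == dst:
            a.extend(b)
            for blk in b:
                chains[blk] = a
    order, seen = [], set()
    for blk in blocks:                 # emit chains, entry chain first
        c = chains[blk]
        if id(c) not in seen:
            seen.add(id(c))
            order.extend(c)
    return order

# Hypothetical CFG profile: the entry->hot->exit path dominates.
edges = {("entry", "hot"): 990, ("entry", "cold"): 10,
         ("hot", "exit"): 990, ("cold", "exit"): 10}
print(layout(["entry", "cold", "hot", "exit"], edges))
# -> ['entry', 'hot', 'exit', 'cold']: the hot path falls through
```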


A case for a Post Link Optimizer (PLO) assisted by the linker:

* Our experiments with BOLT show that it is able to work on large
binaries (with some caveats and bug fixes) and manages to squeeze
out performance improvements even from very well optimized binaries.

* We believe that a Post Link Optimizer (PLO) like BOLT is very useful,
as it provides a framework for whole-program analysis at the
machine-code level (on information not exposed at the IR level) and
enables whole-program transformations that require global ordering
(e.g., interprocedural register allocation).

* We have in fact prototyped an interprocedural spill optimization in
BOLT and found non-trivial performance improvements in some of our
micro-benchmarks.

* However, we feel there are shortcomings of BOLT that need to be addressed:

  + Scalability: the time and memory currently required are not within
acceptable limits for widespread deployment.

  + Debug info: BOLT has to fix up debug info after its transformations;
it does a good job of this, but more needs to be done here.

  + QA: it is an independent tool and needs rigorous testing.

* We are looking at improving BOLT’s scalability; in fact, we have
already reduced the profile conversion time from 40 minutes to
2 minutes, but there is still a lot to be done.

* We believe that many parts of the PLO’s work can be done with the
linker’s assistance, which would mitigate some of the above shortcomings.

* For instance, we are working on a proposal and prototype for basic
block sections, which would let the PLO do basic block reordering and
function splitting in a scalable manner with the linker’s help. We
will share this proposal shortly.
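To give a rough feel for how this could work, here is a toy model of linker-side layout with per-block sections (everything below – the section names, sizes, and ordering – is made up for illustration; the real mechanism would operate on ELF sections inside the linker, with no disassembly needed):

```python
# The compiler emits each basic block of foo() as its own section.
sections = {                 # section name -> size in bytes
    ".text.foo.entry":      32,
    ".text.foo.cold_check": 48,
    ".text.foo.hot_loop":   64,
    ".text.foo.ret":        16,
}

# A profile-derived ordering, computed offline: hot blocks first.
order = [".text.foo.entry", ".text.foo.hot_loop",
         ".text.foo.ret", ".text.foo.cold_check"]

# The "linker" lays out the sections by concatenating them in that order.
addr, address_of = 0, {}
for name in order:
    address_of[name] = addr
    addr += sections[name]

print(address_of[".text.foo.hot_loop"])    # 32: right after entry
print(address_of[".text.foo.cold_check"])  # 112: pushed past the hot code
```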


Thanks,
Sri





Re: [llvm-dev] Making Clang/LLVM faster using code layout optimizations

Thanks for sharing your findings, Sri. Implementing and benchmarking
iterative PGO is particularly insightful, and I'm glad the techniques BOLT
uses were able to achieve similar gains and more with function splitting.

I like the idea of improving BOLT by utilizing information from the linker,
and possibly integrating with the linker. I assume you mean lld. When
we created BOLT, one of the design goals was to make it work with any
compiler and any linker. However, if we can get extra benefit in performance
and/or usability from integrating with and exploiting functionality of lld,
I'm all for it. At the same time, keeping the standalone functionality will
be useful in scenarios where re-linking (especially with LTO/ThinLTO) is
undesirable or impossible. We should also think about platforms where lld is
not (yet) available.

I'm confident that, with the right design approach, we can achieve both
goals and make the optimizations part of lld while keeping the standalone tool.

Running into scalability issues when optimizing 700 MB of code is
understandable. At the same time, saving memory during processing wasn't
our top priority, and I'm sure there's lots of room for improvement.

Further optimization of register allocation sounds like a promising idea.
Looking forward to your patches and the proposal.

Maksim

Re: [llvm-dev] Making Clang/LLVM faster using code layout optimizations



On Fri, Oct 19, 2018 at 2:09 PM Maksim Panchenko via llvm-dev <[hidden email]> wrote:
> Thanks for sharing your findings, Sri. Implementing and benchmarking
> iterative PGO is particularly insightful, and I'm glad the techniques BOLT
> uses were able to achieve similar gains and more with function splitting.
>
> I like the idea of improving BOLT by utilizing information from the linker,
> and possible integration with the linker. I assume you mean lld. When
> we created BOLT, one of the design goals was to make it work with any
> compiler and any linker.

lld is certainly one of the main linkers we should add support for, but I think we should make the design linker-independent (using common interfaces) from the beginning.
 
> However, if we can get extra benefit in performance
> and/or usability from integrating and exploiting functionality of lld, I'm all
> for it. At the same time, keeping the standalone functionality will be useful
> in scenarios where re-linking (especially with LTO/ThinLTO) is undesirable
> or impossible. We should also think about platforms where lld is
> not (yet) available.


It is possible to make the PLO work in a standalone mode, as BOLT does, while also operating in a mode that involves a PLO driver, various optimization components, and the linker. The latter mode can also make heavy use of compiler-generated annotations.

Sri and Han plan to send a proposal for discussion in the near future.

thanks,

David


 