[llvm-dev] [RFC] Propeller: A framework for Post Link Optimizations

Sriraman Tallam via llvm-dev
Greetings,

We, at Google, recently evaluated Facebook’s BOLT, a Post Link Optimizer
framework, on large Google benchmarks and found that it improves key
performance metrics of these benchmarks by 2% to 6%, which is impressive
given that this is over and above a baseline binary already heavily optimized
with ThinLTO + PGO.  Furthermore, BOLT is also able to improve the performance
of binaries optimized via Context-Sensitive PGO.  While ThinLTO + PGO is also
profile guided and performs very aggressive optimizations, there is more room
for improvement because the profiles are approximate by the time the
transformations are applied.  BOLT uses exact profiles from the final binary
and is able to fill the gaps left by ThinLTO + PGO.  The performance
improvements from BOLT come from basic block layout, function reordering and
function splitting.

While BOLT does an excellent job of squeezing extra performance from highly
optimized binaries with optimizations such as code layout, it has two major
issues:

 * It does not take advantage of distributed build systems.
 * It has scalability issues; to rewrite a binary with a ~300M text segment:
   * The memory footprint is 70G.
   * It takes more than 10 minutes to rewrite the binary.

Similar to Full LTO, BOLT’s design is monolithic: it disassembles the
original binary, then optimizes and rewrites the final binary in a single
process.  This limits BOLT’s scalability, and the memory and time overheads
shoot up quickly for large binaries.

Inspired by the performance gains, and to address the scalability issues of
BOLT, we went about designing a scalable infrastructure that can perform
BOLT-like post-link optimizations.  In this RFC, we discuss our system,
“Propeller”, which can perform profile-guided link-time binary optimizations
in a scalable way and is friendly to distributed build systems.  Our system
leverages the existing capabilities of the compiler toolchain and is not a
standalone tool.  Like BOLT, our system boosts the performance of optimized
binaries via link-time optimizations using accurate profiles of the binary.
We describe the Propeller system and show how to do whole-program basic block
layout using Propeller.

Propeller does whole-program basic block layout at link time via basic block
sections.  We have added support for placing each basic block in its own
section, which allows the linker to do arbitrary reorderings of basic blocks
to achieve any desired fine-grained code layout, including block layout,
function splitting and function reordering.  Our experiments on large
real-world applications and SPEC with code layout show that Propeller can
optimize as effectively as BOLT, with just 20% of its memory footprint and
time overhead.

An LLVM branch with the Propeller patches is available in the git repository
here: https://github.com/google/llvm-propeller/  We will upload individual
patches of the various elements for review.  We have also attached a document
describing the Propeller system and its experiments in detail:
https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf


Quick Start - How to optimize further with Propeller?

# git clone and build repo

$ cd $LLVM_DIR && git clone https://github.com/google/llvm-propeller.git

$ mkdir $BUILD_DIR && cd $BUILD_DIR

$ cmake -G Ninja -DLLVM_ENABLE_PROJECTS="clang;lld;compiler-rt" … \
   $LLVM_DIR/llvm-propeller/llvm

$ ninja -j25

$ export PATH=$BUILD_DIR/bin:$PATH


# Let’s Propeller-optimize a sample two-file program (main.cc and callee.cc):


# Step 1: Build the peak optimized binary with an additional flag.

$ clang++ -O2 main.cc callee.cc -fpropeller-label -o a.out.labels -fuse-ld=lld

# Step 2: Profile the binary; with this input, only one side of the branch executes.
$ perf record -e cycles:u -j any,u -- ./a.out.labels 1000000 2 >&  /dev/null


# Step 3: Convert the profiles using the tool provided
$ $LLVM_DIR/llvm-propeller/create_llvm_prof  --format=propeller \
  --binary=./a.out.labels --profile=perf.data  --out=perf.propeller


# Step 4: Re-optimize with Propeller: repeat Step 1 with the Propeller flag changed.
$ clang++ -O2 main.cc callee.cc -fpropeller-optimize=perf.propeller -fuse-ld=lld

In Step 4, the optimized bitcode from Step 1 can be reused if it was saved,
since Propeller is active only during the compiler back end and the link.  The
optimized binary lays out the basic blocks of main differently, keeping the
executed blocks together and splitting off the cold blocks.
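Conceptually, Step 3 aggregates the raw LBR branch records into edge weights
between the labeled basic blocks.  A simplified sketch of that aggregation
(illustrative only; create_llvm_prof's actual logic, symbol handling and
output format differ, and the function names here are hypothetical):

```python
from collections import Counter
import bisect

def edge_weights(lbr_samples, block_starts):
    """lbr_samples: [(from_addr, to_addr), ...] branch records from perf's LBR.
    block_starts: sorted [(start_addr, label), ...] derived from the basic
    block labels emitted in Step 1.
    Returns {(src_label, dst_label): count} for the layout algorithm."""
    addrs = [a for a, _ in block_starts]

    def block_of(addr):
        # The block containing addr is the one with the greatest start <= addr.
        i = bisect.bisect_right(addrs, addr) - 1
        return block_starts[i][1]

    weights = Counter()
    for frm, to in lbr_samples:
        weights[(block_of(frm), block_of(to))] += 1
    return weights
```

For example, with blocks "a", "b", "c" starting at 0, 16 and 32, two sampled
branches from 10 to 16 become weight 2 on the edge ("a", "b").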

Some of the key points:

+  Added support for basic block sections, similar to function sections, where
each basic block can reside in its own section.

+  Jump instructions need 32-bit relocations and subsequent linker relaxations
after basic block layout.  We would like to add a new relocation type for jump
instructions to make it easier for relaxations and guarantee correctness.

+  Added support in the linker to read profiles (PMU LBR) and discover optimal
basic block layout using the Ex-TSP algorithm described here:
https://arxiv.org/abs/1809.04676
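For intuition, here is a greedy chain-merging sketch in the spirit of the
Ex-TSP approach (a toy model with hypothetical names; the real algorithm in
the paper above scores fall-throughs and short and long jumps with a weighted
objective rather than merging purely by edge weight):

```python
def layout_blocks(blocks, edges):
    """blocks: iterable of block ids, entry block first.
    edges: {(src, dst): weight} from the profile.
    Returns one ordering that tries to turn hot edges into fall-throughs."""
    chains = {b: [b] for b in blocks}      # every block starts as its own chain
    chain_of = {b: b for b in blocks}      # maps block -> key of its chain

    # Visit the hottest edges first; merge two chains when the edge runs from
    # the tail of one chain to the head of the other (a viable fall-through).
    for (src, dst), _w in sorted(edges.items(), key=lambda kv: -kv[1]):
        cs, cd = chain_of[src], chain_of[dst]
        if cs != cd and chains[cs][-1] == src and chains[cd][0] == dst:
            chains[cs].extend(chains.pop(cd))
            for b in chains[cs]:
                chain_of[b] = cs

    # Concatenate the surviving chains; unmerged cold blocks sink to the end.
    return [b for chain in chains.values() for b in chain]
```

With edges a→b (hot), b→d (hot), a→c and c→d (cold), the hot path a, b, d is
laid out contiguously and the cold block c is pushed after it.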

+  Added support in the linker to re-order basic block sections in any
specified sequence (this can be done with a symbol ordering file).  This
requires linker relaxations to delete and shrink branches across basic blocks.
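The effect of these relaxations can be pictured with a toy byte-size model
(a simplification with hypothetical names; real relaxation is iterative,
since shrinking one branch changes the displacements of others):

```python
def jump_bytes(layout, jumps, sizes):
    """layout: block ids in final order.
    jumps: {src: dst} for the unconditional jump ending each src block.
    sizes: {block: bytes, excluding its trailing jump}.
    Returns the bytes spent on jump instructions after one relaxation pass."""
    index = {b: i for i, b in enumerate(layout)}
    total = 0
    for src, dst in jumps.items():
        if index[dst] == index[src] + 1:
            continue                      # target now falls through: jump deleted
        # Rough displacement: bytes of the blocks between src's end and dst.
        lo, hi = sorted((index[src] + 1, index[dst]))
        dist = sum(sizes[layout[i]] for i in range(lo, hi))
        # Short form when the displacement fits in a signed byte, else long
        # (2-byte jmp rel8 vs 5-byte jmp rel32 on x86).
        total += 2 if -128 <= dist <= 127 else 5
    return total
```

So a jump whose target becomes the next block costs nothing, a nearby target
costs 2 bytes, and only a far target keeps the full 5-byte encoding.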

+  Compared our system to BOLT and showed that it can produce similar
performance improvements, but with much lower memory and time overheads.  Our
experiments are on very large warehouse-scale benchmarks and SPEC 2017.

+  Listed why this cannot be done as part of PGO itself: post-link
optimizations can transform the binary using accurate profiles, whereas PGO
passes suffer from profile imprecision.

+  Updated DebugInfo and CFI to account for arbitrary ordering of basic blocks
via basic block sections.

+  Discussed CFI handling, which is currently sub-optimal: it bloats object
file and binary sizes much more than DebugInfo does, due to the lack of
support for discontiguous address ranges.  We would like to make a case for
supporting discontiguous ranges with CFI, which would make basic block
sections much cheaper.

Detailed RFC document here:
https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf

Please let us know what you think,
Thanks
Sri on behalf of the team.
_______________________________________________
LLVM Developers mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Re: [llvm-dev] [RFC] Propeller: A framework for Post Link Optimizations

Mehdi via llvm-dev
Hi Sriraman,

This is an impressive piece of work! The results look really good, and the document you provided is very thorough. Looking forward to the patches :)

Best,

-- 
Mehdi


Re: [llvm-dev] [RFC] Propeller: A framework for Post Link Optimizations

Michael Spencer via llvm-dev

The function ordering parts of this look like they are doing the same thing as lld's existing function reordering code, just with a different profile source and different ordering algorithm. Have you compared the performance of just the function reordering bits to what we already have (https://github.com/google/llvm-propeller/blob/plo-dev/lld/ELF/CallGraphSort.cpp)?

I would expect that whichever ordering algorithm is best would be the best for both profile sources, so I would recommend looking at integrating the function reordering code with the existing code instead of adding a separate one.

- Michael Spencer 

Re: [llvm-dev] [RFC] Propeller: A framework for Post Link Optimizations

Eli Friedman via llvm-dev
My biggest question about this architecture is about when Propeller runs basic block reordering within a function.  It seems like a lot of the complexity comes from using the proposed -fbasicblock-sections to generate mangled ELF, and then re-parsing the mangled ELF as a separate step.  I'm not sure that's the right approach, long-term.

Splitting every basic block into its own section introduces overhead, like you described.  And it's likely more complex on non-x86 targets, which have a greater variety of conditional branches.  And the reordering itself involves a bunch of x86 and ELF-specific code.

I'd like to suggest an alternative: instead of performing basic block reordering and function splitting by manipulating the ELF files, you could perform reordering and splitting as part of link-time code generation, as an MIR pass that runs just before the assembly printer.  MIR is almost exactly the form you want for this sort of manipulation: it has basic blocks which correspond closely to the final binary, and a high-level representation of branch instructions.  And it's before DWARF/CFI emission, so you don't need to worry about fixing them afterwards.  This should take less code overall, and much less target-specific code.  And infrastructure for function splitting would be useful for non-Propeller workflows.

There are some minor downsides to this approach I can think of.  You lose a little flexibility, in that you can't mix blocks from different functions together, but you aren't doing that anyway, from your description?  It might take more CPU time to re-do LTO code generation, but that's probably not a big deal for applications where the user is willing to collect two different kinds of PGO profiles for a program.

I understand you can't use a regular PGO profile for this, but it should be possible to pass in a Propeller-PGO profile separately to the compiler, and apply that to the MIR as a late pass in the compiler, separate from any regular PGO profile.  There are probably multiple ways you could do this; you could use basic block symbols, like your current implementation.

-Eli


Re: [llvm-dev] [RFC] Propeller: A framework for Post Link Optimizations

Sriraman Tallam via llvm-dev
On Wed, Sep 25, 2019 at 5:02 PM Eli Friedman <[hidden email]> wrote:
>
> My biggest question about this architecture is about when propeller runs basic block reordering within a function.  It seems like a lot of the complexity comes from using the proposed -fbasicblock-sections to generated mangled ELF, and then re-parsing the mangled ELF as a separate step.  I'm not sure that's the right approach, long-term.
>
> Splitting every basic block into its own section introduces overhead, like you described.  And it's likely more complex on non-x86 targets, which have a greater variety of conditional branches.  And the reordering itself involves a bunch of x86 and ELF-specific code.
>
> I'd like to suggest an alternative: instead of perform basic block reordering and function splitting by manipulating the ELF files, you could perform reordering and splitting as part of link-time code generation, as an MIR pass that runs just before the assembly printer.  MIR is almost exactly the form you want for this sort of manipulation: it has basic blocks which correspond closely to the final binary, and a high-level representation of branch instructions.  And it's before the DWARF/CFI emission, so you don't need to worry about fixing them afterwards.  This should take less code overall, and much less target-specific code. And infrastructure for function splitting would be useful for non-Propeller workflows.
>
> There are some minor downsides to this approach I can think of.  You lose a little flexibility, in that you can't mix blocks from different functions together, but you aren't doing that anyway, from your description?

I mentioned this in the RFC, under alternate designs in the Experiments
section.  We did consider this approach.  This is our first attempt at
doing this end to end, and we plan to do much more.  We are looking at
much more aggressive inter-procedural code layout optimizations based
on path profiles and machine learning, which require support for basic
block sections.  Being able to do inter-procedural block layout is
already showing more performance improvement over what we have right
now.  We don't think basic block sections are expensive if done on
demand, and this support opens up opportunities for further performance
exploration.
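As a toy illustration of what per-block sections buy: once every basic block is an independent section symbol, the linker can compute one global order that interleaves blocks from different functions along hot profile edges. The sketch below uses hypothetical block names and edge weights and a simplified greedy chain-merging objective; it is not the actual Ex-TSP code from the RFC.

```python
# Toy global block layout: every basic block is its own "section",
# identified as (function, block).  Profile edges carry branch counts.
# Greedily merge chains at their endpoints, hottest edges first, so hot
# branches become fall-throughs.  Simplified illustration only.

def layout(blocks, edges):
    # Each block starts as its own chain.
    chain_of = {b: [b] for b in blocks}
    # Visit the hottest edges first.
    for (src, dst), weight in sorted(edges.items(), key=lambda e: -e[1]):
        a, b = chain_of[src], chain_of[dst]
        # Merge only if src ends one chain and dst begins another.
        if a is not b and a[-1] == src and b[0] == dst:
            a.extend(b)
            for blk in b:
                chain_of[blk] = a
    # Concatenate the remaining chains in discovery order.
    seen, order = set(), []
    for c in chain_of.values():
        if id(c) not in seen:
            seen.add(id(c))
            order.extend(c)
    return order

# Hypothetical profile: main's hot loop falls through into callee.
blocks = [("main", 0), ("main", 1), ("main", 2), ("callee", 0), ("callee", 1)]
edges = {
    (("main", 0), ("main", 1)): 100,   # hot intra-function path
    (("main", 1), ("callee", 0)): 90,  # hot cross-function edge
    (("callee", 0), ("callee", 1)): 5,
    (("main", 0), ("main", 2)): 1,     # cold path
}
order = layout(blocks, edges)
```

With this toy profile, callee's blocks land immediately after main's hot blocks and ahead of main's cold block, i.e. the kind of inter-procedural reordering a purely per-function pass cannot express.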

> It might take more CPU time to re-do LTO code generation, but that's
> probably not a big deal for applications where the user is willing to
> collect two different kinds of PGO profiles for a program.

>
> I understand you can't use a regular PGO profile for this, but it should be possible to pass in a Propeller-PGO profile separately to the compiler, and apply that to the MIR as a late pass in the compiler, separate from any regular PGO profile.  There are probably multiple ways you could do this; you could use basic block symbols, like your current implementation.
>
> -Eli
>
> -----Original Message-----
> From: llvm-dev <[hidden email]> On Behalf Of Sriraman Tallam via llvm-dev
> Sent: Tuesday, September 24, 2019 4:51 PM
> To: [hidden email]
> Subject: [EXT] [llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations
>
> Greetings,
>
> We, at Google, recently evaluated Facebook’s BOLT, a Post Link Optimizer
> framework, on large google benchmarks and noticed that it improves key
> performance metrics of these benchmarks by 2% to 6%, which is pretty impressive
> as this is over and above a baseline binary already heavily optimized with
> ThinLTO + PGO.  Furthermore, BOLT is also able to improve the performance of
> binaries optimized via Context-Sensitive PGO.  While ThinLTO + PGO is also
> profile guided and does very aggressive performance optimizations, there is
> more room for performance improvements due to profile approximations while
> applying the transformations.  BOLT uses exact profiles from the final binary
> and is able to fill the gaps left by ThinLTO + PGO. The performance
> improvements due to BOLT come from basic block layout, function reordering and
> function splitting.
>
> While BOLT does an excellent job of squeezing extra performance from highly
> optimized binaries with optimizations such as code layout, it has these major
> issues:
>
>  * It does not take advantage of distributed build systems.
>  * It has scalability issues; to rewrite a binary with a ~300M text segment
> size:
>    * Memory footprint is 70G.
>    * It takes more than 10 minutes to rewrite the binary.
>
> Similar to Full LTO, BOLT’s design is monolithic as it disassembles the
> original binary, optimizes and rewrites the final binary in one process.  This
> limits the scalability of BOLT and the memory and time overhead shoots up
> quickly for large binaries.
>
> Inspired by the performance gains and to address the scalability issue of BOLT,
> we went about designing a scalable infrastructure that can perform BOLT-like
> post-link optimizations. In this RFC, we discuss our system, “Propeller”,
> which can perform profile guided link time binary optimizations in a scalable
> way and is friendly to distributed build systems.  Our system leverages the
> existing capabilities of the compiler tool-chain and is not a standalone tool.
> Like BOLT, our system boosts the performance of optimized binaries via
> link-time optimizations using accurate profiles of the binary. We discuss the
> Propeller system and show how to do the whole program basic block layout using
> Propeller.
>
> Propeller does whole program basic block layout at link time via basic block
> sections.  We have added support for having each basic block in its own section
> which allows the linker to do arbitrary reorderings of basic blocks to achieve
> any desired fine-grain code layout which includes block layout, function
> splitting and function reordering.  Our experiments on large real-world
> applications and SPEC with code layout show that Propeller can optimize as
> effectively as BOLT, with just 20% of its memory footprint and time overhead.
>
> An LLVM branch with propeller patches is available in the git repository here:
> https://github.com/google/llvm-propeller/  We will upload individual patches of
> the various elements for review.  We have attached a google doc describing the
> Propeller system with Experiments in detail,
> https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf
>
>
> Quick Start - How to optimize further with Propeller?
>
> # git clone and build repo
>
> $ cd $LLVM_DIR && git clone https://github.com/google/llvm-propeller.git
>
> $ mkdir $BUILD_DIR && cd $BUILD_DIR
>
> $ cmake -G Ninja -DLLVM_ENABLE_PROJECTS="clang;lld;compiler-rt" … \
>    $LLVM_DIR/llvm-propeller/llvm
>
> $ ninja -j25
>
> $ export PATH=$BUILD_DIR/bin:$PATH
>
>
> # Let’s Propeller-optimize the following program:
>
>
> # Step 1: Build the peak optimized binary with an additional flag.
>
> $ clang++ -O2 main.cc callee.cc -fpropeller-label -o a.out.labels -fuse-ld=lld
>
> # Step 2: Profile the binary, only one side of the branch is executed.
> $ perf record -e cycles:u -j any,u -- ./a.out.labels 1000000 2 >&  /dev/null
>
>
> # Step 3: Convert the profiles using the tool provided
> $ $LLVM_DIR/llvm-propeller/create_llvm_prof  --format=propeller \
>   --binary=./a.out.labels --profile=perf.data  --out=perf.propeller
>
>
> # Step 4: Re-Optimize with Propeller, repeat Step 1 with propeller flag changed.
> $ clang++ -O2 main.cc callee.cc -fpropeller-optimize=perf.propeller -fuse-ld=lld
>
> In Step 4, the optimized bitcode from Step 1 can be reused if it was saved,
> as Propeller is active only during the compile back end and link.  The
> optimized binary has a different layout of the basic blocks in main, keeping
> the executed blocks together and splitting out the cold blocks.
>
> Some of the key points:
>
> +  Added support for basic block sections, similar to function sections, where
> each basic block can reside in its own section.
>
> +  Jump instructions need 32-bit relocations and subsequent linker relaxations
> after basic block layout.  We would like to add a new relocation type for jump
> instructions to make it easier for relaxations and guarantee correctness.
>
> +  Added support in the linker to read profiles (PMU LBR) and discover optimal
> basic block layout using the Ex-TSP algorithm described here:
> https://arxiv.org/abs/1809.04676
>
> +  Added support in the linker to re-order basic block sections in any
> specified sequence (can be done with symbol ordering file).  This requires
> linker relaxations to delete and shrink branches across basic blocks.
>
> +  Compared our system to BOLT and showed that it can produce similar
> performance improvements with much lower memory and time overheads.  Our
> experiments are on very large warehouse-scale benchmarks and SPEC 2017.
>
> +  Listed why this cannot be done as part of PGO itself.  Post-link
> optimizations are able to transform the binary using accurate profiles,
> while PGO passes suffer from profile imprecision.
>
> +  Updated DebugInfo and CFI to account for arbitrary ordering of basic blocks
> via basic block sections.
>
> +  Discussed CFI handling, which is sub-optimal: it bloats object file and
> binary sizes much more than DebugInfo does, due to the lack of support for
> discontiguous address ranges.  We would like to make a case for supporting
> discontiguous ranges in CFI, which would make basic block sections much
> cheaper.
>
> Detailed RFC document here :
> https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf
>
> Please let us know what you think,
> Thanks
> Sri on behalf of the team.
_______________________________________________
LLVM Developers mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations

Jeremy Morse via llvm-dev
In reply to this post by Jeremy Morse via llvm-dev


On Wed, Sep 25, 2019 at 5:02 PM Eli Friedman via llvm-dev <[hidden email]> wrote:
My biggest question about this architecture is about when propeller runs basic block reordering within a function.  It seems like a lot of the complexity comes from using the proposed -fbasicblock-sections to generate mangled ELF, and then re-parsing the mangled ELF as a separate step.  I'm not sure that's the right approach, long-term.

Splitting every basic block into its own section introduces overhead, like you described.  And it's likely more complex on non-x86 targets, which have a greater variety of conditional branches.  And the reordering itself involves a bunch of x86 and ELF-specific code.

I'd like to suggest an alternative: instead of performing basic block reordering and function splitting by manipulating the ELF files, you could perform reordering and splitting as part of link-time code generation, as an MIR pass that runs just before the assembly printer.  MIR is almost exactly the form you want for this sort of manipulation: it has basic blocks which correspond closely to the final binary, and a high-level representation of branch instructions.

We considered this approach for Propeller.  It is also currently being explored, in a similar form, as an alternative to CSFDO that uses PMU samples.
 
And it's before the DWARF/CFI emission, so you don't need to worry about fixing them afterwards.  This should take less code overall, and much less target-specific code. And infrastructure for function splitting would be useful for non-Propeller workflows.

There are some minor downsides to this approach I can think of.  You lose a little flexibility, in that you can't mix blocks from different functions together, but you aren't doing that anyway, from your description? 

One of the main design objectives of Propeller is the capability to do interprocedural code transformations (reordering, cloning, deduplication, etc.), so this would not be a minor downside.  Function/block alignment (for branch-misprediction reduction, etc.) will also need to be done as a global optimization in the future.

David

 
It might take more CPU time to re-do LTO code generation, but that's probably not a big deal for applications where the user is willing to collect two different kinds of PGO profiles for a program.

I understand you can't use a regular PGO profile for this, but it should be possible to pass in a Propeller-PGO profile separately to the compiler, and apply that to the MIR as a late pass in the compiler, separate from any regular PGO profile.  There are probably multiple ways you could do this; you could use basic block symbols, like your current implementation.

-Eli

-----Original Message-----
From: llvm-dev <[hidden email]> On Behalf Of Sriraman Tallam via llvm-dev
Sent: Tuesday, September 24, 2019 4:51 PM
To: [hidden email]
Subject: [EXT] [llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations

[snip: full text of the original RFC, quoted in the first message above]

Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations

Jeremy Morse via llvm-dev
In reply to this post by Jeremy Morse via llvm-dev
On Wed, Sep 25, 2019 at 3:44 PM Michael Spencer <[hidden email]> wrote:

>
> On Tue, Sep 24, 2019 at 4:52 PM Sriraman Tallam via llvm-dev <[hidden email]> wrote:
>>
>> [snip: full text of the original RFC, quoted in the first message above]
>
>
> The function ordering parts of this look like they are doing the same thing as lld's existing function reordering code, just with a different profile source and different ordering algorithm. Have you compared the performance of just the function reordering bits to what we already have (https://github.com/google/llvm-propeller/blob/plo-dev/lld/ELF/CallGraphSort.cpp)?
>
> I would expect that whichever ordering algorithm is best would be the best for both profile sources, so I would recommend looking at integrating the function reordering code with the existing code instead of adding a separate one.

Long term, we are looking at inter-procedural basic block layout, basic
block cloning, etc., and this encapsulates block layout, function
splitting, and function reordering.  We will look at how we can
integrate this with the existing call-graph layout algorithm.

>
> - Michael Spencer

Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations

Jeremy Morse via llvm-dev
In reply to this post by Jeremy Morse via llvm-dev
On Wed, Sep 25, 2019 at 3:44 PM Michael Spencer <[hidden email]> wrote:

>
> On Tue, Sep 24, 2019 at 4:52 PM Sriraman Tallam via llvm-dev <[hidden email]> wrote:
>>
>> Greetings,
>>
>> We, at Google, recently evaluated Facebook’s BOLT, a Post Link Optimizer
>> framework, on large google benchmarks and noticed that it improves key
>> performance metrics of these benchmarks by 2% to 6%, which is pretty impressive
>> as this is over and above a baseline binaryalready heavily optimized with
>> ThinLTO + PGO.  Furthermore, BOLT is also able to improve the performance of
>> binaries optimized via Context-Sensitive PGO.     While ThinLTO + PGO is also
>> profile guided and does very aggressive performance optimizations, there is
>> more room for performance improvements due to profile approximations while
>> applying the transformations.  BOLT uses exact profiles from the final binary
>> and is able to fill the gaps left by ThinLTO + PGO. The performance
>> improvements due to BOLT come from basic block layout, function reordering and
>> function splitting.
>>
>> While BOLT does an excellent job of squeezing extra performance from highly
>> optimized binaries with optimizations such as code layout, it has these major
>> issues:
>>
>>  * It does not take advantage of distributed build systems.
>>  * It has scalability issues and to rewrite a binary with a ~300M text segment
>> size:
>>  * Memory foot-print is 70G.
>>  * It takes more than 10 minutes to rewrite the binary.
>>
>> Similar to Full LTO, BOLT’s design is monolithic as it disassembles the
>> original binary, optimizes and rewrites the final binary in one process.  This
>> limits the scalability of BOLT and the memory and time overhead shoots up
>> quickly for large binaries.
>>
>> Inspired by the performance gains and to address the scalability issue of BOLT,
>> we went about designing a scalable infrastructure that can perform BOLT-like
>> post-link optimizations. In this RFC, we discuss our system, “Propeller”,
>> which can perform profile guided link time binary optimizations in a scalable
>> way and is friendly to distributed build systems.  Our system leverages the
>> existing capabilities of the compiler tool-chain and is not a stand alone tool.
>> Like BOLT, our system boosts the performance of optimized binaries via
>> link-time optimizations using accurate profiles of the binary. We discuss the
>> Propeller system and show how to do the whole program basic block layout using
>> Propeller.
>>
>> Propeller does whole program basic block layout at link time via basic block
>> sections.  We have added support for having each basic block in its own section
>> which allows the linker to do arbitrary reorderings of basic blocks to achieve
>> any desired fine-grain code layout which includes block layout, function
>> splitting and function reordering.  Our experiments on large real-world
>> applications and SPEC with code layout show that Propeller can optimize as
>> effectively as BOLT, with just 20% of its memory footprint and time overhead.
>>
>> An LLVM branch with propeller patches is available in the git repository here:
>> https://github.com/google/llvm-propeller/  We will upload individual patches of
>> the various elements for review.  We have attached a google doc describing the
>> Propeller system with Experiments in detail,
>> https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf
>>
>>
>> Quick Start - How to optimize further with Propeller?
>>
>> # git clone and build repo
>>
>> $ cd $LLVM_DIR && git clone https://github.com/google/llvm-propeller.git
>>
>> $ mkdir $BUILD_DIR && cd $BUILD_DIR
>>
>> $ cmake -G Ninja -DLLVM_ENABLE_PROJECTS="clang;lld;compiler-rt" … \
>>    $LLVM_DIR/llvm-propeller/llvm
>>
>> $ ninja -j25
>>
>> $ export PATH=$BUILD_DIR/bin:$PATH
>>
>>
>> # Let’s Propeller-optimize the following program:
>>
>>
>> # Step 1: Build the peak optimized binary with an additional flag.
>>
>> $ clang++ -O2 main.cc callee.cc -fpropeller-label -o a.out.labels -fuse-ld=lld
>>
>> # Step 2: Profile the binary, only one side of the branch is executed.
>> $ perf record -e cycles:u -j any,u -- ./a.out.labels 1000000 2 >&  /dev/null
>>
>>
>> # Step 3: Convert the profiles using the tool provided
>> $ $LLVM_DIR/llvm-propeller/create_llvm_prof  --format=propeller \
>>   --binary=./a.out.labels --profile=perf.data  --out=perf.propeller
>>
>>
>> # Step 4: Re-Optimize with Propeller, repeat Step 1 with propeller flag changed.
>> $ clang++ -O2 main.cc callee.cc -fpropeller-optimize=perf.propeller -fuse-ld=lld
>>
>> In Step 4, the optimized bit code can be used if it is saved in Step1 as
>> Propeller is active only during compile backend and link.  The optimized binary
>> has a different layout of the basic blocks in main to keep the executed blocks
>> together and split the cold blocks.
>>
>> Some of the key points:
>>
>> +  Added support for basic block sections, similar to function sections, where
>> each basic block can reside in its own section.
>>
>> +  Jump instructions need 32-bit relocations and subsequent linker relaxations
>> after basic block layout.  We would like to add a new relocation type for jump
>> instructions to make it easier for relaxations and guarantee correctness.
>>
>> +  Added support in the linker to read profiles (PMU LBR) and discover optimal
>> basic block layout using the Ex-TSP algorithm described here:
>> https://arxiv.org/abs/1809.04676
>>
>> +  Added support in the linker to re-order basic block sections in any
>> specified sequence (can be done with symbol ordering file).  This requires
>> linker relaxations to delete and shrink branches across basic blocks.
>>
>> +  Compared our system to BOLT and shown that it can produce similar
>> performance improvements with much lower memory and time overheads.  Our
>> experiments are on very large warehouse-scale benchmarks and SPEC 2017.
>>
>> +  Listed why this cannot be done as part of PGO itself.  Post-link
>> optimizations can transform the binary using accurate profiles, whereas PGO
>> passes suffer from profile imprecision.
>>
>> +  Updated DebugInfo and CFI to account for arbitrary ordering of basic blocks
>> via basic block sections.
>>
>> +  Discussed CFI handling, which is sub-optimal: it bloats object file and
>> binary sizes much more than DebugInfo does, due to the lack of support for
>> discontiguous address ranges.  We would like to make a case for supporting
>> discontiguous ranges in CFI, which would make basic block sections much
>> cheaper.
>>
>> Detailed RFC document here :
>> https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf
>>
>> Please let us know what you think,
>> Thanks
>> Sri on behalf of the team.
>
>
> The function ordering parts of this look like they are doing the same thing as lld's existing function reordering code, just with a different profile source and different ordering algorithm. Have you compared the performance of just the function reordering bits to what we already have (https://github.com/google/llvm-propeller/blob/plo-dev/lld/ELF/CallGraphSort.cpp)?

We looked at the code, and it appears that both CallGraphSort and
Propeller use the hfsort algorithm for function reordering, so we can
merge the two.  When the CallGraphSort option is passed, Propeller
will invoke the hfsort code to do function layout.  We would prefer
doing this as a follow-up patch, if that is alright, as it requires
more plumbing.
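For context, the hfsort family of algorithms can be sketched as greedy call-graph clustering. This is a simplified illustration of the algorithm's spirit, not the lld CallGraphSort.cpp implementation; the cluster-size cap and the lack of density-based tie-breaking are assumptions:

```python
# Illustrative sketch of hfsort-style greedy call-graph clustering.
# The size cap and missing density tie-breaking are simplifications;
# this is not the lld CallGraphSort.cpp code.

def hfsort_like(call_edges, sizes, max_cluster=4096):
    """call_edges: (caller, callee) -> call count; sizes: function -> bytes.
    Returns a function layout placing hot caller/callee pairs adjacently."""
    cluster = {f: [f] for f in sizes}   # cluster head -> member functions
    head = {f: f for f in sizes}        # function -> its cluster head
    for (caller, callee), _w in sorted(call_edges.items(), key=lambda kv: -kv[1]):
        a, b = head[caller], head[callee]
        if a == b:
            continue                    # already in the same cluster
        if sum(sizes[f] for f in cluster[a] + cluster[b]) > max_cluster:
            continue                    # merged cluster would be too big
        cluster[a].extend(cluster[b])   # append callee's cluster after caller's
        for f in cluster[b]:
            head[f] = a
        del cluster[b]
    return [f for members in cluster.values() for f in members]
```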

>
> I would expect that whichever ordering algorithm is best would be the best for both profile sources, so I would recommend looking at integrating the function reordering code with the existing code instead of adding a separate one.
>
> - Michael Spencer
_______________________________________________
LLVM Developers mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations

Jeremy Morse via llvm-dev
In reply to this post by Jeremy Morse via llvm-dev

 

From: Xinliang David Li <[hidden email]>
Sent: Wednesday, September 25, 2019 5:58 PM
To: Eli Friedman <[hidden email]>
Cc: Sriraman Tallam <[hidden email]>; llvm-dev <[hidden email]>
Subject: [EXT] Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations

 

 

 

On Wed, Sep 25, 2019 at 5:02 PM Eli Friedman via llvm-dev <[hidden email]> wrote:

My biggest question about this architecture is about when propeller runs basic block reordering within a function.  It seems like a lot of the complexity comes from using the proposed -fbasicblock-sections to generate mangled ELF, and then re-parsing the mangled ELF as a separate step.  I'm not sure that's the right approach, long-term.

Splitting every basic block into its own section introduces overhead, like you described.  And it's likely more complex on non-x86 targets, which have a greater variety of conditional branches.  And the reordering itself involves a bunch of x86 and ELF-specific code.

I'd like to suggest an alternative: instead of performing basic block reordering and function splitting by manipulating the ELF files, you could perform reordering and splitting as part of link-time code generation, as an MIR pass that runs just before the assembly printer.  MIR is almost exactly the form you want for this sort of manipulation: it has basic blocks which correspond closely to the final binary, and a high-level representation of branch instructions.

 

This was considered for Propeller.  It is currently being explored in a similar way as an alternative to CSFDO that uses PMU samples.

 

Makes sense.

 

And it's before the DWARF/CFI emission, so you don't need to worry about fixing them afterwards.  This should take less code overall, and much less target-specific code. And infrastructure for function splitting would be useful for non-Propeller workflows.

There are some minor downsides to this approach I can think of.  You lose a little flexibility, in that you can't mix blocks from different functions together, but you aren't doing that anyway, from your description? 

 

One of the main design objectives of Propeller is to have the capability to do interprocedural code transformations (reordering, cloning, dedupping etc), so this won't be a minor downside. Function/block alignment (for branch misprediction reduction etc) will also need to be done as a global optimization in the future.

 

Okay, so my suggestion doesn’t work. I’m still concerned the proposed design is going to push us in a direction we don’t want to go.  Particularly, if you’re going to attempt more complicated transforms, the problems caused by the limited information available in an ELF file will become more prominent.  I mean, yes, you can come up with a more complicated implicit contract between the compiler and Propeller about the exact format of Propeller basic blocks, and add additional symbol annotations, and eventually come up with an “IR” that allows Propeller to perform arbitrary code transforms.  But that’s inevitably going to be more complicated, and harder to understand, than a compiler IR designed for optimizations.

 

If we are going to stick with ELF-as-compiler-IR, I’d like to see more careful documentation of the contract between the compiler and the Propeller linker, so it’s clear what the compiler is/is not allowed to emit at compile-time, and so other compilers could be made compatible in the future.  The current Propeller implementation is doing some code rewrites that can’t be justified by just standard ELF section reordering.

 

-Eli



Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations

Jeremy Morse via llvm-dev
On Thu, Sep 26, 2019 at 12:39 PM Eli Friedman <[hidden email]> wrote:

>
>
>
> From: Xinliang David Li <[hidden email]>
> Sent: Wednesday, September 25, 2019 5:58 PM
> To: Eli Friedman <[hidden email]>
> Cc: Sriraman Tallam <[hidden email]>; llvm-dev <[hidden email]>
> Subject: [EXT] Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations
>
>
>
>
>
>
>
> On Wed, Sep 25, 2019 at 5:02 PM Eli Friedman via llvm-dev <[hidden email]> wrote:
>
> My biggest question about this architecture is about when propeller runs basic block reordering within a function.  It seems like a lot of the complexity comes from using the proposed -fbasicblock-sections to generated mangled ELF, and then re-parsing the mangled ELF as a separate step.  I'm not sure that's the right approach, long-term.
>
> Splitting every basic block into its own section introduces overhead, like you described.  And it's likely more complex on non-x86 targets, which have a greater variety of conditional branches.  And the reordering itself involves a bunch of x86 and ELF-specific code.
>
> I'd like to suggest an alternative: instead of perform basic block reordering and function splitting by manipulating the ELF files, you could perform reordering and splitting as part of link-time code generation, as an MIR pass that runs just before the assembly printer.  MIR is almost exactly the form you want for this sort of manipulation: it has basic blocks which correspond closely to the final binary, and a high-level representation of branch instructions.
>
>
>
> This was considered for Propeller.  This  is currently being explored in a similar way as an alternative of CSFDO which uses PMU samples.
>
>
>
> Makes sense.
>
>
>
> And it's before the DWARF/CFI emission, so you don't need to worry about fixing them afterwards.  This should take less code overall, and much less target-specific code. And infrastructure for function splitting would be useful for non-Propeller workflows.
>
> There are some minor downsides to this approach I can think of.  You lose a little flexibility, in that you can't mix blocks from different functions together, but you aren't doing that anyway, from your description?
>
>
>
> One of the main design objectives of Propeller is to have the capability to do interprocedural code transformations (reordering, cloning, dedupping etc), so this won't be a minor downside. Function/block alignment (for branch misprediction reduction etc) will also need to be done as a global optimization in the future.
>
>
>
> Okay, so my suggestion doesn’t work. I’m still concerned the proposed design is going to push us in a direction we don’t want to go.  Particularly, if you’re going to attempt more complicated transforms, the problems caused by the limited information available in an ELF file will become more prominent.  I mean, yes, you can come up with a more complicated implicit contract between the compiler and Propeller about the exact format of Propeller basic blocks, and add additional symbol annotations, and eventually come up with an “IR” that allows Propeller to perform arbitrary code transforms.  But that’s inevitably going to be more complicated, and harder to understand, than a compiler IR designed for optimizations.

Thanks for the feedback, I am not sure I fully understand your
concerns but let me try to make some of the things clearer:

*  Propeller relinks.  Specifically, it regenerates ELF object files
from MIR.  Even if MIR were serializable, we would still be starting
before the CFI instruction inserter pass and then regenerating the
native ELF objects.
*  All the code transformations we are planning for future
optimizations are on MIR, just like you noted in a previous email.
For example, prefetch insertion was one optimization we were looking
at: the input specifies points in the binary where prefetches should
be inserted, which we translate to basic blocks at the MIR level and
insert them there.  So, I don't think we will be limited by ELF files.
* In comparison, BOLT disassembles to MCInst to apply some of these
optimizations; we get to start from MIR without the cost of
disassembly.
* I think your concern comes from looking at the linker relaxations we
perform for x86_64 and wondering whether this would scale to RISC
targets.  We have not looked at RISC, but chatting briefly with Rui
(ruiu@), this looks doable even without thunks.

Thanks
Sri

>
>
>
> If we are going to stick with ELF-as-compiler-IR, I’d like to see more careful documentation of the contract between the compiler and the Propeller linker, so it’s clear what the compiler is/is not allowed to emit at compile-time, and so other compilers could be made compatible in the future.  The current Propeller implementation is doing some code rewrites that can’t be justified by just standard ELF section reordering.
>
>
>
> -Eli

Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations

Jeremy Morse via llvm-dev
> -----Original Message-----
> From: Sriraman Tallam <[hidden email]>
> Sent: Thursday, September 26, 2019 3:24 PM
> To: Eli Friedman <[hidden email]>
> Cc: Xinliang David Li <[hidden email]>; llvm-dev <[hidden email]>
> Subject: [EXT] Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link
> Optimizations
>
> On Thu, Sep 26, 2019 at 12:39 PM Eli Friedman <[hidden email]> wrote:
> >
> >
> >
> > From: Xinliang David Li <[hidden email]>
> > Sent: Wednesday, September 25, 2019 5:58 PM
> > To: Eli Friedman <[hidden email]>
> > Cc: Sriraman Tallam <[hidden email]>; llvm-dev <llvm-
> [hidden email]>
> > Subject: [EXT] Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link
> Optimizations
> >
> >
> >
> >
> >
> >
> >
> > On Wed, Sep 25, 2019 at 5:02 PM Eli Friedman via llvm-dev <llvm-
> [hidden email]> wrote:
> >
> > My biggest question about this architecture is about when propeller runs basic
> block reordering within a function.  It seems like a lot of the complexity comes
> from using the proposed -fbasicblock-sections to generated mangled ELF, and
> then re-parsing the mangled ELF as a separate step.  I'm not sure that's the right
> approach, long-term.
> >
> > Splitting every basic block into its own section introduces overhead, like you
> described.  And it's likely more complex on non-x86 targets, which have a
> greater variety of conditional branches.  And the reordering itself involves a
> bunch of x86 and ELF-specific code.
> >
> > I'd like to suggest an alternative: instead of perform basic block reordering and
> function splitting by manipulating the ELF files, you could perform reordering
> and splitting as part of link-time code generation, as an MIR pass that runs just
> before the assembly printer.  MIR is almost exactly the form you want for this
> sort of manipulation: it has basic blocks which correspond closely to the final
> binary, and a high-level representation of branch instructions.
> >
> >
> >
> > This was considered for Propeller.  This  is currently being explored in a similar
> way as an alternative of CSFDO which uses PMU samples.
> >
> >
> >
> > Makes sense.
> >
> >
> >
> > And it's before the DWARF/CFI emission, so you don't need to worry about
> fixing them afterwards.  This should take less code overall, and much less target-
> specific code. And infrastructure for function splitting would be useful for non-
> Propeller workflows.
> >
> > There are some minor downsides to this approach I can think of.  You lose a
> little flexibility, in that you can't mix blocks from different functions together,
> but you aren't doing that anyway, from your description?
> >
> >
> >
> > One of the main design objectives of Propeller is to have the capability to do
> interprocedural code transformations (reordering, cloning, dedupping etc), so
> this won't be a minor downside. Function/block alignment (for branch
> misprediction reduction etc) will also need to be done as a global optimization in
> the future.
> >
> >
> >
> > Okay, so my suggestion doesn’t work. I’m still concerned the proposed design
> is going to push us in a direction we don’t want to go.  Particularly, if you’re
> going to attempt more complicated transforms, the problems caused by the
> limited information available in an ELF file will become more prominent.  I mean,
> yes, you can come up with a more complicated implicit contract between the
> compiler and Propeller about the exact format of Propeller basic blocks, and add
> additional symbol annotations, and eventually come up with an “IR” that allows
> Propeller to perform arbitrary code transforms.  But that’s inevitably going to be
> more complicated, and harder to understand, than a compiler IR designed for
> optimizations.
>
> Thanks for the feedback, I am not sure I fully understand your
> concerns but let me try to make some of the things clearer:
>
> *  Propeller relinks. Specifically, it regenerates ELF object files
> from MIR.  Even if MIR were serializable, we would still be starting
> before CFI instruction inserter pass and then regenerate the native
> ELF objects.

Now I'm confused.  Why are you regenerating the ELF files?

I thought the workflow was something like the following:

1. The compiler (or LTO codegen) generates ELF files with basic block sections.
2. You link once without propeller optimizations.
3. You collect the propeller profile.
4. You use the same ELF files to relink with propeller optimizations.

That's sufficient to implement all the optimizations you described, as far as I can tell.

But it sounds like that's wrong?  It sounds like what you're describing is:

1. LTO codegen generates MIR files.
2. You convert the MIR files to ELF object files, and link without propeller optimizations.  (Does this step have any propeller-specific code generation differences?)
4. You collect the propeller profile.
5. You apply some propeller optimization to the MIR files.
6. You convert the optimized MIR files to ELF object files, with basic block sections enabled.
7. You link the ELF files, applying propeller reordering optimizations.

(And since you can't serialize MIR, you actually serialize LLVM IR, and run the code generator when you deserialize, to convert that to MIR.)

Is that correct?

If you have the MIR at the time you're making all the propeller optimization decisions, why is the linker rewriting raw x86 assembly, as opposed to performing the relevant transforms on MIR?

Why are you proposing to add a bunch of options to clang to manipulate LLVM code generation, given none of those options are actually used for Propeller workflows?

-Eli

Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations

Jeremy Morse via llvm-dev
On Thu, Sep 26, 2019 at 5:13 PM Eli Friedman <[hidden email]> wrote:

>
> > -----Original Message-----
> > From: Sriraman Tallam <[hidden email]>
> > Sent: Thursday, September 26, 2019 3:24 PM
> > To: Eli Friedman <[hidden email]>
> > Cc: Xinliang David Li <[hidden email]>; llvm-dev <[hidden email]>
> > Subject: [EXT] Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link
> > Optimizations
> >
> > On Thu, Sep 26, 2019 at 12:39 PM Eli Friedman <[hidden email]> wrote:
> > >
> > >
> > >
> > > From: Xinliang David Li <[hidden email]>
> > > Sent: Wednesday, September 25, 2019 5:58 PM
> > > To: Eli Friedman <[hidden email]>
> > > Cc: Sriraman Tallam <[hidden email]>; llvm-dev <llvm-
> > [hidden email]>
> > > Subject: [EXT] Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link
> > Optimizations
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Wed, Sep 25, 2019 at 5:02 PM Eli Friedman via llvm-dev <llvm-
> > [hidden email]> wrote:
> > >
> > > My biggest question about this architecture is about when propeller runs basic
> > block reordering within a function.  It seems like a lot of the complexity comes
> > from using the proposed -fbasicblock-sections to generated mangled ELF, and
> > then re-parsing the mangled ELF as a separate step.  I'm not sure that's the right
> > approach, long-term.
> > >
> > > Splitting every basic block into its own section introduces overhead, like you
> > described.  And it's likely more complex on non-x86 targets, which have a
> > greater variety of conditional branches.  And the reordering itself involves a
> > bunch of x86 and ELF-specific code.
> > >
> > > I'd like to suggest an alternative: instead of perform basic block reordering and
> > function splitting by manipulating the ELF files, you could perform reordering
> > and splitting as part of link-time code generation, as an MIR pass that runs just
> > before the assembly printer.  MIR is almost exactly the form you want for this
> > sort of manipulation: it has basic blocks which correspond closely to the final
> > binary, and a high-level representation of branch instructions.
> > >
> > >
> > >
> > > This was considered for Propeller.  This  is currently being explored in a similar
> > way as an alternative of CSFDO which uses PMU samples.
> > >
> > >
> > >
> > > Makes sense.
> > >
> > >
> > >
> > > And it's before the DWARF/CFI emission, so you don't need to worry about
> > fixing them afterwards.  This should take less code overall, and much less target-
> > specific code. And infrastructure for function splitting would be useful for non-
> > Propeller workflows.
> > >
> > > There are some minor downsides to this approach I can think of.  You lose a
> > little flexibility, in that you can't mix blocks from different functions together,
> > but you aren't doing that anyway, from your description?
> > >
> > >
> > >
> > > One of the main design objectives of Propeller is to have the capability to do
> > interprocedural code transformations (reordering, cloning, dedupping etc), so
> > this won't be a minor downside. Function/block alignment (for branch
> > misprediction reduction etc) will also need to be done as a global optimization in
> > the future.
> > >
> > >
> > >
> > > Okay, so my suggestion doesn’t work. I’m still concerned the proposed design
> > is going to push us in a direction we don’t want to go.  Particularly, if you’re
> > going to attempt more complicated transforms, the problems caused by the
> > limited information available in an ELF file will become more prominent.  I mean,
> > yes, you can come up with a more complicated implicit contract between the
> > compiler and Propeller about the exact format of Propeller basic blocks, and add
> > additional symbol annotations, and eventually come up with an “IR” that allows
> > Propeller to perform arbitrary code transforms.  But that’s inevitably going to be
> > more complicated, and harder to understand, than a compiler IR designed for
> > optimizations.
> >
> > Thanks for the feedback, I am not sure I fully understand your
> > concerns but let me try to make some of the things clearer:
> >
> > *  Propeller relinks. Specifically, it regenerates ELF object files
> > from MIR.  Even if MIR were serializable, we would still be starting
> > before CFI instruction inserter pass and then regenerate the native
> > ELF objects.
>
> Now I'm confused.  Why are you regenerating the ELF files?

TLDR: we regenerate ELF files from optimized IR to keep the cost of
generating basic block sections low.

If we did what you describe above, generating basic block sections
even before we profile, we would have to conservatively generate
sections for all functions.  This is unnecessarily expensive, and we
have shown that generating sections on demand, based on profiles, is
much cheaper.  Since the backend generation can be distributed or
parallelized, regenerating is not expensive to do.  If we could reduce
the size bloat of basic block sections, we could do what you suggest:
always build with sections for all basic blocks.

>
> I thought the workflow was something like the following:
>
> 1. The compiler (or LTO codegen) generates ELF files with basic block sections.

The correction here is that we generate first with basic block labels
and not sections.

> 2. You link once without propeller optimizations.

This is correct.

> 3. You collect the propeller profile.

This is correct.

> 4. You use the same ELF files to relink with propeller optimizations.

We use the same optimized IR files to regenerate the ELF files.  We
could use optimized MIR here, but MIR serialization is not well
supported yet and is part of the larger plan.  We only need to re-run
the CFI instruction inserter in MIR.  For future optimizations, we can
plug in here and make small code transformations, like prefetch
insertion.  If we attempted basic block reordering here, we would be
limited to intra-procedural reordering, which is sub-optimal.  Hence,
we generate basic block sections here only for the sampled functions,
which is a small percentage.

>
> That's sufficient to implement all the optimizations you described, as far as I can tell.
>
> But it sounds like that's wrong?  It sounds like what you're describing is:
>
> 1. LTO codegen generates MIR files.
> 2. You convert the MIR files to ELF object files, and link without propeller optimizations.  (Does this step have any propeller-specific code generation differences?)

Yes, like I said above, basic block labels are used here to identify
the basic blocks, which are later used in the last step to annotate
profiles.

> 4. You collect the propeller profile.
> 5. You apply some propeller optimization to the MIR files.
> 6. You convert the optimized MIR files to ELF object files, with basic block sections enabled.

Right, technically this is high-level IR now, as MIR is not
serializable, but that is fine for discussion.

> 7. You link the ELF files, applying propeller reordering optimizations.
>
> (And since you can't serialize MIR, you actually serialize LLVM IR, and run the code generator when you deserialize, to convert that to MIR.)
>
> Is that correct?

That is correct.

>
> If you have the MIR at the time you're making all the propeller optimization decisions, why is the linker rewriting raw x86 assembly, as opposed to performing the relevant transforms on MIR?

MIR is still one module at a time.  We cannot do inter-procedural
basic block layout there.  We can do much more advanced things at the
whole-program level in the linker.  The relaxation code is the
downside.

For more detail: we strongly considered this.  We could run something
like the thin link in ThinLTO, figure out the global layout, and hand
the relevant subsets of the global decision to each module.  This
looked more complicated, and the individual pieces from each module
would still need to be globally laid out again by the linker.  This
constrains what we can do for layout and also does not work well with
future optimizations like global alignment, as David pointed out.

>
> Why are you proposing to add a bunch of options to clang to manipulate LLVM code generation, given none of those options are actually used for Propeller workflows?

Where do you suggest the labelling and section options should live?
We modeled basic block sections after function sections in terms of
option handling.

Thanks
Sri

>
> -Eli

Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations

Jeremy Morse via llvm-dev
> -----Original Message-----
> From: Sriraman Tallam <[hidden email]>
> Sent: Thursday, September 26, 2019 5:31 PM
> To: Eli Friedman <[hidden email]>
> Cc: Xinliang David Li <[hidden email]>; llvm-dev <[hidden email]>
> Subject: [EXT] Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link
> Optimizations
>
> For more details, We strongly considered this.  We could run something
> like a thin link in thin lto figure out the global layout and hand out
> the relevant  subsets of the global decision  to each module.  This
> looked more complicated and the individual pieces from each module
> should still be globally laid out again by the linker.  This
> constraints us on what we can do for layout and also does not work
> well with future optimizations like global alignment like David
> pointed out.

Basically, you need to do branch relaxation etc. in the linker because the layout isn't sufficiently finalized until then?  And you *only* need to rewrite branches this way; never anything else, because other instructions don't need relaxation?

On most targets, we do branch relaxation in the compiler, unlike x86 which does it in the assembler.  I guess we could teach the compiler to emit some specific "relaxed" pattern, and teach the linker to reverse it.  Might be a little tricky to handle certain odd cases well; currently, for example, on AArch64 we sometimes insert a cpsr clobber as part of relaxation.  I'm not really happy with having to reimplement that for every target, but if we're sure it's limited to branches, I guess it's okay.
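The range check driving such relaxation can be sketched as follows. The 21-bit signed, 4-byte-aligned displacement here models AArch64's B.cond reach of roughly +/- 1 MiB; that modeling is an assumption for illustration, not a statement about the proposed implementation:

```python
# Sketch of the range check behind conditional-branch relaxation.  The
# default range_bits=21 models AArch64 B.cond (+/- 1 MiB, 4-byte
# aligned) and is an illustrative assumption.

def needs_relaxation(branch_addr, target_addr, range_bits=21):
    """True if the branch must be rewritten as an inverted short branch
    over an unconditional branch (the usual relaxed pattern)."""
    disp = target_addr - branch_addr
    lo = -(1 << (range_bits - 1))        # most negative encodable offset
    hi = (1 << (range_bits - 1)) - 4     # most positive, 4-byte aligned
    return not (lo <= disp <= hi)
```

A compiler emitting the relaxed pattern up front, as suggested above, would let the linker reverse it whenever this check turns false after final layout.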

> > Why are you proposing to add a bunch of options to clang to manipulate LLVM
> code generation, given none of those options are actually used for Propeller
> workflows?
>
> Where do you suggest labelling and section options should exist?  We
> looked at  basic block sections to be similar to function sections in
> terms of option handling?

Generating bitcode with/without propeller doesn't actually affect the generated bitcode, right?  So you could say that Propeller is enabled with "-Wl,--enable-propeller", and not change clang at all.  I'm not a fan of adding options just because we can.  If basic block sections are only used as a sort of secret handshake between the propeller compiler and the propeller linker, we can change that interface later, if we want; if we expose it as a clang option, we're committing to making basic block sections continue to work the same way until the end of time.

-Eli

Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations

Jeremy Morse via llvm-dev
In reply to this post by Jeremy Morse via llvm-dev


On Thu, Sep 26, 2019 at 5:31 PM Sriraman Tallam <[hidden email]> wrote:
On Thu, Sep 26, 2019 at 5:13 PM Eli Friedman <[hidden email]> wrote:
>
> > -----Original Message-----
> > From: Sriraman Tallam <[hidden email]>
> > Sent: Thursday, September 26, 2019 3:24 PM
> > To: Eli Friedman <[hidden email]>
> > Cc: Xinliang David Li <[hidden email]>; llvm-dev <[hidden email]>
> > Subject: [EXT] Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link
> > Optimizations
> >
> > On Thu, Sep 26, 2019 at 12:39 PM Eli Friedman <[hidden email]> wrote:
> > >
> > >
> > >
> > > From: Xinliang David Li <[hidden email]>
> > > Sent: Wednesday, September 25, 2019 5:58 PM
> > > To: Eli Friedman <[hidden email]>
> > > Cc: Sriraman Tallam <[hidden email]>; llvm-dev <llvm-
> > [hidden email]>
> > > Subject: [EXT] Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link
> > Optimizations
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Wed, Sep 25, 2019 at 5:02 PM Eli Friedman via llvm-dev <llvm-
> > [hidden email]> wrote:
> > >
> > > My biggest question about this architecture is about when propeller runs basic
> > block reordering within a function.  It seems like a lot of the complexity comes
> > from using the proposed -fbasicblock-sections to generated mangled ELF, and
> > then re-parsing the mangled ELF as a separate step.  I'm not sure that's the right
> > approach, long-term.
> > >
> > > Splitting every basic block into its own section introduces overhead, like you
> > described.  And it's likely more complex on non-x86 targets, which have a
> > greater variety of conditional branches.  And the reordering itself involves a
> > bunch of x86 and ELF-specific code.
> > >
> > > I'd like to suggest an alternative: instead of perform basic block reordering and
> > function splitting by manipulating the ELF files, you could perform reordering
> > and splitting as part of link-time code generation, as an MIR pass that runs just
> > before the assembly printer.  MIR is almost exactly the form you want for this
> > sort of manipulation: it has basic blocks which correspond closely to the final
> > binary, and a high-level representation of branch instructions.
> > >
> > >
> > >
> > > This was considered for Propeller.  This  is currently being explored in a similar
> > way as an alternative of CSFDO which uses PMU samples.
> > >
> > >
> > >
> > > Makes sense.
> > >
> > >
> > >
> > > And it's before the DWARF/CFI emission, so you don't need to worry about
> > fixing them afterwards.  This should take less code overall, and much less target-
> > specific code. And infrastructure for function splitting would be useful for non-
> > Propeller workflows.
> > >
> > > There are some minor downsides to this approach I can think of.  You lose a
> > little flexibility, in that you can't mix blocks from different functions together,
> > but you aren't doing that anyway, from your description?
> > >
> > >
> > >
> > > One of the main design objectives of Propeller is to have the capability to do
> > interprocedural code transformations (reordering, cloning, dedupping etc), so
> > this won't be a minor downside. Function/block alignment (for branch
> > misprediction reduction etc) will also need to be done as a global optimization in
> > the future.
> > >
> > >
> > >
> > > Okay, so my suggestion doesn’t work. I’m still concerned the proposed design
> > is going to push us in a direction we don’t want to go.  Particularly, if you’re
> > going to attempt more complicated transforms, the problems caused by the
> > limited information available in an ELF file will become more prominent.  I mean,
> > yes, you can come up with a more complicated implicit contract between the
> > compiler and Propeller about the exact format of Propeller basic blocks, and add
> > additional symbol annotations, and eventually come up with an “IR” that allows
> > Propeller to perform arbitrary code transforms.  But that’s inevitably going to be
> > more complicated, and harder to understand, than a compiler IR designed for
> > optimizations.
> >
> > Thanks for the feedback.  I am not sure I fully understand your
> > concerns, but let me try to clarify a few things:
> >
> > *  Propeller relinks. Specifically, it regenerates ELF object files
> > from MIR.  Even if MIR were serializable, we would still be starting
> > before CFI instruction inserter pass and then regenerate the native
> > ELF objects.
>
> Now I'm confused.  Why are you regenerating the ELF files?

TL;DR: we regenerate ELF files from optimized IR to keep the cost of
generating basic block sections low.

If we generated basic block sections before profiling, as you describe
above, we would have to conservatively generate sections for all
functions.  That is unnecessarily expensive, and we have shown that
generating sections on demand, based on profiles, is much cheaper.
Since backend code generation can be distributed or parallelized, the
regeneration itself is not costly.  If we could make the size bloat of
basic block sections cheaper, we could do what you suggest here, that
is, always build with sections for all basic blocks.

>
> I thought the workflow was something like the following:
>
> 1. The compiler (or LTO codegen) generates ELF files with basic block sections.

The correction here is that the first build uses basic block labels,
not sections.

> 2. You link once without propeller optimizations.

This is correct.

> 3. You collect the propeller profile.

This is correct.

> 4. You use the same ELF files to relink with propeller optimizations.

We use the same optimized IR files to regenerate the ELF files.  We
could use optimized MIR here, but MIR serialization is not well
supported yet; that is part of the larger plan.  We only need to
re-run the CFI instruction inserter pass on the MIR.  For future
optimizations, we can plug in here and make small code transformations
such as prefetch insertion.  If we attempted basic block reordering
here, we would be limited to intra-procedural reordering, which is
sub-optimal.  Hence, we generate basic block sections here only for
the sampled functions, which are a small percentage.

>
> That's sufficient to implement all the optimizations you described, as far as I can tell.
>
> But it sounds like that's wrong?  It sounds like what you're describing is:
>
> 1. LTO codegen generates MIR files.
> 2. You convert the MIR files to ELF object files, and link without propeller optimizations.  (Does this step have any propeller-specific code generation differences?)

Yes, as I said above, basic block labels are used here to identify the
basic blocks, which are then used in the last step to annotate
profiles.

> 4. You collect the propeller profile.
> 5. You apply some propeller optimization to the MIR files.
> 6. You convert the optimized MIR files to ELF object files, with basic block sections enabled.

Right; technically this is high-level IR now, since MIR is not
serializable, but that is fine for this discussion.

> 7. You link the ELF files, applying propeller reordering optimizations.
>
> (And since you can't serialize MIR, you actually serialize LLVM IR, and run the code generator when you deserialize, to convert that to MIR.)
>
> Is that correct?

That is correct.

>
> If you have the MIR at the time you're making all the propeller optimization decisions, why is the linker rewriting raw x86 assembly, as opposed to performing the relevant transforms on MIR?

MIR-level transformations are not excluded and can be done in the Propeller framework; they are just limited to module-level optimizations. Anything global/interprocedural needs to happen at (real) link time, with the assistance of module-level annotations.
 

MIR is still one module at a time.  We cannot do inter-procedural
basic block layout there.  We can do much more advanced things at the
whole-program level in the linker.  The relaxation code is the
downside.
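To make the inter-procedural point concrete, here is a toy greedy chaining pass in Python. This is not Propeller's actual algorithm, and the block names and edge weights are hypothetical; it only illustrates why a whole-program view lets hot chains cross function boundaries, which a per-module MIR pass cannot do:

```python
# Toy whole-program layout: greedily merge basic-block chains along the
# hottest fall-through edges.  Illustrative only; not Propeller's algorithm.

def layout_chains(edges):
    """edges: (src_block, dst_block, weight) tuples spanning all modules.
    Returns the final chains, hottest fall-throughs glued together."""
    chain_of = {}  # block name -> the chain (list) containing it

    def chain(b):
        if b not in chain_of:
            chain_of[b] = [b]
        return chain_of[b]

    for src, dst, _ in sorted(edges, key=lambda e: -e[2]):
        a, b = chain(src), chain(dst)
        # Glue only if src ends its chain and dst starts its chain,
        # so the hot edge becomes a fall-through.
        if a is not b and a[-1] == src and b[0] == dst:
            a.extend(b)
            for blk in b:
                chain_of[blk] = a

    seen, result = set(), []
    for c in chain_of.values():
        if id(c) not in seen:
            seen.add(id(c))
            result.append(c)
    return result

# Blocks from two different functions (foo.*, bar.*) interleave freely,
# because the "linker" sees the whole program:
edges = [("foo.entry", "foo.hot", 100),
         ("foo.hot", "bar.entry", 90),   # hot cross-function edge
         ("bar.entry", "bar.ret", 90),
         ("foo.entry", "foo.cold", 1)]
print(layout_chains(edges))
# -> [['foo.entry', 'foo.hot', 'bar.entry', 'bar.ret'], ['foo.cold']]
```

A per-module pass could never emit the `foo.hot` → `bar.entry` fall-through, since the two blocks live in different functions (possibly different object files).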

For more details: we strongly considered this.  We could run something
like the thin link in ThinLTO to figure out the global layout and hand
the relevant subsets of the global decision to each module.  This
looked more complicated, and the individual pieces from each module
would still have to be globally laid out again by the linker.

This won't work well, as cross-module inlining has not yet happened. Not only will there be problems with profile annotation, all the profile context sensitivity will be lost.
 
This constrains what we can do for layout, and it also does not work
well with future optimizations like global alignment, as David pointed
out.

>
> Why are you proposing to add a bunch of options to clang to manipulate LLVM code generation, given none of those options are actually used for Propeller workflows?


Propeller workflows include precise profile annotations, so the options are used there.

David

 
Where do you suggest the labelling and section options should live?
We modeled basic block sections on function sections in terms of
option handling.

Thanks
Sri

>
> -Eli

_______________________________________________
LLVM Developers mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations

Jeremy Morse via llvm-dev
In reply to this post by Jeremy Morse via llvm-dev


Op vr 27 sep. 2019 om 02:32 schreef Sriraman Tallam via llvm-dev <[hidden email]>:
>
> If you have the MIR at the time you're making all the propeller optimization decisions, why is the linker rewriting raw x86 assembly, as opposed to performing the relevant transforms on MIR?

MIR is still one module at a time.  We cannot do inter-procedural
basic block layout here.  We can do much more advanced stuff at the
whole-program level in the linker.  The relaxation code is the down
side.

For more details, We strongly considered this.  We could run something
like a thin link in thin lto figure out the global layout and hand out
the relevant  subsets of the global decision  to each module.  This
looked more complicated and the individual pieces from each module
should still be globally laid out again by the linker.  This
constrains us on what we can do for layout and also does not work
well with future optimizations like global alignment like David
pointed out.

Apologies for the naive question.
Why is MIR restricted to being only module at a time?
If the restriction of only being able to do per-module processing at the MIR level would go away, then all these optimizations could be done in a compile step, and no support would need to be added to a linker?
If the restriction to do cross-module optimization at the MIR level could be removed, would it become a better tradeoff to do this in an LTO compiler step rather than a linker step?

Thanks,

Kristof 


Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations

Jeremy Morse via llvm-dev


On Fri, Sep 27, 2019 at 12:37 AM Kristof Beyls via llvm-dev <[hidden email]> wrote:


Op vr 27 sep. 2019 om 02:32 schreef Sriraman Tallam via llvm-dev <[hidden email]>:
>
> If you have the MIR at the time you're making all the propeller optimization decisions, why is the linker rewriting raw x86 assembly, as opposed to performing the relevant transforms on MIR?

MIR is still one module at a time.  We cannot do inter-procedural
basic block layout here.  We can do much more advanced stuff at the
whole-program level in the linker.  The relaxation code is the down
side.

For more details, We strongly considered this.  We could run something
like a thin link in thin lto figure out the global layout and hand out
the relevant  subsets of the global decision  to each module.  This
looked more complicated and the individual pieces from each module
should still be globally laid out again by the linker.  This
constrains us on what we can do for layout and also does not work
well with future optimizations like global alignment like David
pointed out.

Apologies for the naive question.
Why is MIR restricted to being only module at a time?
If the restriction of only being able to do per-module processing at the MIR level would go away, then all these optimizations could be done in a compile step, and no support would need to be added to a linker?
If the restriction to do cross-module optimization at the MIR level could be removed, would it become a better tradeoff to do this in an LTO compiler step rather than a linker step?

LTO-style optimization uses the monolithic model that Propeller moves away from. The design principle is that as much preparation work as possible is done at the module level, while the global step stays lean and mean.

For layout/splitting/alignment/packing, the linker is the right place. It sees and creates many things the compiler does not, such as PLTs, stubs, trampolines, etc. Besides, using the linker allows inter-procedural reordering of blocks in native object files, as long as those files are compiled with the right annotations. For instance, we can compute a per-application optimal layout for the memcpy/memset functions from libc.a, depending on the profile data for that app.

David

 

Thanks,

Kristof 

Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations

Jeremy Morse via llvm-dev
In reply to this post by Jeremy Morse via llvm-dev
On Thu, Sep 26, 2019 at 6:16 PM Eli Friedman <[hidden email]> wrote:

>
> > -----Original Message-----
> > From: Sriraman Tallam <[hidden email]>
> > Sent: Thursday, September 26, 2019 5:31 PM
> > To: Eli Friedman <[hidden email]>
> > Cc: Xinliang David Li <[hidden email]>; llvm-dev <[hidden email]>
> > Subject: [EXT] Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link
> > Optimizations
> >
> > For more details, We strongly considered this.  We could run something
> > like a thin link in thin lto figure out the global layout and hand out
> > the relevant  subsets of the global decision  to each module.  This
> > looked more complicated and the individual pieces from each module
> > should still be globally laid out again by the linker.  This
> > constrains us on what we can do for layout and also does not work
> > well with future optimizations like global alignment like David
> > pointed out.
>
> Basically, you need to do branch relaxation etc. in the linker because the layout isn't sufficiently finalized until then?

Yes, this is correct.  Relaxation is needed because the layout is not
finalized until link time.




And you *only* need to rewrite branches this way; never anything else,
because other instructions don't need relaxation?

I cannot think of any other instance where such relaxation is needed.

>
> On most targets, we do branch relaxation in the compiler, unlike x86 which does it in the assembler.  I guess we could teach the compiler to emit some specific "relaxed" pattern, and teach the linker to reverse it.  Might be a little tricky to handle certain odd cases well; currently, for example, on AArch64 we sometimes insert a cpsr clobber as part of relaxation.  I'm not really happy with having to reimplement that for every target, but if we're sure it's limited to branches, I guess it's okay.

Understood.
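For readers following along, the relaxation being discussed can be sketched as a fixed-point iteration over the layout: start every branch in its short form, compute offsets, and widen any branch whose displacement no longer fits. The Python below is a toy with x86-style sizes; the block names and byte counts are hypothetical, and real relaxation handles more encodings (and, per Eli's point, target quirks like AArch64's cpsr-clobbering sequences):

```python
# Toy branch relaxation: iterate to a fixed point, widening short
# branches whose displacement no longer fits in a signed byte.
# Hypothetical blocks; real relaxation handles more encodings.

SHORT, NEAR = 2, 5  # x86-ish: short jmp = 2 bytes, near jmp = 5

def relax(blocks):
    """blocks: list of (name, body_size, branch_target_or_None) in
    layout order.  Returns {name: branch_size} for branching blocks."""
    size = {name: SHORT for name, _, tgt in blocks if tgt}
    changed = True
    while changed:
        changed = False
        # Recompute every block's start offset under current sizes.
        off, start = 0, {}
        for name, body, _ in blocks:
            start[name] = off
            off += body + size.get(name, 0)
        for name, body, tgt in blocks:
            if tgt and size[name] == SHORT:
                branch_end = start[name] + body + SHORT
                disp = start[tgt] - branch_end
                if not -128 <= disp <= 127:  # short form can't reach
                    size[name] = NEAR        # widen; may ripple, so loop
                    changed = True
    return size

# 200 bytes of 'b' push 'c' out of short range, so a's branch widens:
print(relax([("a", 10, "c"), ("b", 200, None), ("c", 4, None)]))
# -> {'a': 5}
```

Since widening one branch can push other targets out of range, the loop has to run to a fixed point, which is exactly why the final layout must be known before branch sizes are.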

>
> > > Why are you proposing to add a bunch of options to clang to manipulate LLVM
> > code generation, given none of those options are actually used for Propeller
> > workflows?
> >
> > Where do you suggest labelling and section options should exist?  We
> > looked at  basic block sections to be similar to function sections in
> > terms of option handling?
>
> Generating bitcode with/without propeller doesn't actually affect the generated bitcode, right?  So you could say that Propeller is enabled with "-Wl,--enable-propeller", and not change clang at all.  I'm not a fan of adding options just because we can.  If basic block sections are only used as a sort of secret handshake between the propeller compiler and the propeller linker, we can change that interface later, if we want; if we expose it as a clang option, we're committing to making basic block sections continue to work the same way until the end of time.

The generated MIR code is slightly different, as the later passes add
more CFI instructions, basic block labels, and extra direct jumps that
would otherwise have been fall-throughs.  So, some LLVM option is
needed.

I envisioned that basic block sections could also be useful on their
own, outside of Propeller.  Hence, we made them work like function
sections, with a separate option, -fbasicblock-sections.  The
propeller option is an umbrella option that makes Propeller easy to
invoke.

Thanks
Sri

>
> -Eli

Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations

Jeremy Morse via llvm-dev
> -----Original Message-----
> From: Sriraman Tallam <[hidden email]>
> Sent: Friday, September 27, 2019 9:43 AM
> To: Eli Friedman <[hidden email]>
> Cc: Xinliang David Li <[hidden email]>; llvm-dev <[hidden email]>
> Subject: [EXT] Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link
> Optimizations
>
> >
> > > > Why are you proposing to add a bunch of options to clang to manipulate
> LLVM
> > > code generation, given none of those options are actually used for Propeller
> > > workflows?
> > >
> > > Where do you suggest labelling and section options should exist?  We
> > > looked at  basic block sections to be similar to function sections in
> > > terms of option handling?
> >
> > Generating bitcode with/without propeller doesn't actually affect the
> generated bitcode, right?  So you could say that Propeller is enabled with "-Wl,--
> enable-propeller", and not change clang at all.  I'm not a fan of adding options
> just because we can.  If basic block sections are only used as a sort of secret
> handshake between the propeller compiler and the propeller linker, we can
> change that interface later, if we want; if we expose it as a clang option, we're
> committing to making basic block sections continue to work the same way until
> the end of time.
>
> The generated MIR code is slightly different as the later passes have
> more CFI instructions, basic block labels and extra direct jumps which
> would have been fall throughs otherwise.  So, some llvm option is
> needed.

At link (LTO codegen) time, yes.  But clang's bitcode generation doesn't change; only LTO code generation in lld changes.

> I envisioned that basic block sections could also be useful on its own
> outside of propeller.   Hence, we made it like function-sections with
> a separate option -fbasicblock-sections.  The propeller option is an
> umbrella option to make it easy to invoke Propeller.

I don't think this is really true.  Basic block labels means "insert labels that are useful for propeller profiling", and basic block sections means "split the function into multiple chunks that are convenient for propeller optimization".  So it's really only useful for propeller-like workflows.  And I'd be concerned that future changes to propeller could affect the behavior of these options in the future, if the behavior isn't specified in some rigorous way.

In any case, if we're adding flags to clang other than "enable propeller", I think we should have a separate thread on cfe-dev to discuss them. We don't expose every possible LLVM optimization and code generation option through clang. Users have a strong expectation of stability for clang options, and that means we need to evaluate each new option carefully, to decide whether it's something we really want to support indefinitely.

-Eli

Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations

Jeremy Morse via llvm-dev
On Fri, Sep 27, 2019 at 1:16 PM Eli Friedman <[hidden email]> wrote:

>
> > -----Original Message-----
> > From: Sriraman Tallam <[hidden email]>
> > Sent: Friday, September 27, 2019 9:43 AM
> > To: Eli Friedman <[hidden email]>
> > Cc: Xinliang David Li <[hidden email]>; llvm-dev <[hidden email]>
> > Subject: [EXT] Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link
> > Optimizations
> >
> > >
> > > > > Why are you proposing to add a bunch of options to clang to manipulate
> > LLVM
> > > > code generation, given none of those options are actually used for Propeller
> > > > workflows?
> > > >
> > > > Where do you suggest labelling and section options should exist?  We
> > > > looked at  basic block sections to be similar to function sections in
> > > > terms of option handling?
> > >
> > > Generating bitcode with/without propeller doesn't actually affect the
> > generated bitcode, right?  So you could say that Propeller is enabled with "-Wl,--
> > enable-propeller", and not change clang at all.  I'm not a fan of adding options
> > just because we can.  If basic block sections are only used as a sort of secret
> > handshake between the propeller compiler and the propeller linker, we can
> > change that interface later, if we want; if we expose it as a clang option, we're
> > committing to making basic block sections continue to work the same way until
> > the end of time.
> >
> > The generated MIR code is slightly different as the later passes have
> > more CFI instructions, basic block labels and extra direct jumps which
> > would have been fall throughs otherwise.  So, some llvm option is
> > needed.
>
> At link (LTO codegen) time, yes.  But clang's bitcode generation doesn't change; only LTO code generation in lld changes.

I see what you mean here.

>
> > I envisioned that basic block sections could also be useful on its own
> > outside of propeller.   Hence, we made it like function-sections with
> > a separate option -fbasicblock-sections.  The propeller option is an
> > umbrella option to make it easy to invoke Propeller.
>
> I don't think this is really true.  Basic block labels means "insert labels that are useful for propeller profiling", and basic block sections means "split the function into multiple chunks that are convenient for propeller optimization".  So it's really only useful for propeller-like workflows.  And I'd be concerned that future changes to propeller could affect the behavior of these options in the future, if the behavior isn't specified in some rigorous way.

The idea of basic block sections was seeded by Rui Ueyama.  When basic
block sections were originally considered, Propeller had not yet been
designed.  We looked at basic block sections as a finer granularity
than function sections.  Section-prefix-based code layout, where you
just create smaller code sections and let the linker do what it does
today, would get much better results with basic block sections.  A
symbol ordering file combined with basic block sections can produce
arbitrary orderings without invoking Propeller.  Function sections,
likewise, found uses after they were invented, such as Identical Code
Folding.
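As an illustration of that kind of follow-on use, section-level ICF can be sketched as folding sections with byte-identical contents. This is a toy Python sketch with hypothetical section names and bytes; real linkers also compare relocations and honor address significance (which is what the icf=safe vs. icf=all distinction is about):

```python
# Toy Identical Code Folding over per-function sections: sections with
# byte-identical contents fold to one representative.  Real linkers
# also compare relocations and honor address-significance (icf=safe).

def fold_identical(sections):
    """sections: {section_name: contents}.  Returns {name: kept_name}."""
    rep_by_content = {}  # contents -> first section seen with them
    return {name: rep_by_content.setdefault(content, name)
            for name, content in sections.items()}

sections = {".text.foo": b"\x55\x48\x89\xe5\xc3",
            ".text.bar": b"\x55\x48\x89\xe5\xc3",  # identical to foo
            ".text.baz": b"\xc3"}
print(fold_identical(sections))
```

Per-function (and, by extension, per-basic-block) sections make this trivial for the linker, because each foldable unit is an independently comparable chunk.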

>
> In any case, if we're adding flags to clang other than "enable propeller", I think we should have a separate thread on cfe-dev to discuss them. We don't expose every possible LLVM optimization and code generation option through clang. Users have a strong expectation of stability for clang options, and that means we need to evaluate each new option carefully, to decide whether it's something we really want to support indefinitely.

Understood.

Thanks
Sri

>
> -Eli

Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations

Jeremy Morse via llvm-dev
On Fri, Sep 27, 2019 at 2:08 PM Sriraman Tallam via llvm-dev
<[hidden email]> wrote:

>
> On Fri, Sep 27, 2019 at 1:16 PM Eli Friedman <[hidden email]> wrote:
> >
> > > -----Original Message-----
> > > From: Sriraman Tallam <[hidden email]>
> > > Sent: Friday, September 27, 2019 9:43 AM
> > > To: Eli Friedman <[hidden email]>
> > > Cc: Xinliang David Li <[hidden email]>; llvm-dev <[hidden email]>
> > > Subject: [EXT] Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link
> > > Optimizations
> > >
> > > >
> > > > > > Why are you proposing to add a bunch of options to clang to manipulate
> > > LLVM
> > > > > code generation, given none of those options are actually used for Propeller
> > > > > workflows?
> > > > >
> > > > > Where do you suggest labelling and section options should exist?  We
> > > > > looked at  basic block sections to be similar to function sections in
> > > > > terms of option handling?
> > > >
> > > > Generating bitcode with/without propeller doesn't actually affect the
> > > generated bitcode, right?  So you could say that Propeller is enabled with "-Wl,--
> > > enable-propeller", and not change clang at all.  I'm not a fan of adding options
> > > just because we can.  If basic block sections are only used as a sort of secret
> > > handshake between the propeller compiler and the propeller linker, we can
> > > change that interface later, if we want; if we expose it as a clang option, we're
> > > committing to making basic block sections continue to work the same way until
> > > the end of time.
> > >
> > > The generated MIR code is slightly different as the later passes have
> > > more CFI instructions, basic block labels and extra direct jumps which
> > > would have been fall throughs otherwise.  So, some llvm option is
> > > needed.
> >
> > At link (LTO codegen) time, yes.  But clang's bitcode generation doesn't change; only LTO code generation in lld changes.
>
> I see what you mean here.
>
> >
> > > I envisioned that basic block sections could also be useful on its own
> > > outside of propeller.   Hence, we made it like function-sections with
> > > a separate option -fbasicblock-sections.  The propeller option is an
> > > umbrella option to make it easy to invoke Propeller.
> >
> > I don't think this is really true.  Basic block labels means "insert labels that are useful for propeller profiling", and basic block sections means "split the function into multiple chunks that are convenient for propeller optimization".  So it's really only useful for propeller-like workflows.  And I'd be concerned that future changes to propeller could affect the behavior of these options in the future, if the behavior isn't specified in some rigorous way.
>
> The idea of basic block sections was seeded by Rui Ueyama.  When basic
> block sections was originally looked at, Propeller was not designed.
> We looked at basic block sections as finer granularity than function
> sections.  Section prefix based code layout where you just create
> smaller code sections and let the linker do what it does today would
> have much better results with basic block sections.  Symbol ordering
> file with basic block sections to do random orderings can be done
> without invoking Propeller.  Function sections has found uses after it
> was invented like Identical Code Folding.
>

I'm not sure function sections are entirely an apt comparison here,
for a few reasons:

 - Function sections mostly gave an easy way to do ICF.  Using MIR or
LLVM IR is also a fairly effective means, as is simply having the
linker compare bytes.  Using function sections can still run into
language-specific problems, hence icf=safe vs. icf=all, though pcc's
recent work helps a little via significant-address determination in
the front end.
 - Order files and other similar mechanisms work on externally visible
symbols and are fairly straightforward to use in an ABI-safe way (no
function-to-function fall-through, for example).
 - Basic block sections can run into multiple-entry-point problems,
depending on how they're laid out, as well as other ABI and visibility
issues.

This can likely be done safely, but there are definitely some concerns
about how you want to enable the use of basic block labels/sections.

-eric


> >
> > In any case, if we're adding flags to clang other than "enable propeller", I think we should have a separate thread on cfe-dev to discuss them. We don't expose every possible LLVM optimization and code generation option through clang. Users have a strong expectation of stability for clang options, and that means we need to evaluate each new option carefully, to decide whether it's something we really want to support indefinitely.
>
> Understood.
>
> Thanks
> Sri
>
> >
> > -Eli