RFC: ThinLTO Implementation Plan


Teresa Johnson
I've included below an RFC for implementing ThinLTO in LLVM, looking
forward to feedback and questions.
Thanks!
Teresa



RFC to discuss plans for implementing ThinLTO upstream. Background can
be found in slides from EuroLLVM 2015:
   https://drive.google.com/open?id=0B036uwnWM6RWWER1ZEl5SUNENjQ&authuser=0
As described in the talk, we have a prototype implementation and
would like to start staging patches upstream. This RFC describes a
breakdown of the major pieces. We would like to commit upstream
gradually in several stages, with all functionality off by default.
The core ThinLTO importing support and tuning will require frequent
changes and iteration during testing and tuning, so for that part we
would like to commit rapidly (still off by default). See the proposed staged
implementation described in the Implementation Plan section.


ThinLTO Overview
==============

See the talk slides linked above for more details. The following is a
high-level overview of the motivation.

Cross Module Optimization (CMO) is an effective means for improving
runtime performance, by extending the scope of optimizations across
source module boundaries. Without CMO, the compiler is limited to
optimizing within the scope of single source modules. Two solutions
for enabling CMO are Link-Time Optimization (LTO), which is currently
supported in LLVM and GCC, and Lightweight-Interprocedural
Optimization (LIPO). However, each of these solutions has limitations
that prevent it from being enabled by default. ThinLTO is a new
approach that attempts to address these limitations, with a goal of
being enabled more broadly. ThinLTO is designed with many of the same
principles as LIPO, and therefore shares its advantages, without any of its
inherent weaknesses. Unlike LIPO, where the module group decision is
made at profile-training runtime, ThinLTO makes the decision at
compile time, but in a lazy mode that facilitates large-scale
parallelism. The serial linker plugin phase is designed to be razor
thin and blazingly fast. By default this step only does the minimal
preparation work needed to enable the parallel lazy importing performed
later. ThinLTO aims to scale like a regular -O2 build, enabling
CMO on machines without large memory configurations, while also
integrating well with distributed build systems. Results from early
prototyping on the SPEC CPU2006 C++ benchmarks are in line with
expectations that ThinLTO can scale like -O2 while enabling much of the
CMO performed during a full LTO build.


A ThinLTO build is divided into 3 phases, which are referred to in the
following implementation plan:

phase-1: IR and Function Summary Generation (-c compile)
phase-2: Thin Linker Plugin Layer (thin archive linker step)
phase-3: Parallel Backend with Demand-Driven Importing
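As a rough illustration of how work is split across the three phases, here is a toy Python model. It is only a sketch under stated assumptions: the Module class, the summary fields, and the fixed instruction-count threshold are invented for illustration and are not the prototype's data structures. What it is meant to show is that phase-2 merges small summaries only, while each phase-3 job makes its own import decisions independently, so those jobs can run in parallel.

```python
from dataclasses import dataclass


@dataclass
class Module:
    name: str
    # Hypothetical layout: function name -> (instruction count, names of called functions)
    functions: dict


def phase1_summarize(module):
    """Phase-1 (-c compile): emit a small summary entry per importable function."""
    return {fn: {"module": module.name, "insts": insts, "callees": callees}
            for fn, (insts, callees) in module.functions.items()}


def phase2_combine(summaries):
    """Phase-2 (thin link): merge per-module summaries; no IR is loaded here."""
    combined = {}
    for summary in summaries:
        combined.update(summary)
    return combined


def phase3_backend(module, combined, threshold=10):
    """Phase-3 (one parallel backend job): import small external callees on demand."""
    imports = set()
    for _insts, callees in module.functions.values():
        for callee in callees:
            entry = combined.get(callee)
            if entry and entry["module"] != module.name and entry["insts"] <= threshold:
                imports.add(callee)
    return sorted(imports)
```

For example, with a.o's main calling helper (5 instructions) and big (500 instructions), both defined in b.o, phase3_backend(a, combined) returns ['helper'].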


Implementation Plan
================

This section gives a high-level breakdown of the ThinLTO support that
will be added, in roughly the order that the patches would be staged.
The patches are divided into three stages. The first stage contains a
minimal amount of preparation work that is not ThinLTO-specific. The
second stage contains most of the infrastructure for ThinLTO, which
will be off by default. The third stage includes
enhancements/improvements/tunings that can be performed after the main
ThinLTO infrastructure is in.

The second and third implementation stages will initially be very
volatile, requiring many iterations and much tuning with large applications
before they stabilize. Therefore it will be important to be able to commit
quickly during these implementation stages.


1. Stage 1: Preparation
-------------------------------

The first planned sets of patches are enablers for ThinLTO work:


a. LTO directory structure:

Restructure the LTO directory to remove the circular dependence that
would otherwise arise when the ThinLTO pass is added. Because ThinLTO is
being implemented as an SCC pass within Transforms/IPO, and leverages
the LTOModule class for linking in functions from other modules, IPO
would then require the LTO library. Since LTO (via LTOCodeGenerator)
already depends on IPO, this creates a circular dependence between LTO
and IPO. To break it, we need to split the lib/LTO directory/library
into lib/LTO/CodeGen and lib/LTO/Module, containing LTOCodeGenerator and
LTOModule, respectively. Only LTOCodeGenerator depends on IPO, which
removes the circular dependence.
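The dependence problem, and why the split resolves it, can be checked mechanically with a depth-first cycle search. The library graphs below are deliberately simplified to the edges named in the paragraph above (every other LLVM library dependence is omitted, and the LTO/CodeGen to LTO/Module edge is an assumption), so this illustrates the reasoning rather than the real build graph.

```python
def find_cycle(deps):
    """Return one dependence cycle as a list of library names, or None if acyclic."""
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {}
    path = []

    def visit(lib):
        state[lib] = IN_PROGRESS
        path.append(lib)
        for dep in deps.get(lib, []):
            if state.get(dep, UNVISITED) == IN_PROGRESS:
                # dep is already on the current DFS path, so we closed a loop.
                return path[path.index(dep):] + [dep]
            if state.get(dep, UNVISITED) == UNVISITED:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[lib] = DONE
        return None

    for lib in deps:
        if state.get(lib, UNVISITED) == UNVISITED:
            cycle = visit(lib)
            if cycle:
                return cycle
    return None


# Simplified graphs containing only the edges discussed above.
before_split = {
    "LTO": ["IPO"],  # LTOCodeGenerator runs IPO passes
    "IPO": ["LTO"],  # the ThinLTO pass would use LTOModule
}
after_split = {
    "LTO/CodeGen": ["IPO", "LTO/Module"],
    "IPO": ["LTO/Module"],
    "LTO/Module": [],
}
```

Here find_cycle(before_split) reports the LTO -> IPO -> LTO loop, and find_cycle(after_split) returns None.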


b. ELF wrapper generation support:

Implement an ELF-wrapped bitcode writer. In order to more easily interact
with tools such as $AR, $NM, and “$LD -r”, we plan to emit the phase-1
bitcode wrapped in ELF via the .llvmbc section, along with a symbol
table. The goal is both to interact with these tools without requiring
a plugin, and also to avoid doing partial LTO/ThinLTO across files
linked with “$LD -r” (i.e. the resulting object file should still
contain ELF-wrapped bitcode to enable ThinLTO at the full link step).
I will send a separate design document for these changes, but the
following is a high-level overview.

Support was added to LLVM for reading ELF-wrapped bitcode
(http://reviews.llvm.org/rL218078), but LLVM/Clang does not yet support
emitting bitcode wrapped in ELF. I plan to
add support for optionally generating bitcode in an ELF file
containing a single .llvmbc section holding the bitcode. Specifically,
the patch would add new options “emit-llvm-bc-elf” (object file) and
corresponding “emit-llvm-elf” (textual assembly code equivalent).
Eventually these would be automatically triggered under “-fthinlto -c”
and “-fthinlto -S”, respectively.

Additionally, a symbol table will be generated in the ELF file,
holding the function symbols within the bitcode. This facilitates
handling archives of the ELF-wrapped bitcode created with $AR, since
the archive will have a symbol table as well. The archive symbol table
enables gold to extract and pass to the plugin the constituent
ELF-wrapped bitcode files. Because “$LD -r” concatenates the .llvmbc
sections of its inputs into a single section, some handling needs to be
added to gold and to the backend driver to process each original module’s
bitcode.

The function index/summary will later be added as a special ELF
section alongside the .llvmbc sections.
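To make the proposed layout concrete, the following Python sketch builds a minimal relocatable ELF64 file whose only payload section is .llvmbc, then finds that section again by walking the section header table, which is roughly what a consumer of ELF-wrapped bitcode has to do. It makes several simplifying assumptions: little-endian ELF64 only, no symbol table (which the proposal does call for), and a 16-byte payload that is just the raw bitcode magic followed by zeros rather than a real module.

```python
import struct

# Little-endian ELF64 file header and section header layouts (64 bytes each).
EHDR = struct.Struct("<16sHHIQQQIHHHHHH")
SHDR = struct.Struct("<IIQQQQIIQQ")
SHT_PROGBITS, SHT_STRTAB = 1, 3


def build_elf(sections):
    """Build a minimal relocatable x86-64 ELF64 file from {name: contents}."""
    names, name_offsets = b"\0", []
    for name in list(sections) + [".shstrtab"]:
        name_offsets.append(len(names))
        names += name.encode() + b"\0"
    payloads = list(sections.values()) + [names]
    body, data_offsets = b"", []
    for data in payloads:
        data_offsets.append(EHDR.size + len(body))
        body += data
    body += b"\0" * (-len(body) % 8)  # align the section header table
    shoff = EHDR.size + len(body)
    table = SHDR.pack(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)  # mandatory null section
    for i, (name_off, off, data) in enumerate(zip(name_offsets, data_offsets, payloads)):
        sh_type = SHT_STRTAB if i == len(payloads) - 1 else SHT_PROGBITS
        table += SHDR.pack(name_off, sh_type, 0, 0, off, len(data), 0, 0, 1, 0)
    ident = b"\x7fELF" + bytes([2, 1, 1]) + b"\0" * 9  # ELFCLASS64, LSB, version 1
    header = EHDR.pack(ident, 1, 62, 1, 0, 0, shoff, 0, EHDR.size, 0, 0,
                       SHDR.size, len(payloads) + 1, len(payloads))
    return header + body + table


def find_section(elf, wanted):
    """Return the contents of the named section of a little-endian ELF64 file, or None."""
    if elf[:4] != b"\x7fELF" or elf[4] != 2 or elf[5] != 1:
        raise ValueError("not a little-endian ELF64 file")
    fields = EHDR.unpack_from(elf, 0)
    shoff, shentsize, shnum, shstrndx = fields[6], fields[11], fields[12], fields[13]
    headers = [SHDR.unpack_from(elf, shoff + i * shentsize) for i in range(shnum)]
    strtab_offset, strtab_size = headers[shstrndx][4], headers[shstrndx][5]
    names = elf[strtab_offset:strtab_offset + strtab_size]
    for sh_name, _type, _flags, _addr, offset, size, *_rest in headers:
        name = names[sh_name:names.index(b"\0", sh_name)].decode()
        if name == wanted:
            return elf[offset:offset + size]
    return None
```

Because the result is an ordinary ELF object, tools can treat it as such without a plugin, while a bitcode-aware reader extracts the .llvmbc payload.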


2. Stage 2: ThinLTO Infrastructure
----------------------------------------------

The next set of patches adds the base implementation of the ThinLTO
infrastructure, specifically the pieces required to make ThinLTO functional
and to generate correct, but not necessarily high-performing, binaries.
This stage also does not include the support needed to make debug
builds (-g) efficient with ThinLTO.


a. Clang/LLVM/gold linker options:

An early set of clang/llvm patches is needed to provide options to
enable ThinLTO (off by default), so that the rest of the
implementation can be disabled by default as it is added.
Specifically, the clang option -fthinlto (used instead of -flto) will
cause clang to perform the phase-1 emission of LLVM bitcode and the
function summary/index on a compile step, and to pass the appropriate
option to the gold plugin on a link step. The -thinlto option will be
added to the gold plugin and llvm-lto tool to launch the phase-2 thin
archive step. The -thinlto option will also be added to the ‘opt’ tool
to invoke it as a phase-3 parallel backend instance.


b. Thin-archive linking support in Gold plugin and llvm-lto:

Under the new plugin option (see above), the plugin needs to perform
the phase-2 (thin archive) link, which simply emits a combined function
map from the linked modules, without actually performing the normal
link. Corresponding support should be added to the standalone llvm-lto
tool to enable testing/debugging without involving the linker and
plugin.
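A minimal sketch of what building the combined function map amounts to, assuming each module's index is already available as a Python dict. The entry fields are hypothetical; the point is that this step only touches summaries, never function bodies.

```python
def build_combined_map(module_indexes):
    """Merge per-module indexes into one combined function map.

    module_indexes: {module path: {function name: summary entry}}.
    Returns {function name: [(module path, entry), ...]}. Only summaries are
    read, never function bodies, which is what keeps this serial step cheap.
    A function defined in several modules (e.g. a comdat copy) keeps every
    definition so that later pruning can choose one.
    """
    combined = {}
    for module_path, index in module_indexes.items():
        for fn, entry in index.items():
            combined.setdefault(fn, []).append((module_path, entry))
    return combined
```

A phase-3 backend that wants foo can then look up which module defines it and at which bitcode offset, without opening any other module first.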


c. ThinLTO backend support:

Support for running a phase-3 backend (including importing) on a
module should be added to the ‘opt’ tool under the new option. The
main changes under the option are to instantiate a Linker object used
to manage the process of linking imported functions into the module,
to efficiently read the combined function map, and to enable the
ThinLTO import pass.


d. Function index/summary support:

This includes infrastructure for writing and reading the function
index/summary section. As noted earlier, this will be encoded in a
special ELF section within the module, alongside the .llvmbc section
containing the bitcode. The thin archive generated by phase-2 of
ThinLTO simply contains all of the function index/summary sections
across the linked modules, organized for efficient function lookup.

Each function available for importing from the module has an
entry in the module’s function index/summary section and in the
resulting combined function map. Each function entry contains that
function’s offset within the bitcode file, used to efficiently locate
and quickly import just that function. The entry also contains summary
information (e.g. basic information determined during parsing, such as
the number of instructions in the function) that will be used to help
guide later import decisions. Because the contents of this section
will change frequently during ThinLTO tuning, it should also be marked
with a version id for backwards compatibility or version checking.
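The versioning requirement can be illustrated with a toy serialization of such a section. The magic string, version number, and record layout below are all hypothetical; the point is only that each entry carries the function's bitcode offset plus a summary field, and that a reader rejects a section whose version it does not understand.

```python
import struct

INDEX_MAGIC = b"FIDX"            # hypothetical section magic
INDEX_VERSION = 1                # hypothetical; bump whenever the entry layout changes
HEADER = struct.Struct("<4sII")  # magic, version, number of entries
ENTRY = struct.Struct("<QI")     # bitcode offset, instruction count
NAME_LEN = struct.Struct("<H")   # length prefix for each function name


def write_index(entries):
    """Serialize {function name: (bitcode offset, instruction count)}."""
    out = bytearray(HEADER.pack(INDEX_MAGIC, INDEX_VERSION, len(entries)))
    for name, (offset, insts) in entries.items():
        encoded = name.encode()
        out += NAME_LEN.pack(len(encoded)) + encoded + ENTRY.pack(offset, insts)
    return bytes(out)


def read_index(data):
    """Parse a serialized index, rejecting unknown magic or versions."""
    magic, version, count = HEADER.unpack_from(data, 0)
    if magic != INDEX_MAGIC:
        raise ValueError("not a function index section")
    if version != INDEX_VERSION:
        raise ValueError(f"unsupported function index version {version}")
    pos, entries = HEADER.size, {}
    for _ in range(count):
        (length,) = NAME_LEN.unpack_from(data, pos)
        pos += NAME_LEN.size
        name = data[pos:pos + length].decode()
        pos += length
        entries[name] = ENTRY.unpack_from(data, pos)
        pos += ENTRY.size
    return entries
```

A backend can then seek directly to the stored offset in the bitcode to lazily materialize just one function.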


e. ThinLTO importing support:

Support for the mechanics of importing functions from other modules,
which can go in gradually as a set of patches since it will be off by
default. Separate patches can include:

- BitcodeReader changes to use the function index to import/deserialize
a single function of interest (small changes, leverages existing lazy
streamer support).

- Minor LTOModule changes to pass the ThinLTO function to import, and
its index, into the bitcode reader.

- Marking of imported functions (for use in ThinLTO-specific symbol
linking and global DCE, for example). This can be in-memory initially,
but IR support may be required in order to support streaming bitcode
out and back in again after importing.

- ModuleLinker changes to do ThinLTO-specific symbol linking and
static promotion when necessary. The linkage type of imported
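functions changes to AvailableExternallyLinkage, for example. Statics
must be promoted in certain cases (e.g. when an imported function
references a static from its original module), and renamed
consistently in every module that refers to them.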

- GlobalDCE changes to support removing imported functions that were
not inlined (very small changes to existing pass logic).
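The promotion and renaming requirement can be sketched as follows. The naming scheme (a .llvm. suffix derived from a hash of the defining module's identifier) is an assumption chosen for illustration; what matters is that it is deterministic, so every backend that references a promoted static computes the same name without coordinating with the others.

```python
import hashlib


def promoted_name(local_name, defining_module):
    """Deterministic global name for a promoted static (hypothetical scheme).

    Both the defining module (which exports the promoted static) and every
    importing module must compute the same name, so the suffix depends only
    on the defining module's identifier.
    """
    suffix = hashlib.sha256(defining_module.encode()).hexdigest()[:8]
    return f"{local_name}.llvm.{suffix}"


def imported_symbol(fn, defining_module):
    """Name and linkage an imported function gets in the importing module."""
    name = fn["name"]
    if fn["linkage"] == "internal":
        name = promoted_name(name, defining_module)
    # The import is only a copy for inlining; the real definition stays in
    # the defining module, so the copy may be discarded if it is not inlined.
    return name, "available_externally"
```

The available_externally copy is exactly what the GlobalDCE change can then remove when it ends up not being inlined.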


f. ThinLTO Import Driver SCC pass:

Adds Transforms/IPO/ThinLTO.cpp with a framework for doing ThinLTO via
an SCC pass, enabled only under the -fthinlto option. The pass covers
use of the thin archive (global function index/summary), import
decision heuristics, invocation of the LTOModule/ModuleLinker routines
that perform the import, and any necessary callgraph updates and
verification.
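The import decision part of the pass can be approximated with a simple breadth-first worklist over the summaries. This is not an SCC pass and not the prototype's heuristic; the instruction threshold and depth limit are hypothetical knobs used only to show the shape of a summary-driven decision.

```python
from collections import deque


def select_imports(local_funcs, combined, threshold=100, max_depth=2):
    """Choose which external functions one module should import.

    local_funcs: {name: [callee, ...]} for functions defined in this module.
    combined:    {name: {"insts": int, "callees": [...]}} from the combined map.
    Callees of an imported function are considered too, up to max_depth, so
    a small wrapper and the small function it calls can both be inlined.
    """
    selected = set()
    work = deque((callee, 1) for callees in local_funcs.values() for callee in callees)
    while work:
        name, depth = work.popleft()
        if name in local_funcs or name in selected or name not in combined:
            continue  # already available here, already chosen, or unknown
        if combined[name]["insts"] > threshold:
            continue  # too large to be a good inlining candidate
        selected.add(name)
        if depth < max_depth:
            work.extend((callee, depth + 1) for callee in combined[name]["callees"])
    return selected
```

Raising max_depth lets imports follow longer chains of small functions, at the cost of more importing work per backend.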


g. Backend Driver:

For a single-node build, the gold plugin can simply write a makefile
and fork the parallel backend instances directly via parallel make.
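A sketch of the single-node backend driver, assuming the plugin knows the IR input for each module and the object path it should produce. The backend command is left as a hypothetical make variable, THINLTO_BACKEND, because the exact command line is not specified here.

```python
def write_backend_makefile(jobs, index_path, backend="$(THINLTO_BACKEND)"):
    """Return makefile text with one rule per phase-3 backend job.

    jobs: {IR input path: object file path}. The recipe command is a
    placeholder make variable, not a real tool invocation.
    """
    lines = [".PHONY: all", "all: " + " ".join(jobs.values()), ""]
    for ir_path, obj_path in jobs.items():
        # Each object is declared to depend on its own IR and the combined
        # function map; modules it imports from are not listed in this sketch.
        lines.append(f"{obj_path}: {ir_path} {index_path}")
        lines.append(f"\t{backend} {ir_path} -o {obj_path}")
        lines.append("")
    return "\n".join(lines)
```

Running make -jN on the result runs the independent backend rules in parallel. The same output could also serve the distributed case described under Distributed Build System Integration, where the plugin writes the makefile and exits.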


3. Stage 3: ThinLTO Tuning and Enhancements
----------------------------------------------------------------

This refers to the patches that are not required for ThinLTO to work,
but rather improve compile time, memory usage, run-time performance, and
usability.


a. Lazy Debug Metadata Linking:

The prototype implementation included lazy importing of module-level
metadata during the ThinLTO pass finalization (i.e. after all function
importing is complete). This applies to all module-level metadata,
not just debug metadata, although debug metadata is the largest. It can
be added as a separate set of patches, with changes to BitcodeReader,
ValueMapper, and ModuleLinker.


b. Import Tuning:

Tuning the import strategy will be an iterative process that will
continue to be refined over time. It involves several different types
of changes: adding support for recording additional metrics in the
function summary, such as profile data and optional heavier-weight IPA
analyses, and tuning the import heuristics based on the summary and
callsite context.


c. Combined Function Map Pruning:

The combined function map can be pruned of functions that are unlikely
to benefit from being imported. For example, during the phase-2 thin
archive plugin step, we can safely omit large and (with profile data)
cold functions, which are unlikely to benefit from being inlined.
Additionally, all but one copy of comdat functions can be suppressed.
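A minimal pruning pass over the combined map might look like this. The size limit, the profile format ({function: call count}), and the cold threshold are assumptions for illustration.

```python
def prune_combined_map(combined, max_insts=200, profile=None, min_count=1):
    """Drop entries unlikely to benefit from importing.

    combined: {fn: [(module, {"insts": int, "comdat": bool}), ...]}.
    profile:  optional {fn: call count}; functions below min_count are cold.
    """
    pruned = {}
    for fn, defs in combined.items():
        module, entry = defs[0]
        if entry["insts"] > max_insts:
            continue  # large: unlikely to be inlined anyway
        if profile is not None and profile.get(fn, 0) < min_count:
            continue  # cold according to profile data
        # Keep a single copy of a comdat function; the copies are interchangeable.
        pruned[fn] = [(module, entry)] if entry.get("comdat") else defs
    return pruned
```

Without profile data only the size and comdat rules apply; passing a profile additionally drops functions that were never (or rarely) called.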


d. Distributed Build System Integration:

For a distributed build system, the gold plugin should write the
parallel backend invocations into a makefile, including the mapping
from the IR file to the real object file path, and exit. Additional
work needs to be done in the distributed build system itself to
distribute and dispatch the parallel backend jobs to the build
cluster.


e. Dependence Tracking and Incremental Compiles:

In order to support build systems that stage from local disks or
network storage, the plugin will optionally compute, for each module,
the set of IR files it may import from. These sets can be computed
from profile data, if it exists, or from the symbol table and
heuristics if not. The dependence sets also enable incremental
backend compiles.
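Dependence sets and the incremental rebuild rule they enable can be sketched in a few lines. The inputs are hypothetical: a per-module set of functions it may import (however that set was computed) and a map from each function to its defining IR file.

```python
def import_dependences(module_imports, defined_in):
    """Map each module to the set of other IR files it may import from.

    module_imports: {module: set of function names it may import}.
    defined_in:     {function name: defining module}.
    """
    return {m: {defined_in[f] for f in fns if f in defined_in and defined_in[f] != m}
            for m, fns in module_imports.items()}


def backends_to_rerun(deps, changed):
    """A backend must rerun if its own IR changed or any IR it imports from changed."""
    return {m for m, srcs in deps.items() if m in changed or srcs & changed}
```

The same sets tell a staging build system exactly which IR files to copy next to each backend job.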



--
Teresa Johnson | Software Engineer | [hidden email] | 408-460-2413

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Re: RFC: ThinLTO Implementation Plan

Alex Rosenberg

Re: RFC: ThinLTO Implementation Plan

Xinliang David Li


On Wed, May 13, 2015 at 10:46 PM, Alex Rosenberg <[hidden email]> wrote:
> "ELF-wrapped bitcode" seems potentially controversial to me.
>
> What about ar, nm, and various ld implementations adds this requirement? What about the LLVM implementations of these tools is lacking?

Sorry, I cannot parse your questions properly. Can you make them clearer?

David

> Alex



Re: RFC: ThinLTO Implementation Plan

Teresa Johnson
On Wed, May 13, 2015 at 11:23 PM, Xinliang David Li
<[hidden email]> wrote:

>
>
> On Wed, May 13, 2015 at 10:46 PM, Alex Rosenberg <[hidden email]>
> wrote:
>>
>> "ELF-wrapped bitcode" seems potentially controversial to me.
>>
>> What about ar, nm, and various ld implementations adds this requirement?
>> What about the LLVM implementations of these tools is lacking?
>
>
> Sorry I can not parse your questions properly. Can you make it clearer?

Alex is asking what the issue is with ar, nm, ld -r and regular
bitcode that makes using ELF-wrapped bitcode easier.

The issue is that generally you need to provide a plugin to these
tools in order for them to understand and handle bitcode files. We'd
like standard tools to work without requiring a plugin as much as
possible. And in some cases we want them to be handled differently from
the way bitcode files are handled with the plugin.

nm: Without a plugin, normal bitcode files are inscrutable. When
provided the gold plugin, it can emit the symbols.

ar: Without a plugin, it will create an archive of bitcode files, but
without an index, so it can't be handled by the linker even with a
plugin on an -flto link. When ar is provided the gold plugin, it does
create an index, so the linker + gold plugin handle it appropriately
on an -flto link.

ld -r: Without a plugin, it fails when given bitcode inputs. When
provided the gold plugin, it handles them but compiles them all the
way through to native ELF object code via a partial LTO link.
This is where we would like to differ in behavior (while also not
requiring a plugin) with ELF-wrapped bitcode: we would like the ld -r
output file to still contain ELF-wrapped bitcode, delaying the LTO
until the full link step.
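One way to see why wrapping helps: a tool without a plugin typically decides how to treat an input from its leading magic bytes. The Python sketch below is a simplified classifier, not the logic of any particular ar, nm, or ld; the three magic values are the standard ELF magic, the raw LLVM bitcode magic, and the bitcode wrapper magic.

```python
ELF_MAGIC = b"\x7fELF"
BITCODE_MAGIC = b"BC\xc0\xde"                       # raw LLVM bitcode
WRAPPER_MAGIC = (0x0B17C0DE).to_bytes(4, "little")  # bitcode wrapper header


def classify(header):
    """Classify an input file by its leading bytes, as a plugin-less tool might."""
    if header.startswith(ELF_MAGIC):
        return "elf"      # sections and symbol table are readable by standard tools
    if header.startswith(BITCODE_MAGIC) or header.startswith(WRAPPER_MAGIC):
        return "bitcode"  # opaque unless the tool loads a plugin
    return "unknown"
```

ELF-wrapped bitcode falls into the first case, which is why ar, nm, and ld -r can process it like any other ELF object while the .llvmbc payload is carried through untouched.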

Let me know if that helps address your concerns.

Thanks,
Teresa

>
> David
>
>>
>>
>> Alex
>>
>
>



--
Teresa Johnson | Software Engineer | [hidden email] | 408-460-2413

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Re: RFC: ThinLTO Impementation Plan

Eric Christopher

So, what Alex is saying is that we have these tools as well and they understand bitcode just fine, as well as every object format - not just ELF. :)
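To make the format question concrete, here is a minimal sketch of recognizing a few file kinds from their leading magic bytes. The helper names `classify` and `classify_file` and the selection of formats are illustrative assumptions. This is not how llvm-nm or llvm-ar are implemented, and it omits COFF and other formats.

```python
# Hypothetical helper (not from any LLVM tool): identify a few file kinds
# from their leading magic bytes. The list is illustrative, not exhaustive.
MAGICS = [
    (b"\x7fELF", "elf"),
    (b"BC\xc0\xde", "llvm-bitcode"),                # raw bitcode: 'B' 'C' 0xC0DE
    (b"\xde\xc0\x17\x0b", "llvm-bitcode-wrapper"),  # wrapper magic 0x0B17C0DE, stored little-endian
    (b"!<arch>\n", "ar-archive"),                   # Unix ar archive
    (b"\xfe\xed\xfa\xce", "mach-o"),                # 32-bit Mach-O, big-endian
    (b"\xfe\xed\xfa\xcf", "mach-o"),                # 64-bit Mach-O, big-endian
    (b"\xce\xfa\xed\xfe", "mach-o"),                # 32-bit Mach-O, little-endian
    (b"\xcf\xfa\xed\xfe", "mach-o"),                # 64-bit Mach-O, little-endian
]


def classify(header: bytes) -> str:
    """Return a coarse file kind for the first bytes of a file."""
    for magic, kind in MAGICS:
        if header.startswith(magic):
            return kind
    return "unknown"


def classify_file(path: str) -> str:
    """Read just enough of `path` to classify it (8 bytes covers every magic above)."""
    with open(path, "rb") as f:
        return classify(f.read(8))
```

According to the thread, the system tools do not recognize raw bitcode unless a plugin is provided. Wrapping the bitcode in ELF makes such a file take the `elf` path instead. Tools that already recognize the `llvm-bitcode` kind do not need the wrapper.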

-eric



Re: RFC: ThinLTO Impementation Plan

Teresa Johnson
On Thu, May 14, 2015 at 7:22 AM, Eric Christopher <[hidden email]> wrote:
> So, what Alex is saying is that we have these tools as well and they
> understand bitcode just fine, as well as every object format - not just ELF.
> :)

Right, there are also LLVM-specific versions of these tools (llvm-ar,
llvm-nm) that handle bitcode much as the standard tool plus plugin does.
But the goal we are trying to achieve is to allow the standard system
versions of the tools to handle these files without requiring a
plugin. I know the LLVM tools handle other object formats, but I'm not
sure how that helps here. We're not planning to replace those tools,
just to allow the standard system versions to handle the intermediate
objects produced by ThinLTO.
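The following sketch shows why an ELF wrapper lets ELF-aware tooling reach the bitcode without an LLVM plugin. Such a tool can walk the section header table and pick out the section named `.llvmbc`. The sketch makes several assumptions. It handles only a 64-bit little-endian ELF file. It ignores extended section numbering. The function name `find_section` is hypothetical and not part of any existing tool.

```python
import struct

ELF64_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")  # Elf64_Ehdr, little-endian
ELF64_SHDR = struct.Struct("<IIQQQQIIQQ")          # Elf64_Shdr, little-endian


def find_section(data: bytes, wanted: str):
    """Return the raw contents of section `wanted` (e.g. '.llvmbc'), or None."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if data[4] != 2 or data[5] != 1:  # EI_CLASS == ELFCLASS64, EI_DATA == ELFDATA2LSB
        raise ValueError("this sketch only handles 64-bit little-endian ELF")

    hdr = ELF64_HEADER.unpack_from(data, 0)
    e_shoff = hdr[6]
    e_shentsize, e_shnum, e_shstrndx = hdr[11:14]

    def shdr(i):
        # Section header i: (name, type, flags, addr, offset, size, link, info, align, entsize)
        return ELF64_SHDR.unpack_from(data, e_shoff + i * e_shentsize)

    # Section names live in the string table indexed by e_shstrndx.
    strtab = shdr(e_shstrndx)
    names = data[strtab[4]:strtab[4] + strtab[5]]

    for i in range(e_shnum):
        sh = shdr(i)
        sh_name, sh_offset, sh_size = sh[0], sh[4], sh[5]
        end = names.index(b"\0", sh_name)
        if names[sh_name:end].decode() == wanted:
            return data[sh_offset:sh_offset + sh_size]
    return None
```

The function returns the section bytes unchanged. The RFC says that "$LD -r" concatenates the `.llvmbc` sections of its inputs, so splitting them back into per-module bitcode would be a separate step.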

Thanks,
Teresa

>
> -eric
>
>
> On Thu, May 14, 2015, 6:55 AM Teresa Johnson <[hidden email]> wrote:
>>
>> On Wed, May 13, 2015 at 11:23 PM, Xinliang David Li
>> <[hidden email]> wrote:
>> >
>> >
>> > On Wed, May 13, 2015 at 10:46 PM, Alex Rosenberg <[hidden email]>
>> > wrote:
>> >>
>> >> "ELF-wrapped bitcode" seems potentially controversial to me.
>> >>
>> >> What about ar, nm, and various ld implementations adds this
>> >> requirement?
>> >> What about the LLVM implementations of these tools is lacking?
>> >
>> >
>> > Sorry, I cannot parse your questions properly. Can you make them clearer?
>>
>> Alex is asking what the issue is with ar, nm, ld -r and regular
>> bitcode that makes using ELF-wrapped bitcode easier.
>>
>> The issue is that generally you need to provide a plugin to these
>> tools in order for them to understand and handle bitcode files. We'd
>> like standard tools to work without requiring a plugin as much as
>> possible. And in some cases we want them to be handled differently than
>> the way bitcode files are handled with the plugin.
>>
>> nm: Without a plugin, normal bitcode files are inscrutable. When
>> provided the gold plugin, it can emit the symbols.
>>
>> ar: Without a plugin, it will create an archive of bitcode files, but
>> without an index, so it can't be handled by the linker even with a
>> plugin on an -flto link. When ar is provided the gold plugin, it does
>> create an index, so the linker + gold plugin handle it appropriately
>> on an -flto link.
>>
>> ld -r: Without a plugin, it fails when provided bitcode inputs. When
>> provided the gold plugin, it handles them but compiles them all the
>> way through to native ELF object code via a partial LTO link.
>> This is where we would like to differ in behavior (while also not
>> requiring a plugin) with ELF-wrapped bitcode: we would like the ld -r
>> output file to still contain ELF-wrapped bitcode, delaying the LTO
>> until the full link step.
>>
>> Let me know if that helps address your concerns.
>>
>> Thanks,
>> Teresa
>>
>> >
>> > David
>> >
>> >>
>> >>
>> >> Alex
>> >>
>> >> > On May 13, 2015, at 7:44 PM, Teresa Johnson <[hidden email]>
>> >> > wrote:
>> >> >
>> >> > I've included below an RFC for implementing ThinLTO in LLVM, looking
>> >> > forward to feedback and questions.
>> >> > Thanks!
>> >> > Teresa
>> >> >
>> >> >
>> >> >
> >> >> > RFC to discuss plans for implementing ThinLTO upstream. Background can
> >> >> > be found in slides from EuroLLVM 2015:
> >> >> >    https://drive.google.com/open?id=0B036uwnWM6RWWER1ZEl5SUNENjQ&authuser=0
> >> >> > As described in the talk, we have a prototype implementation, and
> >> >> > would like to start staging patches upstream. This RFC describes a
> >> >> > breakdown of the major pieces. We would like to commit upstream
> >> >> > gradually in several stages, with all functionality off by default.
> >> >> > The core ThinLTO importing support and tuning will require frequent
> >> >> > change and iteration during testing and tuning, and for that part we
> >> >> > would like to commit rapidly (off by default). See the proposed staged
> >> >> > implementation described in the Implementation Plan section.
>> >> >
>> >> >
>> >> > ThinLTO Overview
>> >> > ==============
>> >> >
>> >> > See the talk slides linked above for more details. The following is a
>> >> > high-level overview of the motivation.
>> >> >
>> >> > Cross Module Optimization (CMO) is an effective means for improving
>> >> > runtime performance, by extending the scope of optimizations across
>> >> > source module boundaries. Without CMO, the compiler is limited to
>> >> > optimizing within the scope of single source modules. Two solutions
>> >> > for enabling CMO are Link-Time Optimization (LTO), which is currently
>> >> > supported in LLVM and GCC, and Lightweight-Interprocedural
>> >> > Optimization (LIPO). However, each of these solutions has limitations
>> >> > that prevent it from being enabled by default. ThinLTO is a new
>> >> > approach that attempts to address these limitations, with a goal of
>> >> > being enabled more broadly. ThinLTO is designed with many of the same
> >> >> > principles as LIPO, and therefore shares its advantages without any
> >> >> > of its inherent weaknesses. Unlike LIPO, where the module group
> >> >> > decision is made at profile training runtime, ThinLTO makes the
> >> >> > decision at compile time, but in a lazy mode that facilitates
> >> >> > large-scale parallelism. The serial linker plugin phase is designed
> >> >> > to be razor thin and blazingly fast. By default this step only does
> >> >> > minimal preparation work to enable the parallel lazy importing
> >> >> > performed later. ThinLTO aims to be scalable like a regular O2 build,
> >> >> > enabling CMO on machines without large memory configurations, while
> >> >> > also integrating well with distributed build systems. Results from
> >> >> > early prototyping on SPEC cpu2006 C++ benchmarks are in line with
> >> >> > expectations that ThinLTO can scale like O2 while enabling much of
> >> >> > the CMO performed during a full LTO build.
>> >> >
>> >> >
> >> >> > A ThinLTO build is divided into 3 phases, which are referred to in the
> >> >> > following implementation plan:
>> >> >
>> >> > phase-1: IR and Function Summary Generation (-c compile)
>> >> > phase-2: Thin Linker Plugin Layer (thin archive linker step)
>> >> > phase-3: Parallel Backend with Demand-Driven Importing
>> >> >
>> >> >
>> >> > Implementation Plan
>> >> > ================
>> >> >
>> >> > This section gives a high-level breakdown of the ThinLTO support that
>> >> > will be added, in roughly the order that the patches would be staged.
>> >> > The patches are divided into three stages. The first stage contains a
>> >> > minimal amount of preparation work that is not ThinLTO-specific. The
>> >> > second stage contains most of the infrastructure for ThinLTO, which
>> >> > will be off by default. The third stage includes
> >> >> > enhancements/improvements/tunings that can be performed after the
> >> >> > main ThinLTO infrastructure is in.
>> >> >
> >> >> > The second and third implementation stages will initially be very
> >> >> > volatile, requiring many iterations and much tuning with large apps
> >> >> > to stabilize. Therefore it will be important to do fast commits for
> >> >> > these implementation stages.
>> >> >
>> >> >
>> >> > 1. Stage 1: Preparation
>> >> > -------------------------------
>> >> >
>> >> > The first planned sets of patches are enablers for ThinLTO work:
>> >> >
>> >> >
>> >> > a. LTO directory structure:
>> >> >
> >> >> > Restructure the LTO directory to remove a circular dependence when the
> >> >> > ThinLTO pass is added. Because ThinLTO is being implemented as an SCC
> >> >> > pass within Transforms/IPO, and leverages the LTOModule class for
> >> >> > linking in functions from modules, IPO then requires the LTO library.
> >> >> > This creates a circular dependence between LTO and IPO. To break that,
> >> >> > we need to split the lib/LTO directory/library into lib/LTO/CodeGen
> >> >> > and lib/LTO/Module, containing LTOCodeGenerator and LTOModule,
> >> >> > respectively. Only LTOCodeGenerator has a dependence on IPO, removing
> >> >> > the circular dependence.
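The circular dependence described above can be checked mechanically. The following is a minimal Python sketch that models libraries as a directed graph and reports one cycle if any exists. The dependency edges are simplified assumptions for illustration (only the LTO/IPO relationships named in the plan), not LLVM's actual build metadata.

```python
from typing import Dict, List


def find_cycle(deps: Dict[str, List[str]]) -> List[str]:
    """Return one dependency cycle as a list of library names, or [] if acyclic.

    Depth-first search with three states: unvisited, on the current path,
    and fully explored. Reaching a node that is still on the current path
    means a back edge, i.e. a cycle.
    """
    UNVISITED, ON_PATH, DONE = 0, 1, 2
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(lib: str) -> List[str]:
        state[lib] = ON_PATH
        path.append(lib)
        for dep in deps.get(lib, []):
            dep_state = state.get(dep, UNVISITED)
            if dep_state == ON_PATH:
                # The cycle is the part of the current path starting at dep.
                return path[path.index(dep):] + [dep]
            if dep_state == UNVISITED:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[lib] = DONE
        return []

    for lib in deps:
        if state.get(lib, UNVISITED) == UNVISITED:
            cycle = visit(lib)
            if cycle:
                return cycle
    return []


# Before the split: the ThinLTO pass in IPO uses LTOModule from lib/LTO,
# and LTOCodeGenerator in lib/LTO runs IPO passes.
before = {"LTO": ["IPO"], "IPO": ["LTO"]}
# After the split (edges assumed for illustration): only LTO/CodeGen
# depends on IPO, and IPO depends only on LTO/Module.
after = {
    "LTO/CodeGen": ["IPO", "LTO/Module"],
    "IPO": ["LTO/Module"],
    "LTO/Module": [],
}
print(find_cycle(before))  # ['LTO', 'IPO', 'LTO']
print(find_cycle(after))   # []
```

Using three states rather than a plain visited set is what distinguishes a true back edge (a cycle) from simply reaching a library twice through independent paths.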
>> >> >
>> >> >
>> >> > b. ELF wrapper generation support:
>> >> >
> >> >> > Implement an ELF-wrapped bitcode writer. In order to more easily
> >> >> > interact with tools such as $AR, $NM, and “$LD -r”, we plan to emit
> >> >> > the phase-1 bitcode wrapped in ELF via the .llvmbc section, along with
> >> >> > a symbol table. The goal is both to interact with these tools without
> >> >> > requiring a plugin, and also to avoid doing partial LTO/ThinLTO across
> >> >> > files linked with “$LD -r” (i.e. the resulting object file should
> >> >> > still contain ELF-wrapped bitcode to enable ThinLTO at the full link
> >> >> > step). I will send a separate design document for these changes, but
> >> >> > the following is a high-level overview.
>> >> >
>> >> > Support was added to LLVM for reading ELF-wrapped bitcode
>> >> > (http://reviews.llvm.org/rL218078), but there does not yet exist
>> >> > support in LLVM/Clang for emitting bitcode wrapped in ELF. I plan to
>> >> > add support for optionally generating bitcode in an ELF file
>> >> > containing a single .llvmbc section holding the bitcode.
> >> >> > Specifically, the patch would add new options “emit-llvm-bc-elf”
> >> >> > (object file) and corresponding “emit-llvm-elf” (textual assembly code
> >> >> > equivalent). Eventually these would be automatically triggered under
> >> >> > “-fthinlto -c” and “-fthinlto -S”, respectively.
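To make "bitcode in an ELF file containing a single .llvmbc section" concrete at the byte level, here is a minimal Python sketch. It writes a 64-bit little-endian ELF relocatable file whose only payload section is .llvmbc (plus the mandatory null section and the section-name string table), then reads the payload back by name. It is a toy under stated assumptions: it omits the function symbol table the plan also calls for, hard-codes EM_X86_64 as the machine type, and uses a placeholder payload that only starts with the raw bitcode magic. It is not the planned LLVM/Clang writer.

```python
import struct

EHDR = struct.Struct("<16sHHIQQQIHHHHHH")  # Elf64_Ehdr, 64 bytes
SHDR = struct.Struct("<IIQQQQIIQQ")        # Elf64_Shdr, 64 bytes
SHT_PROGBITS, SHT_STRTAB = 1, 3
ET_REL = 1
EM_X86_64 = 62  # assumed target machine for this sketch


def wrap_bitcode_in_elf(bitcode: bytes) -> bytes:
    """Return a minimal ELF64 relocatable image with a single .llvmbc section."""
    shstrtab = b"\0.llvmbc\0.shstrtab\0"    # ".llvmbc" at 1, ".shstrtab" at 9
    bc_off = EHDR.size                      # payload right after the header
    str_off = bc_off + len(bitcode)
    sh_off = (str_off + len(shstrtab) + 7) & ~7  # align section headers to 8
    ident = b"\x7fELF" + bytes([2, 1, 1]) + bytes(9)  # 64-bit, LE, version 1
    header = EHDR.pack(ident, ET_REL, EM_X86_64, 1, 0, 0, sh_off, 0,
                       EHDR.size, 0, 0, SHDR.size, 3, 2)
    sections = [
        SHDR.pack(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),  # mandatory null section
        SHDR.pack(1, SHT_PROGBITS, 0, 0, bc_off, len(bitcode), 0, 0, 1, 0),
        SHDR.pack(9, SHT_STRTAB, 0, 0, str_off, len(shstrtab), 0, 0, 1, 0),
    ]
    padding = bytes(sh_off - (str_off + len(shstrtab)))
    return header + bitcode + shstrtab + padding + b"".join(sections)


def read_llvmbc(image: bytes) -> bytes:
    """Extract the .llvmbc payload from an ELF64 little-endian image."""
    fields = EHDR.unpack_from(image, 0)
    ident, e_shoff, e_shnum, e_shstrndx = fields[0], fields[6], fields[12], fields[13]
    if ident[:4] != b"\x7fELF" or ident[4] != 2 or ident[5] != 1:
        raise ValueError("not a 64-bit little-endian ELF file")
    shdrs = [SHDR.unpack_from(image, e_shoff + i * SHDR.size) for i in range(e_shnum)]
    strtab = shdrs[e_shstrndx]
    names = image[strtab[4]:strtab[4] + strtab[5]]  # sh_offset, sh_size
    for sh_name, _type, _flags, _addr, off, size, *_rest in shdrs:
        name = names[sh_name:names.index(b"\0", sh_name)]
        if name == b".llvmbc":
            return image[off:off + size]
    raise ValueError("no .llvmbc section")


# Placeholder payload: raw bitcode magic 'BC' 0xC0DE followed by zeros.
payload = b"BC\xc0\xde" + bytes(12)
image = wrap_bitcode_in_elf(payload)
assert read_llvmbc(image) == payload
```

Because the wrapper is an ordinary ELF file, tools that only understand ELF can open it without a plugin, while a bitcode-aware consumer looks up .llvmbc by name as above.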
>> >> >
>> >> > Additionally, a symbol table will be generated in the ELF file,
>> >> > holding the function symbols within the bitcode. This facilitates
>> >> > handling archives of the ELF-wrapped bitcode created with $AR, since
> >> >> > the archive will have a symbol table as well. The archive symbol table
> >> >> > enables gold to extract the constituent ELF-wrapped bitcode files and
> >> >> > pass them to the plugin. To support the concatenated .llvmbc section
> >> >> > generated by “$LD -r”, some handling needs to be added to gold and to
> >> >> > the backend driver to process each original module’s bitcode.
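The following Python sketch shows one way a consumer could recover each original module from a concatenated .llvmbc section. The RFC does not specify a framing, so this sketch assumes that each module is prefixed with LLVM's bitcode wrapper header (magic 0x0B17C0DE, followed by version, offset, size and CPU type as 32-bit little-endian fields), which makes module boundaries recoverable. It also assumes that the linker may insert zero padding between inputs. The helper names `wrap` and `split_concatenated` are hypothetical.

```python
import struct
from typing import List

WRAPPER = struct.Struct("<IIIII")  # magic, version, offset, size, cputype
WRAPPER_MAGIC = 0x0B17C0DE
RAW_MAGIC = b"BC\xc0\xde"          # raw bitcode starts with 'BC' 0xC0DE


def wrap(bitcode: bytes, cputype: int = 0) -> bytes:
    """Hypothetical helper: prefix raw bitcode with a wrapper header."""
    return WRAPPER.pack(WRAPPER_MAGIC, 0, WRAPPER.size, len(bitcode), cputype) + bitcode


def split_concatenated(section: bytes) -> List[bytes]:
    """Split a concatenated .llvmbc payload into per-module raw bitcode."""
    modules: List[bytes] = []
    pos = 0
    while pos < len(section):
        if section[pos] == 0:  # assumed zero padding inserted by the linker
            pos += 1
            continue
        magic, _version, offset, size, _cpu = WRAPPER.unpack_from(section, pos)
        if magic != WRAPPER_MAGIC:
            raise ValueError(f"expected wrapper header at offset {pos}")
        body = section[pos + offset:pos + offset + size]
        if body[:4] != RAW_MAGIC:
            raise ValueError(f"module at offset {pos} is not raw bitcode")
        modules.append(body)
        pos += offset + size
    return modules


mod_a = RAW_MAGIC + b"module-a"
mod_b = RAW_MAGIC + b"module-b!!"
section = wrap(mod_a) + bytes(4) + wrap(mod_b)  # as if concatenated by "ld -r"
assert split_concatenated(section) == [mod_a, mod_b]
```

Skipping zero bytes is safe here only because the wrapper magic's first byte (0xDE in little-endian order) is non-zero. A different framing would need a different padding rule.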
>> >> >
>> >> > The function index/summary will later be added as a special ELF
>> >> > section alongside the .llvmbc sections.
>> >> >
>> >> >
>> >> > 2. Stage 2: ThinLTO Infrastructure
>> >> > ----------------------------------------------
>> >> >
> >> >> > The next set of patches adds the base implementation of the ThinLTO
> >> >> > infrastructure, specifically the patches required to make ThinLTO
> >> >> > functional and generate correct but not necessarily high-performing
> >> >> > binaries. This stage does not include the support needed to make
> >> >> > debug info under -g efficient with ThinLTO.
>> >> >
>> >> >
>> >> > a. Clang/LLVM/gold linker options:
>> >> >
>> >> > An early set of clang/llvm patches is needed to provide options to
>> >> > enable ThinLTO (off by default), so that the rest of the
>> >> > implementation can be disabled by default as it is added.
> >> >> > Specifically, the clang option -fthinlto (used instead of -flto) will
> >> >> > cause clang to invoke the phase-1 emission of LLVM bitcode and
> >> >> > function summary/index on a compile step, and pass the appropriate
> >> >> > option to the gold plugin on a link step. The -thinlto option will be
> >> >> > added to the gold plugin and llvm-lto tool to launch the phase-2 thin
> >> >> > archive step. The -thinlto option will also be added to the ‘opt’
> >> >> > tool to invoke it as a phase-3 parallel backend instance.
>> >> >
>> >> >
>> >> > b. Thin-archive linking support in Gold plugin and llvm-lto:
>> >> >
>> >> > Under the new plugin option (see above), the plugin needs to perform
> >> >> > the phase-2 (thin archive) link which simply emits a combined function
> >> >> > map from the linked modules, without actually performing the normal
> >> >> > link. Corresponding support should be added to the standalone llvm-lto
> >> >> > tool to enable testing/debugging without involving the linker and
> >> >> > plugin.
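Here is a minimal Python sketch of the phase-2 step described above: merge each module's function index into one combined map keyed by function name. The per-module index layout (function name mapped to bitcode offset and instruction count) and the `FunctionEntry` type are assumptions for illustration, not the planned on-disk format.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FunctionEntry:            # hypothetical combined-map record
    module: str                 # phase-1 object file holding the bitcode
    offset: int                 # function body offset within that bitcode
    inst_count: int             # summary info used by import heuristics


def build_combined_map(
    per_module: Dict[str, Dict[str, Tuple[int, int]]]
) -> Dict[str, List[FunctionEntry]]:
    """Merge per-module indexes into {function name: [definitions]}.

    per_module maps module path -> {function name: (offset, inst count)}.
    Every definition is kept (e.g. comdat copies appear once per module);
    pruning is left to a later step.
    """
    combined: Dict[str, List[FunctionEntry]] = {}
    for module in sorted(per_module):  # deterministic output order
        for name, (offset, inst_count) in sorted(per_module[module].items()):
            combined.setdefault(name, []).append(
                FunctionEntry(module, offset, inst_count))
    return combined


index = build_combined_map({
    "a.o": {"foo": (128, 12), "inline_helper": (512, 3)},
    "b.o": {"bar": (64, 40), "inline_helper": (256, 3)},
})
print(index["inline_helper"])
```

No function bodies are read or linked in this step. Only the summaries are merged, which is what keeps the serial phase thin.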
>> >> >
>> >> >
>> >> > c. ThinLTO backend support:
>> >> >
> >> >> > Support for a phase-3 backend invocation (including importing) on a
> >> >> > module should be added to the ‘opt’ tool under the new option. The
> >> >> > main changes under the option are to instantiate a Linker object used
> >> >> > to manage the process of linking imported functions into the module,
> >> >> > to efficiently read the combined function map, and to enable the
> >> >> > ThinLTO import pass.
>> >> >
>> >> >
>> >> > d. Function index/summary support:
>> >> >
>> >> > This includes infrastructure for writing and reading the function
>> >> > index/summary section. As noted earlier this will be encoded in a
>> >> > special ELF section within the module, alongside the .llvmbc section
>> >> > containing the bitcode. The thin archive generated by phase-2 of
>> >> > ThinLTO simply contains all of the function index/summary sections
>> >> > across the linked modules, organized for efficient function lookup.
>> >> >
>> >> > Each function available for importing from the module contains an
>> >> > entry in the module’s function index/summary section and in the
>> >> > resulting combined function map. Each function entry contains that
>> >> > function’s offset within the bitcode file, used to efficiently locate
> >> >> > and quickly import just that function. The entry also contains
> >> >> > summary information (e.g. basic information determined during
> >> >> > parsing, such as the number of instructions in the function) that
> >> >> > will be used to help guide later import decisions. Because the
> >> >> > contents of this section will change frequently during ThinLTO
> >> >> > tuning, it should also be marked with a version id for backwards
> >> >> > compatibility or version checking.
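The versioning requirement can be illustrated with a small serializer. This Python sketch writes and reads a hypothetical binary layout: a version id and entry count, then the name, bitcode offset and instruction count for each function. It rejects sections whose version id it does not recognize. The layout and the `INDEX_VERSION` value are invented for the example.

```python
import struct
from typing import Dict, Tuple

INDEX_VERSION = 1              # hypothetical; bump whenever the layout changes
HEADER = struct.Struct("<HI")  # version id, number of entries
ENTRY = struct.Struct("<HQI")  # name length, bitcode offset, instruction count


def write_index(entries: Dict[str, Tuple[int, int]]) -> bytes:
    """Serialize {function name: (bitcode offset, instruction count)}."""
    out = [HEADER.pack(INDEX_VERSION, len(entries))]
    for name, (offset, inst_count) in entries.items():
        encoded = name.encode("utf-8")
        out.append(ENTRY.pack(len(encoded), offset, inst_count))
        out.append(encoded)
    return b"".join(out)


def read_index(data: bytes) -> Dict[str, Tuple[int, int]]:
    """Parse a section written by write_index, rejecting unknown versions."""
    version, count = HEADER.unpack_from(data, 0)
    if version != INDEX_VERSION:
        raise ValueError(f"unsupported function index version {version}")
    pos = HEADER.size
    entries: Dict[str, Tuple[int, int]] = {}
    for _ in range(count):
        name_len, offset, inst_count = ENTRY.unpack_from(data, pos)
        pos += ENTRY.size
        entries[data[pos:pos + name_len].decode("utf-8")] = (offset, inst_count)
        pos += name_len
    return entries


blob = write_index({"foo": (128, 12), "bar": (4096, 40)})
print(read_index(blob))
```

Checking the version before parsing anything else lets a reader fail cleanly on a stale or newer section instead of misinterpreting its bytes.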
>> >> >
>> >> >
>> >> > e. ThinLTO importing support:
>> >> >
>> >> > Support for the mechanics of importing functions from other modules,
>> >> > which can go in gradually as a set of patches since it will be off by
>> >> > default. Separate patches can include:
>> >> >
> >> >> > - BitcodeReader changes to use the function index to import/deserialize
> >> >> > a single function of interest (small changes, leverages existing lazy
> >> >> > streamer support).
>> >> >
>> >> > - Minor LTOModule changes to pass the ThinLTO function to import and
>> >> > its index into bitcode reader.
>> >> >
>> >> > - Marking of imported functions (for use in ThinLTO-specific symbol
> >> >> > linking and global DCE, for example). This can be in-memory initially,
> >> >> > but IR support may be required in order to support streaming bitcode
> >> >> > out and back in again after importing.
>> >> >
>> >> > - ModuleLinker changes to do ThinLTO-specific symbol linking and
>> >> > static promotion when necessary. The linkage type of imported
>> >> > functions changes to AvailableExternallyLinkage, for example. Statics
>> >> > must be promoted in certain cases, and renamed in consistent ways.
>> >> >
>> >> > - GlobalDCE changes to support removing imported functions that were
>> >> > not inlined (very small changes to existing pass logic).
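Static promotion needs a renaming scheme that the exporting module and every importing module compute identically. The Python sketch below uses one possible scheme: the local name plus a hash of the defining module's path. The ".llvm." separator, the hash choice and the string linkage labels are assumptions, not what the ModuleLinker patches prescribe.

```python
import hashlib
from typing import Dict, Set


def promoted_name(local_name: str, module_id: str) -> str:
    """Globally unique name for a promoted module-local symbol.

    Hypothetical scheme: the result depends only on the symbol name and a
    stable identifier of its defining module, so the exporting module and
    every importing module derive the same name independently.
    """
    digest = hashlib.sha1(module_id.encode("utf-8")).hexdigest()[:16]
    return f"{local_name}.llvm.{digest}"


def promote_referenced_locals(
    linkage: Dict[str, str], referenced: Set[str], module_id: str
) -> Dict[str, str]:
    """Map each internal symbol referenced by exported code to its new name.

    `linkage` maps symbol name -> linkage label ("internal", "external", ...).
    External symbols already have unique names and are left alone.
    """
    return {
        name: promoted_name(name, module_id)
        for name, kind in linkage.items()
        if kind == "internal" and name in referenced
    }


renames = promote_referenced_locals(
    {"counter": "internal", "helper": "internal", "api": "external"},
    {"counter", "api"},  # symbols used by functions other modules may import
    "lib/a.o",
)
print(renames)
```

Deriving the name from the defining module, rather than from whichever module happens to import it, is what keeps two static functions with the same name in different source files from colliding after promotion.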
>> >> >
>> >> >
>> >> > f. ThinLTO Import Driver SCC pass:
>> >> >
>> >> > Adds Transforms/IPO/ThinLTO.cpp with framework for doing ThinLTO via
>> >> > an SCC pass, enabled only under -fthinlto options. The pass includes
>> >> > utilizing the thin archive (global function index/summary), import
>> >> > decision heuristics, invocation of LTOModule/ModuleLinker routines
>> >> > that perform the import, and any necessary callgraph updates and
>> >> > verification.
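As a rough illustration of a size-based import decision, this Python sketch walks calls outward from a module's own functions. It imports any external callee whose summary instruction count is within a threshold, then considers that callee's calls too. It is a plain worklist rather than the SCC-ordered pass described above. The threshold of 100 is hypothetical, and the sketch assumes the call edges of candidate functions are already known (a real backend would discover them after reading each imported body).

```python
from collections import deque
from typing import Dict, List, Set, Tuple


def choose_imports(
    local_funcs: Set[str],
    calls: Dict[str, List[str]],           # caller -> callees (all modules)
    combined: Dict[str, Tuple[str, int]],  # function -> (module, inst count)
    threshold: int = 100,                  # hypothetical size limit
) -> Dict[str, str]:
    """Return {function name: source module} for functions worth importing."""
    imports: Dict[str, str] = {}
    worklist = deque(sorted(local_funcs))
    while worklist:
        caller = worklist.popleft()
        for callee in calls.get(caller, []):
            if callee in local_funcs or callee in imports or callee not in combined:
                continue
            module, inst_count = combined[callee]
            if inst_count <= threshold:
                imports[callee] = module
                worklist.append(callee)  # its callees may become inlinable too
    return imports


calls = {"main": ["small", "huge"], "small": ["tiny"], "tiny": []}
combined = {"small": ("b.o", 20), "huge": ("c.o", 5000), "tiny": ("d.o", 5)}
print(choose_imports({"main"}, calls, combined))
```

Here `huge` is skipped because it exceeds the threshold, while `tiny` is reached only because `small` was imported first.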
>> >> >
>> >> >
>> >> > g. Backend Driver:
>> >> >
> >> >> > For a single-node build, the gold plugin can simply write a makefile
>> >> > and fork the parallel backend instances directly via parallel make.
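For the single-node case, generating the makefile is straightforward. The Python sketch below emits one rule per backend job, mapping each phase-1 IR file to its native object. The command template, including the `opt -thinlto {input} -o {output}` form used in the example, is a placeholder assumption rather than the actual backend command line.

```python
import shlex
from typing import Dict


def backend_makefile(objects: Dict[str, str], command: str) -> str:
    """Build a Makefile with one rule per phase-3 backend job.

    objects maps each phase-1 IR file to the native object it should produce.
    `command` is a template with {input} and {output} placeholders; the real
    backend command line is an assumption of this sketch.
    """
    outputs = sorted(objects.values())
    lines = [".PHONY: all", "all: " + " ".join(outputs), ""]
    for ir_file, obj_file in sorted(objects.items()):
        lines.append(f"{obj_file}: {ir_file}")
        cmd = command.format(input=shlex.quote(ir_file), output=shlex.quote(obj_file))
        lines.append("\t" + cmd)  # make requires recipe lines to start with a tab
        lines.append("")
    return "\n".join(lines)


print(backend_makefile(
    {"a.thinlto.o": "a.native.o", "b.thinlto.o": "b.native.o"},
    "opt -thinlto {input} -o {output}",  # hypothetical backend command
))
```

Running `make -j<N>` on the result would then execute the independent backend jobs in parallel. The same file could also be handed to a distributed build system instead of being run locally.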
>> >> >
>> >> >
>> >> > 3. Stage 3: ThinLTO Tuning and Enhancements
>> >> > ----------------------------------------------------------------
>> >> >
>> >> > This refers to the patches that are not required for ThinLTO to work,
>> >> > but rather to improve compile time, memory, run-time performance and
>> >> > usability.
>> >> >
>> >> >
>> >> > a. Lazy Debug Metadata Linking:
>> >> >
> >> >> > The prototype implementation included lazy importing of module-level
> >> >> > metadata during the ThinLTO pass finalization (i.e. after all function
> >> >> > importing is complete). This actually applies to all module-level
> >> >> > metadata, not just debug metadata, although debug metadata is the
> >> >> > largest. This can be added as a separate set of patches, with changes
> >> >> > to BitcodeReader, ValueMapper, and ModuleLinker.
>> >> >
>> >> >
>> >> > b. Import Tuning:
>> >> >
>> >> > Tuning the import strategy will be an iterative process that will
>> >> > continue to be refined over time. It involves several different types
>> >> > of changes: adding support for recording additional metrics in the
> >> >> > function summary, such as profile data and optional heavier-weight IPA
> >> >> > analyses, and tuning the import heuristics based on the summary and
> >> >> > callsite context.
>> >> >
>> >> >
>> >> > c. Combined Function Map Pruning:
>> >> >
> >> >> > The combined function map can be pruned of functions that are unlikely
> >> >> > to benefit from being imported. For example, during the phase-2 thin
> >> >> > archive plugin step we can safely omit large and (with profile data)
> >> >> > cold functions, which are unlikely to benefit from being inlined.
> >> >> > Additionally, all but one copy of comdat functions can be suppressed.
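The pruning rules above can be sketched as a filter over the combined map. This Python sketch makes three assumptions: each entry carries an instruction count and an optional hot/cold flag from profile data, multiple entries for the same name are equivalent comdat copies, and the size limit is a hypothetical 500 instructions.

```python
from typing import Dict, List, NamedTuple, Optional


class Entry(NamedTuple):        # hypothetical combined-map record
    module: str                 # defining phase-1 object
    inst_count: int             # size from the function summary
    hot: Optional[bool]         # None when no profile data is available


def prune_combined_map(
    combined: Dict[str, List[Entry]], max_insts: int = 500  # hypothetical limit
) -> Dict[str, Entry]:
    """Drop large and (with profile data) cold functions; keep one comdat copy."""
    pruned: Dict[str, Entry] = {}
    for name, copies in combined.items():
        keep = [e for e in copies if e.inst_count <= max_insts and e.hot is not False]
        if keep:
            # Multiple entries for one name are assumed to be equivalent
            # comdat copies, so a single representative is enough.
            pruned[name] = keep[0]
    return pruned


combined = {
    "big": [Entry("a.o", 5000, True)],
    "cold": [Entry("a.o", 10, False)],
    "no_profile": [Entry("b.o", 10, None)],
    "inline_helper": [Entry("a.o", 3, True), Entry("b.o", 3, True)],
}
print(prune_combined_map(combined))
```

Functions without profile data (`hot is None`) are kept. Only functions positively known to be cold are dropped, which matches the "(with profile data)" qualifier above.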
>> >> >
>> >> >
>> >> > d. Distributed Build System Integration:
>> >> >
>> >> > For a distributed build system, the gold plugin should write the
>> >> > parallel backend invocations into a makefile, including the mapping
>> >> > from the IR file to the real object file path, and exit. Additional
>> >> > work needs to be done in the distributed build system itself to
> >> >> > distribute and dispatch the parallel backend jobs to the build cluster.
>> >> >
>> >> >
>> >> > e. Dependence Tracking and Incremental Compiles:
>> >> >
>> >> > In order to support build systems that stage from local disks or
>> >> > network storage, the plugin will optionally support computation of
>> >> > dependent sets of IR files that each module may import from. This can
>> >> > be computed from profile data, if it exists, or from the symbol table
>> >> > and heuristics if not. These dependence sets also enable support for
>> >> > incremental backend compiles.
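Here is a minimal Python sketch of the symbol-table route to dependence sets and incremental rebuilds. It assumes that each module's undefined symbols and a map from each symbol to its defining module are available. Because imported functions can themselves pull in functions from further modules, the sketch takes the transitive closure. It does not model the profile-data route, and the function names are hypothetical.

```python
from typing import Dict, Set


def dependence_sets(
    undefined: Dict[str, Set[str]],  # module -> symbols referenced, not defined
    definitions: Dict[str, str],     # symbol -> defining module
) -> Dict[str, Set[str]]:
    """Approximate, per module, the set of IR files it may import from."""
    direct = {
        module: {definitions[s] for s in symbols
                 if s in definitions and definitions[s] != module}
        for module, symbols in undefined.items()
    }
    closure: Dict[str, Set[str]] = {}
    for module in direct:
        seen: Set[str] = set()
        stack = list(direct[module])
        while stack:  # follow chains of possible imports
            dep = stack.pop()
            if dep != module and dep not in seen:
                seen.add(dep)
                stack.extend(direct.get(dep, ()))
        closure[module] = seen
    return closure


def modules_to_rebuild(
    deps: Dict[str, Set[str]],
    old_hashes: Dict[str, str],
    new_hashes: Dict[str, str],
) -> Set[str]:
    """A backend job is stale if its own IR or any IR it may import changed."""
    changed = {m for m in new_hashes if old_hashes.get(m) != new_hashes[m]}
    return {m for m in deps if m in changed or deps[m] & changed}


deps = dependence_sets(
    {"a.o": {"bar"}, "b.o": {"baz"}, "c.o": set()},
    {"foo": "a.o", "bar": "b.o", "baz": "c.o"},
)
print(deps)
print(modules_to_rebuild(deps, {"a.o": "1", "b.o": "1", "c.o": "1"},
                         {"a.o": "2", "b.o": "1", "c.o": "1"}))
```

The sets over-approximate what will actually be imported, so they can trigger unnecessary rebuilds. That errs on the safe side for incremental builds: a job is never skipped when one of its possible inputs changed.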
>> >> >
>> >> >
>> >> >
>> >> > --
> >> >> > Teresa Johnson | Software Engineer | [hidden email] | 408-460-2413
>> >> >
>> >
>> >
>>
>>
>>
>> --
>> Teresa Johnson | Software Engineer | [hidden email] | 408-460-2413
>>



--
Teresa Johnson | Software Engineer | [hidden email] | 408-460-2413


Re: RFC: ThinLTO Implementation Plan

Xinliang David Li-2
The design objective is to make ThinLTO mostly transparent to binutils tools, to enable easy integration with any build system in the wild. The 'pass-through' mode with 'ld -r' (instead of the partial LTO mode) is another reason.

David

On Thu, May 14, 2015 at 7:30 AM, Teresa Johnson <[hidden email]> wrote:
>> >> > that perform the import, and any necessary callgraph updates and
>> >> > verification.
>> >> >
>> >> >
>> >> > g. Backend Driver:
>> >> >
>> >> > For a single node build, the gold plugin can simply write a makefile
>> >> > and fork the parallel backend instances directly via parallel make.
>> >> >
>> >> >
>> >> > 3. Stage 3: ThinLTO Tuning and Enhancements
>> >> > ----------------------------------------------------------------
>> >> >
>> >> > This refers to the patches that are not required for ThinLTO to work,
>> >> > but rather to improve compile time, memory, run-time performance and
>> >> > usability.
>> >> >
>> >> >
>> >> > a. Lazy Debug Metadata Linking:
>> >> >
>> >> > The prototype implementation included lazy importing of module-level
>> >> > metadata during the ThinLTO pass finalization (i.e. after all
>> >> > function
>> >> > importing is complete). This actually applies to all module-level
>> >> > metadata, not just debug, although it is the largest. This can be
>> >> > added as a separate set of patches. Changes to BitcodeReader,
>> >> > ValueMapper, ModuleLinker
>> >> >
>> >> >
>> >> > b. Import Tuning:
>> >> >
>> >> > Tuning the import strategy will be an iterative process that will
>> >> > continue to be refined over time. It involves several different types
>> >> > of changes: adding support for recording additional metrics in the
>> >> > function summary, such as profile data and optional heavier-weight
>> >> > IPA
>> >> > analyses, and tuning the import heuristics based on the summary and
>> >> > callsite context.
>> >> >
>> >> >
>> >> > c. Combined Function Map Pruning:
>> >> >
>> >> > The combined function map can be pruned of functions that are
>> >> > unlikely
>> >> > to benefit from being imported. For example, during the phase-2 thin
>> >> > archive plug step we can safely omit large and (with profile data)
>> >> > cold functions, which are unlikely to benefit from being inlined.
>> >> > Additionally, all but one copy of comdat functions can be suppressed.
>> >> >
>> >> >
>> >> > d. Distributed Build System Integration:
>> >> >
>> >> > For a distributed build system, the gold plugin should write the
>> >> > parallel backend invocations into a makefile, including the mapping
>> >> > from the IR file to the real object file path, and exit. Additional
>> >> > work needs to be done in the distributed build system itself to
>> >> > distribute and dispatch the parallel backend jobs to the build
>> >> > cluster.
>> >> >
>> >> >
>> >> > e. Dependence Tracking and Incremental Compiles:
>> >> >
>> >> > In order to support build systems that stage from local disks or
>> >> > network storage, the plugin will optionally support computation of
>> >> > dependent sets of IR files that each module may import from. This can
>> >> > be computed from profile data, if it exists, or from the symbol table
>> >> > and heuristics if not. These dependence sets also enable support for
>> >> > incremental backend compiles.
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Teresa Johnson | Software Engineer | [hidden email] |
>> >> > <a href="tel:408-460-2413" value="+14084602413">408-460-2413
>> >> >
>> >> > _______________________________________________
>> >> > LLVM Developers mailing list
>> >> > [hidden email]         http://llvm.cs.uiuc.edu
>> >> > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>> >>
>> >> _______________________________________________
>> >> LLVM Developers mailing list
>> >> [hidden email]         http://llvm.cs.uiuc.edu
>> >> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>> >
>> >
>>
>>
>>
>> --
>> Teresa Johnson | Software Engineer | [hidden email] | <a href="tel:408-460-2413" value="+14084602413">408-460-2413
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> [hidden email]         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev



--
Teresa Johnson | Software Engineer | [hidden email] | 408-460-2413


_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Re: RFC: ThinLTO Implementation Plan

Eric Christopher
I'm not sure this is a particularly great assumption to make. We have to support a lot of different build systems and tools and concentrating on something that just binutils uses isn't particularly friendly here. I also can't imagine how it's necessary for any of the lto aspects as currently written in the proposal.

-eric

On Thu, May 14, 2015 at 9:26 AM Xinliang David Li <[hidden email]> wrote:
The design objective is to make ThinLTO mostly transparent to binutils tools to enable easy integration with any build system in the wild. 'Pass-through' mode with 'ld -r' instead of the partial LTO mode is another reason.

David

On Thu, May 14, 2015 at 7:30 AM, Teresa Johnson <[hidden email]> wrote:
On Thu, May 14, 2015 at 7:22 AM, Eric Christopher <[hidden email]> wrote:
> So, what Alex is saying is that we have these tools as well and they
> understand bitcode just fine, as well as every object format - not just ELF.
> :)

Right, there are also LLVM specific versions (llvm-ar, llvm-nm) that
handle bitcode similarly to the way the standard tool + plugin does.
But the goal we are trying to achieve is to allow the standard system
versions of the tools to handle these files without requiring a
plugin. I know the LLVM tool handles other object formats, but I'm not
sure how that helps here? We're not planning to replace those tools,
just allow the standard system versions to handle the intermediate
objects produced by ThinLTO.

Thanks,
Teresa

>
> -eric
>
>
> On Thu, May 14, 2015, 6:55 AM Teresa Johnson <[hidden email]> wrote:
>>
>> On Wed, May 13, 2015 at 11:23 PM, Xinliang David Li
>> <[hidden email]> wrote:
>> >
>> >
>> > On Wed, May 13, 2015 at 10:46 PM, Alex Rosenberg <[hidden email]>
>> > wrote:
>> >>
>> >> "ELF-wrapped bitcode" seems potentially controversial to me.
>> >>
>> >> What about ar, nm, and various ld implementations adds this
>> >> requirement?
>> >> What about the LLVM implementations of these tools is lacking?
>> >
>> >
>> > Sorry, I cannot parse your questions properly. Can you make them clearer?
>>
>> Alex is asking what the issue is with ar, nm, ld -r and regular
>> bitcode that makes using elf-wrapped bitcode easier.
>>
>> The issue is that generally you need to provide a plugin to these
>> tools in order for them to understand and handle bitcode files. We'd
>> like standard tools to work without requiring a plugin as much as
>> possible. And in some cases we want them to be handled differently from
>> the way bitcode files are handled with the plugin.
>>
>> nm: Without a plugin, normal bitcode files are inscrutable. When
>> provided the gold plugin, it can emit the symbols.
>>
>> ar: Without a plugin, it will create an archive of bitcode files, but
>> without an index, so it can't be handled by the linker even with a
>> plugin on an -flto link. When ar is provided the gold plugin, it does
>> create an index, so the linker + gold plugin handle it appropriately
>> on an -flto link.
>>
>> ld -r: Without a plugin, it fails when provided bitcode inputs. When
>> provided the gold plugin, it handles them but compiles them all the
>> way through to ELF executable instructions via a partial LTO link.
>> This is where we would like to differ in behavior (while also not
>> requiring a plugin) with ELF-wrapped bitcode: we would like the ld -r
>> output file to still contain ELF-wrapped bitcode, delaying the LTO
>> until the full link step.
>>
>> Let me know if that helps address your concerns.
>>
>> Thanks,
>> Teresa
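The tool behavior described above comes down largely to how an input is recognized from its leading bytes. The following is a minimal C sketch, assuming the raw bitcode magic `'B' 'C' 0xC0 0xDE`, the LLVM bitcode wrapper magic `0x0B17C0DE` (stored little-endian), and a GNU/SysV-style archive whose symbol index is a first member named `/`. The names `classify_input`, `archive_has_symbol_index`, and `input_kind` are hypothetical and are not LLVM or binutils APIs; real tools perform far more thorough format checks.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical input kinds, detected purely from leading magic bytes. */
typedef enum {
    INPUT_UNKNOWN,
    INPUT_ELF,             /* 0x7F 'E' 'L' 'F' (includes ELF-wrapped bitcode) */
    INPUT_RAW_BITCODE,     /* 'B' 'C' 0xC0 0xDE */
    INPUT_WRAPPED_BITCODE, /* LLVM wrapper magic 0x0B17C0DE, little-endian */
    INPUT_AR_ARCHIVE       /* "!<arch>\n" */
} input_kind;

/* Classify a buffer by its first bytes only (illustrative, not exhaustive). */
static input_kind classify_input(const unsigned char *buf, size_t len) {
    static const unsigned char elf_magic[4] = {0x7F, 'E', 'L', 'F'};
    static const unsigned char bc_magic[4] = {'B', 'C', 0xC0, 0xDE};
    static const unsigned char wrapper_magic[4] = {0xDE, 0xC0, 0x17, 0x0B};
    static const char ar_magic[] = "!<arch>\n"; /* 8 bytes plus NUL */

    assert(buf != NULL || len == 0);
    if (len >= 8 && memcmp(buf, ar_magic, 8) == 0)
        return INPUT_AR_ARCHIVE;
    if (len < 4)
        return INPUT_UNKNOWN;
    if (memcmp(buf, elf_magic, 4) == 0)
        return INPUT_ELF;
    if (memcmp(buf, bc_magic, 4) == 0)
        return INPUT_RAW_BITCODE;
    if (memcmp(buf, wrapper_magic, 4) == 0)
        return INPUT_WRAPPED_BITCODE;
    return INPUT_UNKNOWN;
}

/* Return 1 if a GNU/SysV-style archive starts with a symbol index member.
   Assumed layout: 8-byte global header, then a 60-byte member header whose
   first 16 bytes hold the space-padded member name; the index is named "/".
   BSD "__.SYMDEF" and 64-bit "/SYM64/" indexes are not handled here. */
static int archive_has_symbol_index(const unsigned char *buf, size_t len) {
    if (classify_input(buf, len) != INPUT_AR_ARCHIVE || len < 8 + 60)
        return 0;
    const unsigned char *name = buf + 8;
    if (name[0] != '/')
        return 0;
    for (size_t i = 1; i < 16; i++) {
        if (name[i] != ' ')
            return 0;
    }
    return 1;
}
```

Under these assumptions, an ELF-wrapped bitcode object is reported as `INPUT_ELF`, the same as any ordinary object file, while raw bitcode falls into a category that a plugin-less ELF tool would not understand; likewise, an archive built without an index lacks the leading `/` member that a linker consults to find which member defines a symbol.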
>>
>> >
>> > David
>> >
>> >>
>> >>
>> >> Alex
>> >>
>> >> > On May 13, 2015, at 7:44 PM, Teresa Johnson <[hidden email]>
>> >> > wrote:

Re: RFC: ThinLTO Implementation Plan

Daniel Berlin
On Thu, May 14, 2015 at 11:14 AM, Eric Christopher <[hidden email]> wrote:
> I'm not sure this is a particularly great assumption to make.

Which part?

>  We have to
> support a lot of different build systems and tools and concentrating on
> something that just binutils uses isn't particularly friendly here.
I think you may have misunderstood.
His point was exactly that they want to be transparent to *all of* these tools.
You are saying "we should be friendly to everyone". He is saying the same thing.
We should be friendly to everyone. The friendly way to do this is to
not require that all of these tools build plugins to handle bitcode.

Hence, elf-wrapped bitcode.


> I also
> can't imagine how it's necessary for any of the lto aspects as currently
> written in the proposal.
>
> -eric
>
> On Thu, May 14, 2015 at 9:26 AM Xinliang David Li <[hidden email]>
> wrote:
>>
>> The design objective is to make ThinLTO mostly transparent to binutils
>> tools to enable easy integration with any build system in the wild.
>> 'Pass-through' mode with 'ld -r' instead of the partial LTO mode is another
>> reason.
>>
>> David
>>
>> On Thu, May 14, 2015 at 7:30 AM, Teresa Johnson <[hidden email]>
>> wrote:
>>>
>>> On Thu, May 14, 2015 at 7:22 AM, Eric Christopher <[hidden email]>
>>> wrote:
>>> > So, what Alex is saying is that we have these tools as well and they
>>> > understand bitcode just fine, as well as every object format - not just
>>> > ELF.
>>> > :)
>>>
>>> Right, there are also LLVM specific versions (llvm-ar, llvm-nm) that
>>> handle bitcode similarly to the way the standard tool + plugin does.
>>> But the goal we are trying to achieve is to allow the standard system
>>> versions of the tools to handle these files without requiring a
>>> plugin. I know the LLVM tool handles other object formats, but I'm not
>>> sure how that helps here? We're not planning to replace those tools,
>>> just allow the standard system versions to handle the intermediate
>>> objects produced by ThinLTO.
>>>
>>> Thanks,
>>> Teresa
>>>
>>> >
>>> > -eric
>>> >
>>> >
>>> > On Thu, May 14, 2015, 6:55 AM Teresa Johnson <[hidden email]>
>>> > wrote:
>>> >>
>>> >> On Wed, May 13, 2015 at 11:23 PM, Xinliang David Li
>>> >> <[hidden email]> wrote:
>>> >> >
>>> >> >
>>> >> > On Wed, May 13, 2015 at 10:46 PM, Alex Rosenberg
>>> >> > <[hidden email]>
>>> >> > wrote:
>>> >> >>
>>> >> >> "ELF-wrapped bitcode" seems potentially controversial to me.
>>> >> >>
>>> >> >> What about ar, nm, and various ld implementations adds this
>>> >> >> requirement?
>>> >> >> What about the LLVM implementations of these tools is lacking?
>>> >> >
>>> >> >
>>> >> > Sorry I can not parse your questions properly. Can you make it
>>> >> > clearer?
>>> >>
>>> >> Alex is asking what the issue is with ar, nm, ld -r and regular
>>> >> bitcode that makes using elf-wrapped bitcode easier.
>>> >>
>>> >> The issue is that generally you need to provide a plugin to these
>>> >> tools in order for them to understand and handle bitcode files. We'd
>>> >> like standard tools to work without requiring a plugin as much as
>>> >> possible. And in some cases we want them to be handled different than
>>> >> the way bitcode files are handled with the plugin.
>>> >>
>>> >> nm: Without a plugin, normal bitcode files are inscrutable. When
>>> >> provided the gold plugin it can emit the symbols.
>>> >>
>>> >> ar: Without a plugin, it will create an archive of bitcode files, but
>>> >> without an index, so it can't be handled by the linker even with a
>>> >> plugin on an -flto link. When ar is provided the gold plugin it does
>>> >> create an index, so the linker + gold plugin handle it appropriately
>>> >> on an -flto link.
>>> >>
>>> >> ld -r: Without a plugin, fails when provided bitcode inputs. When
>>> >> provided the gold plugin, it handles them but compiles them all the
>>> >> way through to ELF executable instructions via a partial LTO link.
>>> >> This is where we would like to differ in behavior (while also not
>>> >> requiring a plugin) with ELF-wrapped bitcode: we would like the ld -r
>>> >> output file to still contain ELF-wrapped bitcode, delaying the LTO
>>> >> until the full link step.
>>> >>
>>> >> Let me know if that helps address your concerns.
>>> >>
>>> >> Thanks,
>>> >> Teresa
>>> >>
>>> >> >
>>> >> > David
>>> >> >
>>> >> >>
>>> >> >>
>>> >> >> Alex
>>> >> >>
>>> >> >
>>> >> >
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Teresa Johnson | Software Engineer | [hidden email] |
>>> >> 408-460-2413
>>> >>
>>>
>>>
>>>
>>> --
>>> Teresa Johnson | Software Engineer | [hidden email] | 408-460-2413
>>
>>
>
>

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
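The nm/ar/ld -r behavior described above comes down to how a tool decides what kind of file it has been handed. As a rough sketch, a plugin-free ELF tool recognizes files that begin with the ELF magic bytes. Raw LLVM bitcode begins with a different magic ('B', 'C', 0xC0, 0xDE), so such a tool rejects it outright. The helper names below are hypothetical, and the assumption that a plugin-free tool accepts only ELF input is a simplification. Real binutils go through far more machinery than this. The point is only that the container format, not the bitcode inside it, is what these tools recognize.

```python
# Hypothetical illustration: plugin-free ELF tools dispatch on the file's
# leading magic bytes, so they reject raw bitcode but accept an ELF wrapper.
ELF_MAGIC = b"\x7fELF"              # every ELF file starts with 0x7F 'E' 'L' 'F'
RAW_BITCODE_MAGIC = b"BC\xc0\xde"   # raw LLVM bitcode starts with 'B' 'C' 0xC0 0xDE


def classify(data: bytes) -> str:
    """Return a coarse file kind based only on the leading magic bytes."""
    if data.startswith(ELF_MAGIC):
        return "elf"
    if data.startswith(RAW_BITCODE_MAGIC):
        return "raw-bitcode"
    return "unknown"


def plain_elf_tool_accepts(data: bytes) -> bool:
    """Assumption: a tool with no LTO plugin only understands ELF input."""
    return classify(data) == "elf"


def classify_file(path: str) -> str:
    """Read just enough of a file on disk to classify it."""
    with open(path, "rb") as f:
        return classify(f.read(4))


if __name__ == "__main__":
    raw = b"BC\xc0\xde" + b"\x00" * 8                # stand-in for a raw bitcode file
    wrapped = b"\x7fELF\x02\x01\x01" + b"\x00" * 9   # stand-in ELF header prefix
    for name, blob in [("raw", raw), ("wrapped", wrapped)]:
        print(name, classify(blob), plain_elf_tool_accepts(blob))
    # prints:
    # raw raw-bitcode False
    # wrapped elf True
```

Recognizing the container only gets a tool to the point where it can parse the file. For nm to list symbols, or for ar to build an index, without a plugin, the wrapper still needs the symbol table that the RFC proposes to emit alongside the bitcode.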

Re: RFC: ThinLTO Impementation Plan

Xinliang David Li-2
That is exactly the point.

thanks,

David
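This sketch illustrates why ELF-wrapped bitcode is transparent to ELF-aware tools. Under the RFC, the phase-1 bitcode would sit in an ordinary ELF section named .llvmbc, so any tool that can walk the section header table can locate it without understanding bitcode at all. The sketch works only under stated assumptions: 64-bit little-endian ELF, no symbol table, and no handling of truncated or malformed files. `find_sections` and the test helper `build_minimal_elf` are hypothetical names, and the fake object built here is far simpler than anything clang would actually emit.

```python
import struct

# Elf64_Ehdr: e_ident[16], e_type, e_machine, e_version, e_entry, e_phoff,
# e_shoff, e_flags, e_ehsize, e_phentsize, e_phnum, e_shentsize, e_shnum, e_shstrndx
EHDR_FMT = "<16sHHIQQQIHHHHHH"
# Elf64_Shdr: sh_name, sh_type, sh_flags, sh_addr, sh_offset, sh_size,
# sh_link, sh_info, sh_addralign, sh_entsize
SHDR_FMT = "<IIQQQQIIQQ"


def find_sections(elf: bytes, wanted: str) -> list[bytes]:
    """Return the contents of every section named `wanted` (ELF64 LE only)."""
    if elf[:4] != b"\x7fELF" or elf[4] != 2 or elf[5] != 1:
        raise ValueError("expected a 64-bit little-endian ELF file")
    ehdr = struct.unpack_from(EHDR_FMT, elf, 0)
    shoff, shentsize, shnum, shstrndx = ehdr[6], ehdr[11], ehdr[12], ehdr[13]

    def shdr(i: int) -> tuple:
        return struct.unpack_from(SHDR_FMT, elf, shoff + i * shentsize)

    # Section names live in the section-header string table (.shstrtab).
    strtab = shdr(shstrndx)
    names = elf[strtab[4]:strtab[4] + strtab[5]]

    found = []
    for i in range(shnum):
        name_off, _, _, _, offset, size = shdr(i)[:6]
        name = names[name_off:names.index(b"\0", name_off)].decode()
        if name == wanted:
            found.append(elf[offset:offset + size])
    return found


def build_minimal_elf(payload: bytes) -> bytes:
    """Hypothetical test helper: an ELF64 relocatable holding only .llvmbc."""
    shstrtab = b"\0.llvmbc\0.shstrtab\0"   # ".llvmbc" at offset 1, ".shstrtab" at 9
    payload_off = 64                        # section data starts right after the header
    strtab_off = payload_off + len(payload)
    shoff = (strtab_off + len(shstrtab) + 7) & ~7   # 8-byte align the section headers
    ident = b"\x7fELF\x02\x01\x01" + b"\0" * 9     # ELFCLASS64, little-endian, version 1
    ehdr = struct.pack(EHDR_FMT, ident, 1, 62, 1, 0, 0, shoff, 0,
                       64, 0, 0, 64, 3, 2)  # ET_REL, EM_X86_64, 3 sections, shstrndx=2
    shdrs = b"".join([
        struct.pack(SHDR_FMT, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),  # SHT_NULL entry
        struct.pack(SHDR_FMT, 1, 1, 0, 0, payload_off, len(payload), 0, 0, 1, 0),  # .llvmbc (PROGBITS)
        struct.pack(SHDR_FMT, 9, 3, 0, 0, strtab_off, len(shstrtab), 0, 0, 1, 0),  # .shstrtab (STRTAB)
    ])
    body = ehdr + payload + shstrtab
    return body + b"\0" * (shoff - len(body)) + shdrs


if __name__ == "__main__":
    fake_bitcode = b"BC\xc0\xde" + b"\x00" * 12   # placeholder bytes, not real bitcode
    obj = build_minimal_elf(fake_bitcode)
    print(find_sections(obj, ".llvmbc") == [fake_bitcode])  # True
```

A real phase-1 object would also carry the symbol table the RFC describes, so that nm and ar can see the function symbols. The RFC also notes that "$LD -r" produces a concatenated .llvmbc section holding several original modules. This sketch does neither: it only returns raw section contents and does not attempt to split them back into individual modules.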


On Thu, May 14, 2015 at 11:34 AM, Daniel Berlin <[hidden email]> wrote:
On Thu, May 14, 2015 at 11:14 AM, Eric Christopher <[hidden email]> wrote:
> I'm not sure this is a particularly great assumption to make.

Which part?

>  We have to
> support a lot of different build systems and tools and concentrating on
> something that just binutils uses isn't particularly friendly here.
I think you may have misunderstood.
His point was exactly that they want to be transparent to *all of* these tools.
You are saying "we should be friendly to everyone". He is saying the same thing.
We should be friendly to everyone. The friendly way to do this is to
not require each of these tools to have a plugin to handle bitcode.

Hence, elf-wrapped bitcode.


> I also
> can't imagine how it's necessary for any of the LTO aspects as currently
> written in the proposal.
>
> -eric
>
> On Thu, May 14, 2015 at 9:26 AM Xinliang David Li <[hidden email]>
> wrote:
>>
>> The design objective is to make ThinLTO mostly transparent to binutils
>> tools to enable easy integration with any build system in the wild.
>> Supporting a 'pass-through' mode for 'ld -r', instead of the partial LTO
>> mode, is another reason.
>>
>> David
>>
>> On Thu, May 14, 2015 at 7:30 AM, Teresa Johnson <[hidden email]>
>> wrote:
>>>
>>> On Thu, May 14, 2015 at 7:22 AM, Eric Christopher <[hidden email]>
>>> wrote:
>>> > So, what Alex is saying is that we have these tools as well and they
>>> > understand bitcode just fine, as well as every object format - not just
>>> > ELF.
>>> > :)
>>>
>>> Right, there are also LLVM specific versions (llvm-ar, llvm-nm) that
>>> handle bitcode similarly to the way the standard tool + plugin does.
>>> But the goal we are trying to achieve is to allow the standard system
>>> versions of the tools to handle these files without requiring a
>>> plugin. I know the LLVM tools handle other object formats, but I'm not
>>> sure how that helps here. We're not planning to replace those tools,
>>> just allow the standard system versions to handle the intermediate
>>> objects produced by ThinLTO.
>>>
>>> Thanks,
>>> Teresa
>>>
>>> >
>>> > -eric
>>> >
>>> >
>>> > On Thu, May 14, 2015, 6:55 AM Teresa Johnson <[hidden email]>
>>> > wrote:
>>> >>
>>> >> On Wed, May 13, 2015 at 11:23 PM, Xinliang David Li
>>> >> <[hidden email]> wrote:
>>> >> >
>>> >> >
>>> >> > On Wed, May 13, 2015 at 10:46 PM, Alex Rosenberg
>>> >> > <[hidden email]>
>>> >> > wrote:
>>> >> >>
>>> >> >> "ELF-wrapped bitcode" seems potentially controversial to me.
>>> >> >>
>>> >> >> What about ar, nm, and various ld implementations adds this
>>> >> >> requirement?
>>> >> >> What about the LLVM implementations of these tools is lacking?
>>> >> >
>>> >> >
>>> >> > Sorry, I cannot parse your questions properly. Can you make them
>>> >> > clearer?
>>> >>
>>> >> Alex is asking what the issue is with ar, nm, ld -r and regular
>>> >> bitcode that makes using elf-wrapped bitcode easier.
>>> >>
>>> >> The issue is that generally you need to provide a plugin to these
>>> >> tools in order for them to understand and handle bitcode files. We'd
>>> >> like standard tools to work without requiring a plugin as much as
>>> >> possible. And in some cases we want these files to be handled
>>> >> differently from the way bitcode files are handled with the plugin.
>>> >>
>>> >> nm: Without a plugin, normal bitcode files are inscrutable. When
>>> >> provided the gold plugin, it can emit the symbols.
>>> >>
>>> >> ar: Without a plugin, it will create an archive of bitcode files, but
>>> >> without an index, so it can't be handled by the linker even with a
>>> >> plugin on an -flto link. When ar is provided the gold plugin, it does
>>> >> create an index, so the linker + gold plugin handle it appropriately
>>> >> on an -flto link.
>>> >>
>>> >> ld -r: Without a plugin, it fails when provided bitcode inputs. When
>>> >> provided the gold plugin, it handles them but compiles them all the
>>> >> way through to native ELF object code via a partial LTO link.
>>> >> This is where we would like to differ in behavior (while also not
>>> >> requiring a plugin) with ELF-wrapped bitcode: we would like the ld -r
>>> >> output file to still contain ELF-wrapped bitcode, delaying the LTO
>>> >> until the full link step.
>>> >>
>>> >> Let me know if that helps address your concerns.
>>> >>
>>> >> Thanks,
>>> >> Teresa
>>> >>
>>> >> >
>>> >> > David
>>> >> >
>>> >> >>
>>> >> >>
>>> >> >> Alex
>>> >> >>
>>> >> >> > On May 13, 2015, at 7:44 PM, Teresa Johnson
>>> >> >> > <[hidden email]>
>>> >> >> > wrote:
>>> >> >> >
>>> >> >> > I've included below an RFC for implementing ThinLTO in LLVM,
>>> >> >> > looking
>>> >> >> > forward to feedback and questions.
>>> >> >> > Thanks!
>>> >> >> > Teresa
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > RFC to discuss plans for implementing ThinLTO upstream.
>>> >> >> > Background
>>> >> >> > can
>>> >> >> > be found in slides from EuroLLVM 2015:
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > https://drive.google.com/open?id=0B036uwnWM6RWWER1ZEl5SUNENjQ&authuser=0)
>>> >> >> > As described in the talk, we have a prototype implementation, and
>>> >> >> > would like to start staging patches upstream. This RFC describes
>>> >> >> > a
>>> >> >> > breakdown of the major pieces. We would like to commit upstream
>>> >> >> > gradually in several stages, with all functionality off by
>>> >> >> > default.
>>> >> >> > The core ThinLTO importing support and tuning will require
>>> >> >> > frequent
>>> >> >> > change and iteration during testing and tuning, and for that part
>>> >> >> > we
>>> >> >> > would like to commit rapidly (off by default). See the proposed
>>> >> >> > staged
>>> >> >> > implementation described in the Implementation Plan section.
>>> >> >> >
>>> >> >> >
>>> >> >> > ThinLTO Overview
>>> >> >> > ==============
>>> >> >> >
>>> >> >> > See the talk slides linked above for more details. The following
>>> >> >> > is a
>>> >> >> > high-level overview of the motivation.
>>> >> >> >
>>> >> >> > Cross Module Optimization (CMO) is an effective means for
>>> >> >> > improving
>>> >> >> > runtime performance, by extending the scope of optimizations
>>> >> >> > across
>>> >> >> > source module boundaries. Without CMO, the compiler is limited to
>>> >> >> > optimizing within the scope of single source modules. Two
>>> >> >> > solutions
>>> >> >> > for enabling CMO are Link-Time Optimization (LTO), which is
>>> >> >> > currently
>>> >> >> > supported in LLVM and GCC, and Lightweight-Interprocedural
>>> >> >> > Optimization (LIPO). However, each of these solutions has
>>> >> >> > limitations
>>> >> >> > that prevent it from being enabled by default. ThinLTO is a new
>>> >> >> > approach that attempts to address these limitations, with a goal
>>> >> >> > of
>>> >> >> > being enabled more broadly. ThinLTO is designed with many of the
>>> >> >> > same
>>> >> >> > principals as LIPO, and therefore its advantages, without any of
>>> >> >> > its
>>> >> >> > inherent weakness. Unlike in LIPO where the module group decision
>>> >> >> > is
>>> >> >> > made at profile training runtime, ThinLTO makes the decision at
>>> >> >> > compile time, but in a lazy mode that facilitates large scale
>>> >> >> > parallelism. The serial linker plugin phase is designed to be
>>> >> >> > razor
>>> >> >> > thin and blazingly fast. By default this step only does minimal
>>> >> >> > preparation work to enable the parallel lazy importing performed
>>> >> >> > later. ThinLTO aims to be scalable like a regular O2 build,
>>> >> >> > enabling
>>> >> >> > CMO on machines without large memory configurations, while also
>>> >> >> > integrating well with distributed build systems. Results from
>>> >> >> > early
>>> >> >> > prototyping on SPEC cpu2006 C++ benchmarks are in line with
>>> >> >> > expectations that ThinLTO can scale like O2 while enabling much
>>> >> >> > of
>>> >> >> > the
>>> >> >> > CMO performed during a full LTO build.
>>> >> >> >
>>> >> >> >
>>> >> >> > A ThinLTO build is divided into 3 phases, which are referred to
>>> >> >> > in
>>> >> >> > the
>>> >> >> > following implementation plan:
>>> >> >> >
>>> >> >> > phase-1: IR and Function Summary Generation (-c compile)
>>> >> >> > phase-2: Thin Linker Plugin Layer (thin archive linker step)
>>> >> >> > phase-3: Parallel Backend with Demand-Driven Importing
>>> >> >> >
>>> >> >> >
>>> >> >> > Implementation Plan
>>> >> >> > ================
>>> >> >> >
>>> >> >> > This section gives a high-level breakdown of the ThinLTO support
>>> >> >> > that
>>> >> >> > will be added, in roughly the order that the patches would be
>>> >> >> > staged.
>>> >> >> > The patches are divided into three stages. The first stage
>>> >> >> > contains a
>>> >> >> > minimal amount of preparation work that is not ThinLTO-specific.
>>> >> >> > The
>>> >> >> > second stage contains most of the infrastructure for ThinLTO,
>>> >> >> > which
>>> >> >> > will be off by default. The third stage includes
>>> >> >> > enhancements/improvements/tunings that can be performed after the
>>> >> >> > main
>>> >> >> > ThinLTO infrastructure is in.
>>> >> >> >
>>> >> >> > The second and third implementation stages will initially be very
>>> >> >> > volatile, requiring a lot of iterations and tuning with large
>>> >> >> > apps to
>>> >> >> > get stabilized. Therefore it will be important to do fast commits
>>> >> >> > for
>>> >> >> > these implementation stages.
>>> >> >> >
>>> >> >> >
>>> >> >> > 1. Stage 1: Preparation
>>> >> >> > -------------------------------
>>> >> >> >
>>> >> >> > The first planned sets of patches are enablers for ThinLTO work:
>>> >> >> >
>>> >> >> >
>>> >> >> > a. LTO directory structure:
>>> >> >> >
>>> >> >> > Restructure the LTO directory to remove circular dependence when
>>> >> >> > ThinLTO pass added. Because ThinLTO is being implemented as a SCC
>>> >> >> > pass
>>> >> >> > within Transforms/IPO, and leverages the LTOModule class for
>>> >> >> > linking
>>> >> >> > in functions from modules, IPO then requires the LTO library.
>>> >> >> > This
>>> >> >> > creates a circular dependence between LTO and IPO. To break that,
>>> >> >> > we
>>> >> >> > need to split the lib/LTO directory/library into lib/LTO/CodeGen
>>> >> >> > and
>>> >> >> > lib/LTO/Module, containing LTOCodeGenerator and LTOModule,
>>> >> >> > respectively. Only LTOCodeGenerator has a dependence on IPO,
>>> >> >> > removing
>>> >> >> > the circular dependence.
>>> >> >> >
>>> >> >> >
>>> >> >> > b. ELF wrapper generation support:
>>> >> >> >
>>> >> >> > Implement ELF wrapped bitcode writer. In order to more easily
>>> >> >> > interact
>>> >> >> > with tools such as $AR, $NM, and “$LD -r” we plan to emit the
>>> >> >> > phase-1
>>> >> >> > bitcode wrapped in ELF via the .llvmbc section, along with a
>>> >> >> > symbol
>>> >> >> > table. The goal is both to interact with these tools without
>>> >> >> > requiring
>>> >> >> > a plugin, and also to avoid doing partial LTO/ThinLTO across
>>> >> >> > files
>>> >> >> > linked with “$LD -r” (i.e. the resulting object file should still
>>> >> >> > contain ELF-wrapped bitcode to enable ThinLTO at the full link
>>> >> >> > step).
>>> >> >> > I will send a separate design document for these changes, but the
>>> >> >> > following is a high-level overview.
>>> >> >> >
>>> >> >> > Support was added to LLVM for reading ELF-wrapped bitcode
>>> >> >> > (http://reviews.llvm.org/rL218078), but there does not yet exist
>>> >> >> > support in LLVM/Clang for emitting bitcode wrapped in ELF. I plan
>>> >> >> > to
>>> >> >> > add support for optionally generating bitcode in an ELF file
>>> >> >> > containing a single .llvmbc section holding the bitcode.
>>> >> >> > Specifically,
>>> >> >> > the patch would add new options “emit-llvm-bc-elf” (object file)
>>> >> >> > and
>>> >> >> > corresponding “emit-llvm-elf” (textual assembly code equivalent).
>>> >> >> > Eventually these would be automatically triggered under
>>> >> >> > “-fthinlto
>>> >> >> > -c”
>>> >> >> > and “-fthinlto -S”, respectively.
>>> >> >> >
>>> >> >> > Additionally, a symbol table will be generated in the ELF file,
>>> >> >> > holding the function symbols within the bitcode. This facilitates
>>> >> >> > handling archives of the ELF-wrapped bitcode created with $AR,
>>> >> >> > since
>>> >> >> > the archive will have a symbol table as well. The archive symbol
>>> >> >> > table
>>> >> >> > enables gold to extract and pass to the plugin the constituent
>>> >> >> > ELF-wrapped bitcode files. To support the concatenated llvmbc
>>> >> >> > section
>>> >> >> > generated by “$LD -r”, some handling needs to be added to gold
>>> >> >> > and to
>>> >> >> > the backend driver to process each original module’s bitcode.
>>> >> >> >
>>> >> >> > The function index/summary will later be added as a special ELF
>>> >> >> > section alongside the .llvmbc sections.
>>> >> >> >
>>> >> >> >
>>> >> >> > 2. Stage 2: ThinLTO Infrastructure
>>> >> >> > ----------------------------------------------
>>> >> >> >
>>> >> >> > The next set of patches adds the base implementation of the
>>> >> >> > ThinLTO
>>> >> >> > infrastructure, specifically those required to make ThinLTO
>>> >> >> > functional
>>> >> >> > and generate correct but not necessarily high-performing
>>> >> >> > binaries. It
>>> >> >> > also does not include support to make debug support under -g
>>> >> >> > efficient
>>> >> >> > with ThinLTO.
>>> >> >> >
>>> >> >> >
>>> >> >> > a. Clang/LLVM/gold linker options:
>>> >> >> >
>>> >> >> > An early set of clang/llvm patches is needed to provide options to
>>> >> >> > enable ThinLTO (off by default), so that the rest of the
>>> >> >> > implementation can be disabled by default as it is added.
>>> >> >> > Specifically, the clang option -fthinlto (used instead of -flto) will
>>> >> >> > cause clang to invoke the phase-1 emission of LLVM bitcode and
>>> >> >> > function summary/index on a compile step, and pass the appropriate
>>> >> >> > option to the gold plugin on a link step. The -thinlto option will be
>>> >> >> > added to the gold plugin and llvm-lto tool to launch the phase-2 thin
>>> >> >> > archive step. The -thinlto option will also be added to the ‘opt’
>>> >> >> > tool to invoke it as a phase-3 parallel backend instance.
>>> >> >> >
>>> >> >> >
>>> >> >> > b. Thin-archive linking support in Gold plugin and llvm-lto:
>>> >> >> >
>>> >> >> > Under the new plugin option (see above), the plugin needs to perform
>>> >> >> > the phase-2 (thin archive) link, which simply emits a combined
>>> >> >> > function map from the linked modules without actually performing
>>> >> >> > the normal link. Corresponding support should be added to the
>>> >> >> > standalone llvm-lto tool to enable testing/debugging without
>>> >> >> > involving the linker and plugin.
>>> >> >> >
>>> >> >> >
>>> >> >> > c. ThinLTO backend support:
>>> >> >> >
>>> >> >> > Support for a phase-3 backend invocation (including importing) on a
>>> >> >> > module should be added to the ‘opt’ tool under the new option. The
>>> >> >> > main changes under the option are to instantiate a Linker object
>>> >> >> > used to manage the process of linking imported functions into the
>>> >> >> > module, to efficiently read the combined function map, and to
>>> >> >> > enable the ThinLTO import pass.
>>> >> >> >
>>> >> >> >
>>> >> >> > d. Function index/summary support:
>>> >> >> >
>>> >> >> > This includes infrastructure for writing and reading the function
>>> >> >> > index/summary section. As noted earlier, this will be encoded in a
>>> >> >> > special ELF section within the module, alongside the .llvmbc section
>>> >> >> > containing the bitcode. The thin archive generated by phase-2 of
>>> >> >> > ThinLTO simply contains all of the function index/summary sections
>>> >> >> > across the linked modules, organized for efficient function lookup.
>>> >> >> >
>>> >> >> > Each function available for importing from the module has an entry
>>> >> >> > in the module’s function index/summary section and in the resulting
>>> >> >> > combined function map. Each function entry contains that function’s
>>> >> >> > offset within the bitcode file, used to efficiently locate and
>>> >> >> > import just that function. The entry also contains summary
>>> >> >> > information (e.g. basic information determined during parsing, such
>>> >> >> > as the number of instructions in the function) that will be used to
>>> >> >> > help guide later import decisions. Because the contents of this
>>> >> >> > section will change frequently during ThinLTO tuning, it should also
>>> >> >> > be marked with a version id for backwards compatibility or version
>>> >> >> > checking.
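As a rough illustration of the index/summary described above, here is a minimal C++ sketch that assumes a simplified in-memory layout: each entry holds a function name, its bitcode offset, and an instruction count, and the phase-2 step merges the per-module summaries into one combined map keyed by function name, rejecting any summary whose version id it does not recognize. All type names, kSummaryVersion, and buildCombinedMap are hypothetical and are not an LLVM API; the actual encoding inside the special ELF section is not modeled.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical summary format version; bumped whenever the entry layout
// changes during tuning so that stale summaries are rejected.
constexpr uint32_t kSummaryVersion = 1;

// One entry per function that is available for importing.
struct FunctionSummaryEntry {
  std::string Name;        // function symbol name
  uint64_t BitcodeOffset;  // where the function body starts in the bitcode
  uint32_t InstCount;      // instruction count recorded during parsing
};

// Contents of one module's index/summary section (in-memory view only).
struct ModuleSummary {
  uint32_t Version;
  std::string ModulePath;  // IR file the entries describe
  std::vector<FunctionSummaryEntry> Entries;
};

// What a combined-map lookup returns: which file to read, and where.
struct CombinedEntry {
  std::string ModulePath;
  FunctionSummaryEntry Summary;
};

using CombinedFunctionMap = std::unordered_map<std::string, CombinedEntry>;

// Phase-2 sketch: merge all per-module summaries into one map keyed by
// function name. Returns std::nullopt if any summary has an unexpected
// version, rather than misreading it. If a name appears in several
// modules, the first entry seen is kept.
std::optional<CombinedFunctionMap>
buildCombinedMap(const std::vector<ModuleSummary> &Modules) {
  CombinedFunctionMap Map;
  for (const ModuleSummary &M : Modules) {
    if (M.Version != kSummaryVersion)
      return std::nullopt;
    for (const FunctionSummaryEntry &E : M.Entries)
      Map.emplace(E.Name, CombinedEntry{M.ModulePath, E});  // no overwrite
  }
  return Map;
}
```

In this sketch a phase-3 backend would look up a callee by name, open the recorded ModulePath, and use BitcodeOffset to have a lazy bitcode reader materialize only that function.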
>>> >> >> >
>>> >> >> >
>>> >> >> > e. ThinLTO importing support:
>>> >> >> >
>>> >> >> > Support for the mechanics of importing functions from other modules,
>>> >> >> > which can go in gradually as a set of patches since it will be off
>>> >> >> > by default. Separate patches can include:
>>> >> >> >
>>> >> >> > - BitcodeReader changes to use the function index to
>>> >> >> > import/deserialize a single function of interest (small changes,
>>> >> >> > leverages existing lazy streamer support).
>>> >> >> >
>>> >> >> > - Minor LTOModule changes to pass the ThinLTO function to import and
>>> >> >> > its index into the bitcode reader.
>>> >> >> >
>>> >> >> > - Marking of imported functions (for use in ThinLTO-specific symbol
>>> >> >> > linking and global DCE, for example). This can be in-memory
>>> >> >> > initially, but IR support may be required in order to support
>>> >> >> > streaming bitcode out and back in again after importing.
>>> >> >> >
>>> >> >> > - ModuleLinker changes to do ThinLTO-specific symbol linking and
>>> >> >> > static promotion when necessary. The linkage type of imported
>>> >> >> > functions changes to AvailableExternallyLinkage, for example.
>>> >> >> > Statics must be promoted in certain cases, and renamed in
>>> >> >> > consistent ways.
>>> >> >> >
>>> >> >> > - GlobalDCE changes to support removing imported functions that
>>> >> >> > were not inlined (very small changes to existing pass logic).
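The requirement that promoted statics be "renamed in consistent ways" can be illustrated with a small C++ sketch. This is not the ModuleLinker change itself: promotedName and the ".llvm.<hash>" suffix format are hypothetical, and FNV-1a is used only because it is a simple hash whose result does not vary between processes. That property is the point: the backend compiling the defining module and every backend that imports from it independently compute the same global name for the same static, while same-named statics from different modules stay distinct. The linkage change to available_externally for imported copies is not modeled here.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// FNV-1a 64-bit: a stable hash, so independently running backend
// processes all derive the same value from the same module identifier.
uint64_t fnv1a64(const std::string &S) {
  uint64_t H = 14695981039346656037ULL;  // FNV-1a 64-bit offset basis
  for (unsigned char C : S) {
    H ^= C;
    H *= 1099511628211ULL;               // FNV-1a 64-bit prime
  }
  return H;
}

// Hypothetical naming scheme for a promoted static: keep the source name
// and append a suffix derived from the module that defines it.
std::string promotedName(const std::string &LocalName,
                         const std::string &DefiningModuleId) {
  char Suffix[17];  // 16 hex digits plus the terminating NUL
  std::snprintf(Suffix, sizeof(Suffix), "%016llx",
                static_cast<unsigned long long>(fnv1a64(DefiningModuleId)));
  return LocalName + ".llvm." + Suffix;
}
```

Because the name depends only on the local name and the defining module's identifier, no coordination between backend processes is needed beyond agreeing on that identifier.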
>>> >> >> >
>>> >> >> >
>>> >> >> > f. ThinLTO Import Driver SCC pass:
>>> >> >> >
>>> >> >> > Adds Transforms/IPO/ThinLTO.cpp with a framework for doing ThinLTO
>>> >> >> > via an SCC pass, enabled only under -fthinlto options. The pass is
>>> >> >> > responsible for reading the thin archive (global function
>>> >> >> > index/summary), applying the import decision heuristics, invoking
>>> >> >> > the LTOModule/ModuleLinker routines that perform the import, and
>>> >> >> > performing any necessary callgraph updates and verification.
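The import decision itself can be pictured as a worklist over the combined map. The following C++ sketch makes two assumptions beyond what the proposal states: that each summary entry also records its callees (the proposal only promises basic data such as the instruction count), and that a single instruction-count threshold is the heuristic. ImportCandidate, selectImports, and the threshold are hypothetical. The SCC pass plumbing, the actual bitcode loading, and the callgraph updates are all omitted.

```cpp
#include <cstdint>
#include <deque>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Simplified combined-map entry. The callee list is an assumption made
// for this sketch only.
struct ImportCandidate {
  std::string DefiningModule;
  uint32_t InstCount;
  std::vector<std::string> Callees;
};

using SummaryMap = std::unordered_map<std::string, ImportCandidate>;

// Demand-driven selection for one backend module: start from the module's
// calls to external functions, import callees under a size threshold, then
// consider the callees of whatever was imported.
std::unordered_set<std::string>
selectImports(const std::string &ThisModule,
              const std::vector<std::string> &ExternalCalls,
              const SummaryMap &Map, uint32_t InstThreshold) {
  std::unordered_set<std::string> Imported;
  std::deque<std::string> Worklist(ExternalCalls.begin(), ExternalCalls.end());
  while (!Worklist.empty()) {
    std::string Name = Worklist.front();
    Worklist.pop_front();
    auto It = Map.find(Name);
    if (It == Map.end())
      continue;  // not in the combined map (e.g. pruned or external)
    const ImportCandidate &C = It->second;
    if (C.DefiningModule == ThisModule)
      continue;  // already defined locally
    if (C.InstCount > InstThreshold)
      continue;  // too large to be a likely inlining candidate
    if (!Imported.insert(Name).second)
      continue;  // already selected
    for (const std::string &Callee : C.Callees)
      Worklist.push_back(Callee);
  }
  return Imported;
}
```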
>>> >> >> >
>>> >> >> >
>>> >> >> > g. Backend Driver:
>>> >> >> >
>>> >> >> > For a single node build, the gold plugin can simply write a makefile
>>> >> >> > and fork the parallel backend instances directly via parallel make.
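Here is a minimal C++ sketch of the single-node backend driver. It assumes the plugin already knows each IR file and the object file that should be produced from it. The sketch writes one make rule per module, so running make with -j executes the phase-3 backends in parallel. BackendJob and writeBackendMakefile are hypothetical. The backend command is passed in as a string because the proposal does not fix its flags beyond the planned -thinlto option to ‘opt’.

```cpp
#include <sstream>
#include <string>
#include <vector>

// One phase-3 job: the IR file to compile and the object file to produce.
struct BackendJob {
  std::string IRFile;
  std::string ObjectFile;
};

// Hypothetical makefile writer: an "all" target depending on every object,
// plus one rule per module invoking the caller-supplied backend command.
std::string writeBackendMakefile(const std::vector<BackendJob> &Jobs,
                                 const std::string &BackendCmd) {
  std::ostringstream OS;
  OS << "all:";
  for (const BackendJob &J : Jobs)
    OS << ' ' << J.ObjectFile;
  OS << "\n\n";
  for (const BackendJob &J : Jobs) {
    OS << J.ObjectFile << ": " << J.IRFile << "\n";
    // Make requires recipe lines to start with a tab character.
    OS << '\t' << BackendCmd << ' ' << J.IRFile << " -o " << J.ObjectFile
       << "\n\n";
  }
  return OS.str();
}
```

Each target/prerequisite pair also records the IR-file-to-object-file mapping, which is the same information the distributed build mode needs to hand off.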
>>> >> >> >
>>> >> >> >
>>> >> >> > 3. Stage 3: ThinLTO Tuning and Enhancements
>>> >> >> > ----------------------------------------------------------------
>>> >> >> >
>>> >> >> > This refers to the patches that are not required for ThinLTO to work
>>> >> >> > but instead improve compile time, memory usage, run-time performance,
>>> >> >> > and usability.
>>> >> >> >
>>> >> >> >
>>> >> >> > a. Lazy Debug Metadata Linking:
>>> >> >> >
>>> >> >> > The prototype implementation included lazy importing of module-level
>>> >> >> > metadata during the ThinLTO pass finalization (i.e. after all
>>> >> >> > function importing is complete). This actually applies to all
>>> >> >> > module-level metadata, not just debug metadata, although debug
>>> >> >> > metadata is the largest. This can be added as a separate set of
>>> >> >> > patches, with changes to BitcodeReader, ValueMapper, and
>>> >> >> > ModuleLinker.
>>> >> >> >
>>> >> >> >
>>> >> >> > b. Import Tuning:
>>> >> >> >
>>> >> >> > Tuning the import strategy will be an iterative process that will
>>> >> >> > continue to be refined over time. It involves several different types
>>> >> >> > of changes: adding support for recording additional metrics in the
>>> >> >> > function summary, such as profile data and optional heavier-weight
>>> >> >> > IPA analyses, and tuning the import heuristics based on the summary
>>> >> >> > and callsite context.
>>> >> >> >
>>> >> >> >
>>> >> >> > c. Combined Function Map Pruning:
>>> >> >> >
>>> >> >> > The combined function map can be pruned of functions that are
>>> >> >> > unlikely to benefit from being imported. For example, during the
>>> >> >> > phase-2 thin archive plugin step we can safely omit large and (with
>>> >> >> > profile data) cold functions, which are unlikely to benefit from
>>> >> >> > being inlined. Additionally, all but one copy of comdat functions
>>> >> >> > can be suppressed.
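The three pruning rules above can be sketched as a single filter over the phase-2 entries. This C++ sketch is illustrative only: PruneEntry, pruneCombinedMap, and both thresholds are hypothetical tuning knobs, and "cold" is approximated as "has profile data with an entry count below a cutoff". A function without profile data is never treated as cold.

```cpp
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Simplified per-module summary entry as seen by the phase-2 step.
struct PruneEntry {
  std::string Name;
  std::string Module;
  uint32_t InstCount;
  bool IsComdat;
  bool HasProfile;      // true if profile data exists for this function
  uint64_t EntryCount;  // profiled entry count (used only if HasProfile)
};

// Keeps an entry unless it is too large, is known to be cold from profile
// data, or is a later copy of a comdat function already kept.
std::vector<PruneEntry> pruneCombinedMap(const std::vector<PruneEntry> &In,
                                         uint32_t MaxInsts,
                                         uint64_t ColdCount) {
  std::vector<PruneEntry> Out;
  std::unordered_set<std::string> SeenComdat;
  for (const PruneEntry &E : In) {
    if (E.InstCount > MaxInsts)
      continue;  // large: unlikely to be inlined
    if (E.HasProfile && E.EntryCount < ColdCount)
      continue;  // cold according to profile data
    if (E.IsComdat && !SeenComdat.insert(E.Name).second)
      continue;  // a copy of this comdat function is already kept
    Out.push_back(E);
  }
  return Out;
}
```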
>>> >> >> >
>>> >> >> >
>>> >> >> > d. Distributed Build System Integration:
>>> >> >> >
>>> >> >> > For a distributed build system, the gold plugin should write the
>>> >> >> > parallel backend invocations into a makefile, including the mapping
>>> >> >> > from the IR file to the real object file path, and exit. Additional
>>> >> >> > work needs to be done in the distributed build system itself to
>>> >> >> > distribute and dispatch the parallel backend jobs to the build
>>> >> >> > cluster.
>>> >> >> >
>>> >> >> >
>>> >> >> > e. Dependence Tracking and Incremental Compiles:
>>> >> >> >
>>> >> >> > In order to support build systems that stage from local disks or
>>> >> >> > network storage, the plugin will optionally support computation of
>>> >> >> > dependent sets of IR files that each module may import from. This can
>>> >> >> > be computed from profile data, if it exists, or from the symbol table
>>> >> >> > and heuristics if not. These dependence sets also enable support for
>>> >> >> > incremental backend compiles.
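For the no-profile case, one possible symbol-table heuristic is sketched below in C++: module M may import from module D if M references a symbol that D defines. A build system could then stage only those IR files for M's backend job and rerun that job when one of them changes. ModuleSymbols and computeDependenceSets are hypothetical. Note that the sketch records only direct references. Because an imported function can itself call into a third module, a real implementation would probably need the transitive closure of these sets.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Symbol-table view of one IR file: what it defines and what it references.
struct ModuleSymbols {
  std::string Path;
  std::vector<std::string> Defined;
  std::vector<std::string> Referenced;
};

// Direct dependence sets keyed by module path. Ordered containers keep the
// result deterministic, which matters if it is written into build files.
std::map<std::string, std::set<std::string>>
computeDependenceSets(const std::vector<ModuleSymbols> &Mods) {
  std::map<std::string, std::string> DefinedIn;  // symbol -> defining module
  for (const ModuleSymbols &M : Mods)
    for (const std::string &S : M.Defined)
      DefinedIn.emplace(S, M.Path);

  std::map<std::string, std::set<std::string>> Deps;
  for (const ModuleSymbols &M : Mods) {
    std::set<std::string> &D = Deps[M.Path];  // every module gets an entry
    for (const std::string &S : M.Referenced) {
      auto It = DefinedIn.find(S);
      if (It != DefinedIn.end() && It->second != M.Path)
        D.insert(It->second);
    }
  }
  return Deps;
}
```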
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > --
>>> >> >> > Teresa Johnson | Software Engineer | [hidden email] | 408-460-2413
>>> >> >> >
>>> >> >> > _______________________________________________
>>> >> >> > LLVM Developers mailing list
>>> >> >> > [hidden email]         http://llvm.cs.uiuc.edu
>>> >> >> > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>> >> >>
>>> >> >
>>> >> >
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> >> Teresa Johnson | Software Engineer | [hidden email] | 408-460-2413
>>> >>
>>>
>>>
>>>
>>> --
>>> Teresa Johnson | Software Engineer | [hidden email] | 408-460-2413
>>
>>
>
>



Re: RFC: ThinLTO Implementation Plan

Xinliang David Li-2
In reply to this post by Eric Christopher
The end goal is the ability to turn on thin-lto as easily as turning on optimizations like -O2 or -O3 -- we want friendliness, very much :)

David


On Thu, May 14, 2015 at 11:14 AM, Eric Christopher <[hidden email]> wrote:
I'm not sure this is a particularly great assumption to make. We have to support a lot of different build systems and tools and concentrating on something that just binutils uses isn't particularly friendly here. I also can't imagine how it's necessary for any of the lto aspects as currently written in the proposal.

-eric

On Thu, May 14, 2015 at 9:26 AM Xinliang David Li <[hidden email]> wrote:
The design objective is to make ThinLTO mostly transparent to binutils tools to enable easy integration with any build system in the wild. 'Pass-through' mode with 'ld -r' instead of the partial LTO mode is another reason.

David

On Thu, May 14, 2015 at 7:30 AM, Teresa Johnson <[hidden email]> wrote:
On Thu, May 14, 2015 at 7:22 AM, Eric Christopher <[hidden email]> wrote:
> So, what Alex is saying is that we have these tools as well and they
> understand bitcode just fine, as well as every object format - not just ELF.
> :)

Right, there are also LLVM specific versions (llvm-ar, llvm-nm) that
handle bitcode similarly to the way the standard tool + plugin does.
But the goal we are trying to achieve is to allow the standard system
versions of the tools to handle these files without requiring a
plugin. I know the LLVM tool handles other object formats, but I'm not
sure how that helps here? We're not planning to replace those tools,
just allow the standard system versions to handle the intermediate
objects produced by ThinLTO.

Thanks,
Teresa

>
> -eric
>
>
> On Thu, May 14, 2015, 6:55 AM Teresa Johnson <[hidden email]> wrote:
>>
>> On Wed, May 13, 2015 at 11:23 PM, Xinliang David Li
>> <[hidden email]> wrote:
>> >
>> >
>> > On Wed, May 13, 2015 at 10:46 PM, Alex Rosenberg <[hidden email]>
>> > wrote:
>> >>
>> >> "ELF-wrapped bitcode" seems potentially controversial to me.
>> >>
>> >> What about ar, nm, and various ld implementations adds this
>> >> requirement?
>> >> What about the LLVM implementations of these tools is lacking?
>> >
>> >
>> > Sorry, I cannot parse your questions properly. Can you make them clearer?
>>
>> Alex is asking what the issue is with ar, nm, ld -r and regular
>> bitcode that makes using elf-wrapped bitcode easier.
>>
>> The issue is that generally you need to provide a plugin to these
>> tools in order for them to understand and handle bitcode files. We'd
>> like standard tools to work without requiring a plugin as much as
>> possible. And in some cases we want them to be handled differently than
>> the way bitcode files are handled with the plugin.
>>
>> nm: Without a plugin, normal bitcode files are inscrutable. When
>> provided the gold plugin it can emit the symbols.
>>
>> ar: Without a plugin, it will create an archive of bitcode files, but
>> without an index, so it can't be handled by the linker even with a
>> plugin on an -flto link. When ar is provided the gold plugin it does
>> create an index, so the linker + gold plugin handle it appropriately
>> on an -flto link.
>>
>> ld -r: Without a plugin, it fails when provided bitcode inputs. When
>> provided the gold plugin, it handles them but compiles them all the
>> way through to ELF executable instructions via a partial LTO link.
>> This is where we would like to differ in behavior (while also not
>> requiring a plugin) with ELF-wrapped bitcode: we would like the ld -r
>> output file to still contain ELF-wrapped bitcode, delaying the LTO
>> until the full link step.
>>
>> Let me know if that helps address your concerns.
>>
>> Thanks,
>> Teresa
>>
>> >
>> > David
>> >
>> >>
>> >>
>> >> Alex
>> >>
>> >> > On May 13, 2015, at 7:44 PM, Teresa Johnson <[hidden email]>
>> >> > wrote:
>> >
>> >
>>
>>
>>
>> --
>> Teresa Johnson | Software Engineer | [hidden email] | 408-460-2413
>>



--
Teresa Johnson | Software Engineer | [hidden email] | 408-460-2413




Re: RFC: ThinLTO Implementation Plan

Robinson, Paul-3

The friendliest tactic would be to support all object-file formats, not just ELF?

--paulr

 

From: [hidden email] [mailto:[hidden email]] On Behalf Of Xinliang David Li
Sent: Thursday, May 14, 2015 11:54 AM
To: Eric Christopher
Cc: <[hidden email]> List
Subject: Re: [LLVMdev] RFC: ThinLTO Impementation Plan

 

The end goal is the ability to turn on thin-lto as easy as turning optimizations like -O2 or -O3 -- we want friendliness, very much :)

 

David

 

 

On Thu, May 14, 2015 at 11:14 AM, Eric Christopher <[hidden email]> wrote:

I'm not sure this is a particularly great assumption to make. We have to support a lot of different build systems and tools and concentrating on something that just binutils uses isn't particularly friendly here. I also can't imagine how it's necessary for any of the lto aspects as currently written in the proposal.

 

-eric

 

On Thu, May 14, 2015 at 9:26 AM Xinliang David Li <[hidden email]> wrote:

The design objective is to make ThinLTO mostly transparent to binutils tools, to enable easy integration with any build system in the wild. 'Pass-through' mode with 'ld -r', instead of the partial LTO mode, is another reason.

 

David

 

On Thu, May 14, 2015 at 7:30 AM, Teresa Johnson <[hidden email]> wrote:

On Thu, May 14, 2015 at 7:22 AM, Eric Christopher <[hidden email]> wrote:
> So, what Alex is saying is that we have these tools as well and they
> understand bitcode just fine, as well as every object format - not just ELF.
> :)

Right, there are also LLVM specific versions (llvm-ar, llvm-nm) that
handle bitcode similarly to the way the standard tool + plugin does.
But the goal we are trying to achieve is to allow the standard system
versions of the tools to handle these files without requiring a
plugin. I know the LLVM tool handles other object formats, but I'm not
sure how that helps here? We're not planning to replace those tools,
just allow the standard system versions to handle the intermediate
objects produced by ThinLTO.

Thanks,
Teresa


>
> -eric
>
>
> On Thu, May 14, 2015, 6:55 AM Teresa Johnson <[hidden email]> wrote:
>>
>> On Wed, May 13, 2015 at 11:23 PM, Xinliang David Li
>> <[hidden email]> wrote:
>> >
>> >
>> > On Wed, May 13, 2015 at 10:46 PM, Alex Rosenberg <[hidden email]>
>> > wrote:
>> >>
>> >> "ELF-wrapped bitcode" seems potentially controversial to me.
>> >>
>> >> What about ar, nm, and various ld implementations adds this
>> >> requirement?
>> >> What about the LLVM implementations of these tools is lacking?
>> >
>> >
>> >> > Sorry, I cannot parse your questions properly. Can you make them clearer?
>>
>> Alex is asking what the issue is with ar, nm, ld -r and regular
>> bitcode that makes using elf-wrapped bitcode easier.
>>
>> The issue is that generally you need to provide a plugin to these
>> tools in order for them to understand and handle bitcode files. We'd
>> like standard tools to work without requiring a plugin as much as
>> possible. And in some cases we want them to be handled differently
>> from the way bitcode files are handled with the plugin.
>>
>> nm: Without a plugin, normal bitcode files are inscrutable. When
>> provided the gold plugin it can emit the symbols.
>>
>> ar: Without a plugin, it will create an archive of bitcode files, but
>> without an index, so it can't be handled by the linker even with a
>> plugin on an -flto link. When ar is provided the gold plugin it does
>> create an index, so the linker + gold plugin handle it appropriately
>> on an -flto link.
>>
>> ld -r: Without a plugin, it fails when provided bitcode inputs. When
>> provided the gold plugin, it handles them but compiles them all the
>> way through to ELF executable instructions via a partial LTO link.
>> This is where we would like to differ in behavior (while also not
>> requiring a plugin) with ELF-wrapped bitcode: we would like the ld -r
>> output file to still contain ELF-wrapped bitcode, delaying the LTO
>> until the full link step.
>>
>> Let me know if that helps address your concerns.
>>
>> Thanks,
>> Teresa
>>
>> >
>> > David
>> >
>> >>
>> >>
>> >> Alex
>> >>
>> >
>> >
>>
>>
>>
>> --
>> Teresa Johnson | Software Engineer | [hidden email] | 408-460-2413
>>



--
Teresa Johnson | Software Engineer | [hidden email] | 408-460-2413
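The RFC notes that “$LD -r” produces a concatenated .llvmbc section, so gold and the backend driver must recover each original module's bitcode from it. The proposal does not say how module boundaries would be recorded. The C++ below is a minimal sketch that ASSUMES each module is emitted with LLVM's 20-byte bitcode wrapper header (magic 0x0B17C0DE, version, offset, size, CPU type, all little-endian 32-bit fields) and that any linker alignment padding between inputs is zero-filled. ModuleSpan and splitLlvmbcSection are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical: location of one module's raw bitcode within the section.
struct ModuleSpan {
  std::size_t offset;  // where the raw bitcode ('B' 'C' 0xC0 0xDE) starts
  std::size_t size;    // length of the raw bitcode in bytes
};

// Reads a little-endian 32-bit value at pos; the caller checks bounds.
static uint32_t readLE32(const std::vector<uint8_t>& buf, std::size_t pos) {
  return uint32_t(buf[pos]) | (uint32_t(buf[pos + 1]) << 8) |
         (uint32_t(buf[pos + 2]) << 16) | (uint32_t(buf[pos + 3]) << 24);
}

// Hypothetical splitter for concatenated .llvmbc contents. ASSUMPTION: every
// module carries the wrapper header [Magic, Version, Offset, Size, CPUType],
// and linker padding between inputs is zero-filled.
// Returns std::nullopt if the contents do not match that layout.
std::optional<std::vector<ModuleSpan>> splitLlvmbcSection(
    const std::vector<uint8_t>& section) {
  constexpr uint32_t kWrapperMagic = 0x0B17C0DE;
  constexpr std::size_t kHeaderSize = 20;
  std::vector<ModuleSpan> modules;
  std::size_t pos = 0;
  while (true) {
    // Skip alignment padding; a wrapper header never starts with 0x00.
    while (pos < section.size() && section[pos] == 0) ++pos;
    if (pos == section.size()) break;
    const std::size_t remaining = section.size() - pos;
    if (remaining < kHeaderSize || readLE32(section, pos) != kWrapperMagic)
      return std::nullopt;
    // Offset is measured from the start of this header.
    const std::size_t offset = readLE32(section, pos + 8);
    const std::size_t size = readLE32(section, pos + 12);
    if (offset < kHeaderSize || offset > remaining ||
        size > remaining - offset)
      return std::nullopt;
    modules.push_back({pos + offset, size});
    pos += offset + size;
  }
  return modules;
}
```

Relying on the wrapper's size field avoids decoding the bitstream itself. An alternative that needs no wrapper is to walk the top-level bitcode blocks, whose lengths are recorded in the stream, at the cost of a bitstream reader.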

 

 



Re: RFC: ThinLTO Implementation Plan

Eric Christopher
In reply to this post by Daniel Berlin


On Thu, May 14, 2015 at 11:34 AM Daniel Berlin <[hidden email]> wrote:
On Thu, May 14, 2015 at 11:14 AM, Eric Christopher <[hidden email]> wrote:
> I'm not sure this is a particularly great assumption to make.

Which part?

The binutils part :)
 

>  We have to
> support a lot of different build systems and tools and concentrating on
> something that just binutils uses isn't particularly friendly here.
I think you may have misunderstood.
His point was exactly that they want to be transparent to *all of* these tools.
You are saying "we should be friendly to everyone". He is saying the same thing.
We should be friendly to everyone. The friendly way to do this is to
not require plugins for all of these tools to handle bitcode.

Hence, elf-wrapped bitcode.

Oh, I understood. I just don't know that I agree. To do anything with the tools will require some knowledge of bitcode anyhow, or the plugin. I'm saying that as a starting baseline we should look at how to do this using the tools we've got, rather than wrapping things for no real gain.

I've talked to Teresa a bit offline and we're going to talk more later (and discuss on the list), but there are some discussions about how to make this work with just bitcode and the LLVM tools, so that it doesn't require integration on every platform. That is what I consider particularly friendly :)

-eric
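Eric's point that any tool needs some knowledge of bitcode holds even with the wrapper: an ELF tool sees an ordinary section, but only a bitcode reader can use its contents. As a minimal sketch of the ELF side alone (LLVM's Object library handles this in practice), the C++ below locates a named section such as .llvmbc in a 64-bit little-endian ELF file by walking the section header table and the section-name string table. SectionRange and findElfSection are hypothetical names; ELF32, big-endian files, and extended section numbering are deliberately not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <string>
#include <vector>

// Hypothetical: file offset and size of one section's contents.
struct SectionRange {
  uint64_t offset;
  uint64_t size;
};

// Reads an n-byte little-endian integer at pos; the caller checks bounds.
static uint64_t loadLE(const std::vector<uint8_t>& b, uint64_t pos, int n) {
  uint64_t v = 0;
  for (int i = n - 1; i >= 0; --i) v = (v << 8) | b[pos + i];
  return v;
}

// Hypothetical lookup of a section by name. Sketch only: ELF64
// little-endian; no ELF32, big-endian, or extended section numbering.
std::optional<SectionRange> findElfSection(const std::vector<uint8_t>& f,
                                           const std::string& name) {
  // e_ident: magic, EI_CLASS == ELFCLASS64 (2), EI_DATA == ELFDATA2LSB (1).
  if (f.size() < 64 || std::memcmp(f.data(), "\x7f" "ELF", 4) != 0 ||
      f[4] != 2 || f[5] != 1)
    return std::nullopt;
  const uint64_t shoff = loadLE(f, 0x28, 8);      // e_shoff
  const uint64_t shentsize = loadLE(f, 0x3A, 2);  // e_shentsize
  const uint64_t shnum = loadLE(f, 0x3C, 2);      // e_shnum
  const uint64_t shstrndx = loadLE(f, 0x3E, 2);   // e_shstrndx
  if (shentsize < 64 || shstrndx >= shnum || shoff > f.size() ||
      shnum * shentsize > f.size() - shoff)
    return std::nullopt;
  auto hdr = [&](uint64_t i) { return shoff + i * shentsize; };

  // Contents of the section-name string table (.shstrtab).
  const uint64_t strOff = loadLE(f, hdr(shstrndx) + 0x18, 8);   // sh_offset
  const uint64_t strSize = loadLE(f, hdr(shstrndx) + 0x20, 8);  // sh_size
  if (strOff > f.size() || strSize > f.size() - strOff) return std::nullopt;

  for (uint64_t i = 0; i < shnum; ++i) {
    const uint64_t nameOff = loadLE(f, hdr(i), 4);  // sh_name
    if (nameOff >= strSize || name.size() >= strSize - nameOff) continue;
    const uint8_t* s = f.data() + strOff + nameOff;
    // Match the full NUL-terminated name, not just a prefix.
    if (std::memcmp(s, name.data(), name.size()) != 0 || s[name.size()] != 0)
      continue;
    const uint64_t off = loadLE(f, hdr(i) + 0x18, 8);   // sh_offset
    const uint64_t size = loadLE(f, hdr(i) + 0x20, 8);  // sh_size
    if (off > f.size() || size > f.size() - off) return std::nullopt;
    return SectionRange{off, size};
  }
  return std::nullopt;
}
```

The returned range covers only the raw bytes of the section; interpreting them as modules, symbols, or function summaries still requires bitcode-aware code, which is the crux of the disagreement above.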
 


> I also
> can't imagine how it's necessary for any of the lto aspects as currently
> written in the proposal.
>
> -eric
>
> On Thu, May 14, 2015 at 9:26 AM Xinliang David Li <[hidden email]>
> wrote:
>>
>> The design objective is to make ThinLTO mostly transparent to binutils
>> tools, to enable easy integration with any build system in the wild.
>> 'Pass-through' mode with 'ld -r', instead of the partial LTO mode, is
>> another reason.
>>
>> David
>>
>> On Thu, May 14, 2015 at 7:30 AM, Teresa Johnson <[hidden email]>
>> wrote:
>>>
>>> On Thu, May 14, 2015 at 7:22 AM, Eric Christopher <[hidden email]>
>>> wrote:
>>> > So, what Alex is saying is that we have these tools as well and they
>>> > understand bitcode just fine, as well as every object format -
>>> > not just ELF.
>>> > :)
>>>
>>> Right, there are also LLVM specific versions (llvm-ar, llvm-nm) that
>>> handle bitcode similarly to the way the standard tool + plugin does.
>>> But the goal we are trying to achieve is to allow the standard system
>>> versions of the tools to handle these files without requiring a
>>> plugin. I know the LLVM tool handles other object formats, but I'm not
>>> sure how that helps here? We're not planning to replace those tools,
>>> just allow the standard system versions to handle the intermediate
>>> objects produced by ThinLTO.
>>>
>>> Thanks,
>>> Teresa
>>>
>>> >
>>> > -eric
>>> >
>>> >
>>> > On Thu, May 14, 2015, 6:55 AM Teresa Johnson <[hidden email]>
>>> > wrote:
>>> >>
>>> >> On Wed, May 13, 2015 at 11:23 PM, Xinliang David Li
>>> >> <[hidden email]> wrote:
>>> >> >
>>> >> >
>>> >> > On Wed, May 13, 2015 at 10:46 PM, Alex Rosenberg
>>> >> > <[hidden email]>
>>> >> > wrote:
>>> >> >>
>>> >> >> "ELF-wrapped bitcode" seems potentially controversial to me.
>>> >> >>
>>> >> >> What about ar, nm, and various ld implementations adds this
>>> >> >> requirement?
>>> >> >> What about the LLVM implementations of these tools is lacking?
>>> >> >
>>> >> >
>>> >> > Sorry, I cannot parse your questions properly. Can you make them
>>> >> > clearer?
>>> >>
>>> >> Alex is asking what the issue is with ar, nm, ld -r and regular
>>> >> bitcode that makes using elf-wrapped bitcode easier.
>>> >>
>>> >> The issue is that generally you need to provide a plugin to these
>>> >> tools in order for them to understand and handle bitcode files. We'd
>>> >> like standard tools to work without requiring a plugin as much as
>>> >> possible. And in some cases we want them to be handled differently
>>> >> from the way bitcode files are handled with the plugin.
>>> >>
>>> >> nm: Without a plugin, normal bitcode files are inscrutable. When
>>> >> provided the gold plugin it can emit the symbols.
>>> >>
>>> >> ar: Without a plugin, it will create an archive of bitcode files, but
>>> >> without an index, so it can't be handled by the linker even with a
>>> >> plugin on an -flto link. When ar is provided the gold plugin it does
>>> >> create an index, so the linker + gold plugin handle it appropriately
>>> >> on an -flto link.
>>> >>
>>> >> ld -r: Without a plugin, it fails when provided bitcode inputs. When
>>> >> provided the gold plugin, it handles them but compiles them all the
>>> >> way through to ELF executable instructions via a partial LTO link.
>>> >> This is where we would like to differ in behavior (while also not
>>> >> requiring a plugin) with ELF-wrapped bitcode: we would like the ld -r
>>> >> output file to still contain ELF-wrapped bitcode, delaying the LTO
>>> >> until the full link step.
>>> >>
>>> >> Let me know if that helps address your concerns.
>>> >>
>>> >> Thanks,
>>> >> Teresa
>>> >>
>>> >> >
>>> >> > David
>>> >> >
>>> >> >>
>>> >> >>
>>> >> >> Alex
>>> >> >>
>>> >> >> > On May 13, 2015, at 7:44 PM, Teresa Johnson
>>> >> >> > <[hidden email]>
>>> >> >> > wrote:
>>> >> >> >
>>> >> >> > I've included below an RFC for implementing ThinLTO in LLVM,
>>> >> >> > looking forward to feedback and questions.
>>> >> >> > Thanks!
>>> >> >> > Teresa
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > RFC to discuss plans for implementing ThinLTO upstream. Background
>>> >> >> > can be found in slides from EuroLLVM 2015:
>>> >> >> > https://drive.google.com/open?id=0B036uwnWM6RWWER1ZEl5SUNENjQ&authuser=0)
>>> >> >> > As described in the talk, we have a prototype implementation, and
>>> >> >> > would like to start staging patches upstream. This RFC describes a
>>> >> >> > breakdown of the major pieces. We would like to commit upstream
>>> >> >> > gradually in several stages, with all functionality off by default.
>>> >> >> > The core ThinLTO importing support and tuning will require frequent
>>> >> >> > change and iteration during testing and tuning, and for that part we
>>> >> >> > would like to commit rapidly (off by default). See the proposed
>>> >> >> > staged implementation described in the Implementation Plan section.
>>> >> >> >
>>> >> >> >
>>> >> >> > ThinLTO Overview
>>> >> >> > ==============
>>> >> >> >
>>> >> >> > See the talk slides linked above for more details. The following is a
>>> >> >> > high-level overview of the motivation.
>>> >> >> >
>>> >> >> > Cross Module Optimization (CMO) is an effective means for improving
>>> >> >> > runtime performance by extending the scope of optimizations across
>>> >> >> > source module boundaries. Without CMO, the compiler is limited to
>>> >> >> > optimizing within the scope of single source modules. Two solutions
>>> >> >> > for enabling CMO are Link-Time Optimization (LTO), which is currently
>>> >> >> > supported in LLVM and GCC, and Lightweight-Interprocedural
>>> >> >> > Optimization (LIPO). However, each of these solutions has limitations
>>> >> >> > that prevent it from being enabled by default. ThinLTO is a new
>>> >> >> > approach that attempts to address these limitations, with a goal of
>>> >> >> > being enabled more broadly. ThinLTO is designed with many of the same
>>> >> >> > principles as LIPO, and therefore shares its advantages, without any
>>> >> >> > of its inherent weaknesses. Unlike LIPO, where the module group
>>> >> >> > decision is made at profile training runtime, ThinLTO makes the
>>> >> >> > decision at compile time, but in a lazy mode that facilitates
>>> >> >> > large-scale parallelism. The serial linker plugin phase is designed
>>> >> >> > to be razor thin and blazingly fast. By default this step only does
>>> >> >> > minimal preparation work to enable the parallel lazy importing
>>> >> >> > performed later. ThinLTO aims to be scalable like a regular O2 build,
>>> >> >> > enabling CMO on machines without large memory configurations, while
>>> >> >> > also integrating well with distributed build systems. Results from
>>> >> >> > early prototyping on SPEC cpu2006 C++ benchmarks are in line with
>>> >> >> > expectations that ThinLTO can scale like O2 while enabling much of
>>> >> >> > the CMO performed during a full LTO build.
>>> >> >> >
>>> >> >> >
>>> >> >> > A ThinLTO build is divided into 3 phases, which are referred to in
>>> >> >> > the following implementation plan:
>>> >> >> >
>>> >> >> > phase-1: IR and Function Summary Generation (-c compile)
>>> >> >> > phase-2: Thin Linker Plugin Layer (thin archive linker step)
>>> >> >> > phase-3: Parallel Backend with Demand-Driven Importing
>>> >> >> >
>>> >> >> >
>>> >> >> > Implementation Plan
>>> >> >> > ================
>>> >> >> >
>>> >> >> > This section gives a high-level breakdown of the ThinLTO support that
>>> >> >> > will be added, in roughly the order that the patches would be staged.
>>> >> >> > The patches are divided into three stages. The first stage contains a
>>> >> >> > minimal amount of preparation work that is not ThinLTO-specific. The
>>> >> >> > second stage contains most of the infrastructure for ThinLTO, which
>>> >> >> > will be off by default. The third stage includes
>>> >> >> > enhancements/improvements/tunings that can be performed after the main
>>> >> >> > ThinLTO infrastructure is in.
>>> >> >> >
>>> >> >> > The second and third implementation stages will initially be very
>>> >> >> > volatile, requiring a lot of iteration and tuning with large apps to
>>> >> >> > stabilize. Therefore it will be important to do fast commits for
>>> >> >> > these implementation stages.
>>> >> >> >
>>> >> >> >
>>> >> >> > 1. Stage 1: Preparation
>>> >> >> > -------------------------------
>>> >> >> >
>>> >> >> > The first planned sets of patches are enablers for ThinLTO work:
>>> >> >> >
>>> >> >> >
>>> >> >> > a. LTO directory structure:
>>> >> >> >
>>> >> >> > Restructure the LTO directory to remove the circular dependence
>>> >> >> > introduced when the ThinLTO pass is added. Because ThinLTO is being
>>> >> >> > implemented as an SCC pass within Transforms/IPO, and leverages the
>>> >> >> > LTOModule class for linking in functions from modules, IPO then
>>> >> >> > requires the LTO library. This creates a circular dependence between
>>> >> >> > LTO and IPO. To break that, we need to split the lib/LTO
>>> >> >> > directory/library into lib/LTO/CodeGen and lib/LTO/Module, containing
>>> >> >> > LTOCodeGenerator and LTOModule, respectively. Only LTOCodeGenerator
>>> >> >> > has a dependence on IPO, removing the circular dependence.
>>> >> >> >
>>> >> >> >
>>> >> >> > b. ELF wrapper generation support:
>>> >> >> >
>>> >> >> > Implement an ELF-wrapped bitcode writer. In order to more easily
>>> >> >> > interact with tools such as $AR, $NM, and “$LD -r”, we plan to emit
>>> >> >> > the phase-1 bitcode wrapped in ELF via the .llvmbc section, along
>>> >> >> > with a symbol table. The goal is both to interact with these tools
>>> >> >> > without requiring a plugin, and also to avoid doing partial
>>> >> >> > LTO/ThinLTO across files linked with “$LD -r” (i.e. the resulting
>>> >> >> > object file should still contain ELF-wrapped bitcode to enable
>>> >> >> > ThinLTO at the full link step). I will send a separate design
>>> >> >> > document for these changes, but the following is a high-level
>>> >> >> > overview.
>>> >> >> >
>>> >> >> > Support was added to LLVM for reading ELF-wrapped bitcode
>>> >> >> > (http://reviews.llvm.org/rL218078), but LLVM/Clang does not yet
>>> >> >> > support emitting bitcode wrapped in ELF. I plan to add support for
>>> >> >> > optionally generating bitcode in an ELF file containing a single
>>> >> >> > .llvmbc section holding the bitcode. Specifically, the patch would
>>> >> >> > add the new options “emit-llvm-bc-elf” (object file) and the
>>> >> >> > corresponding “emit-llvm-elf” (textual assembly equivalent).
>>> >> >> > Eventually these would be triggered automatically under
>>> >> >> > “-fthinlto -c” and “-fthinlto -S”, respectively.
>>> >> >> >
>>> >> >> > Additionally, a symbol table will be generated in the ELF file,
>>> >> >> > holding the function symbols within the bitcode. This facilitates
>>> >> >> > handling archives of the ELF-wrapped bitcode created with $AR, since
>>> >> >> > the archive will have a symbol table as well. The archive symbol
>>> >> >> > table enables gold to extract the constituent ELF-wrapped bitcode
>>> >> >> > files and pass them to the plugin. To support the concatenated
>>> >> >> > .llvmbc section generated by “$LD -r”, some handling needs to be
>>> >> >> > added to gold and to the backend driver to process each original
>>> >> >> > module’s bitcode.
>>> >> >> >
>>> >> >> > The function index/summary will later be added as a special ELF
>>> >> >> > section alongside the .llvmbc sections.
>>> >> >> >
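The .llvmbc payload is ordinary LLVM bitcode, which starts with the magic bytes 'B', 'C', 0xC0, 0xDE. Below is a minimal C++ sketch of how a backend driver might locate the original modules inside a section concatenated by “$LD -r”, assuming each module starts on a 4-byte boundary. The scan is naive: those bytes could also occur inside a module's data, so a real implementation would need explicit module boundaries. `findModuleStarts` is a hypothetical helper, not an existing LLVM API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Raw LLVM bitcode begins with the bytes 'B', 'C', 0xC0, 0xDE.
static bool isRawBitcodeAt(const std::vector<uint8_t> &Buf, std::size_t Off) {
  return Off + 4 <= Buf.size() && Buf[Off] == 'B' && Buf[Off + 1] == 'C' &&
         Buf[Off + 2] == 0xC0 && Buf[Off + 3] == 0xDE;
}

// Naive sketch: return each 4-byte-aligned offset in a (possibly
// concatenated) .llvmbc section at which the bitcode magic appears.
// Assumes every module starts on a 4-byte boundary; the magic bytes could
// also occur inside a module's data, which this scan cannot rule out.
std::vector<std::size_t> findModuleStarts(const std::vector<uint8_t> &Section) {
  std::vector<std::size_t> Starts;
  for (std::size_t Off = 0; Off + 4 <= Section.size(); Off += 4)
    if (isRawBitcodeAt(Section, Off))
      Starts.push_back(Off);
  return Starts;
}
```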
>>> >> >> >
>>> >> >> > 2. Stage 2: ThinLTO Infrastructure
>>> >> >> > ----------------------------------------------
>>> >> >> >
>>> >> >> > The next set of patches adds the base implementation of the ThinLTO
>>> >> >> > infrastructure, specifically the pieces required to make ThinLTO
>>> >> >> > functional and generate correct, but not necessarily
>>> >> >> > high-performing, binaries. It also does not include the support
>>> >> >> > needed to make debug info (-g) efficient with ThinLTO.
>>> >> >> >
>>> >> >> >
>>> >> >> > a. Clang/LLVM/gold linker options:
>>> >> >> >
>>> >> >> > An early set of clang/llvm patches is needed to provide options to
>>> >> >> > enable ThinLTO (off by default), so that the rest of the
>>> >> >> > implementation can be disabled by default as it is added.
>>> >> >> > Specifically, the clang option -fthinlto (used instead of -flto)
>>> >> >> > will cause clang to invoke the phase-1 emission of LLVM bitcode and
>>> >> >> > function summary/index on a compile step, and to pass the
>>> >> >> > appropriate option to the gold plugin on a link step. The -thinlto
>>> >> >> > option will be added to the gold plugin and the llvm-lto tool to
>>> >> >> > launch the phase-2 thin archive step. The -thinlto option will also
>>> >> >> > be added to the ‘opt’ tool to invoke it as a phase-3 parallel backend
>>> >> >> > instance.
>>> >> >> >
>>> >> >> >
>>> >> >> > b. Thin-archive linking support in Gold plugin and llvm-lto:
>>> >> >> >
>>> >> >> > Under the new plugin option (see above), the plugin needs to perform
>>> >> >> > the phase-2 (thin archive) link, which simply emits a combined
>>> >> >> > function map from the linked modules without actually performing the
>>> >> >> > normal link. Corresponding support should be added to the standalone
>>> >> >> > llvm-lto tool to enable testing/debugging without involving the
>>> >> >> > linker and plugin.
>>> >> >> >
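A sketch of what the phase-2 step computes, assuming each module's index maps function names to a bitcode offset and an instruction count. The types and `buildCombinedMap` are hypothetical; the point is that this step only reads the per-module indexes and never loads or optimizes IR, which is what keeps it thin.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical summary stored per function in a module's index section.
struct FunctionSummary {
  uint64_t BitcodeOffset; // where the function body starts in the IR file
  unsigned NumInsts;      // simple size metric for import heuristics
};

// Hypothetical view of one module's function index/summary section.
struct ModuleIndex {
  std::string ModulePath;
  std::map<std::string, FunctionSummary> Functions;
};

// One entry of the combined function map: which file defines the function,
// plus that function's summary.
struct CombinedEntry {
  std::string ModulePath;
  FunctionSummary Summary;
};

// Phase-2 sketch: merge the per-module indexes into one map keyed by
// function name. No IR is loaded and no code is generated. std::map keeps
// the entries sorted, so later lookups are logarithmic; if a name occurs in
// several modules (e.g. comdat copies), emplace keeps the first one seen.
std::map<std::string, CombinedEntry>
buildCombinedMap(const std::vector<ModuleIndex> &Modules) {
  std::map<std::string, CombinedEntry> Combined;
  for (const ModuleIndex &M : Modules)
    for (const auto &F : M.Functions)
      Combined.emplace(F.first, CombinedEntry{M.ModulePath, F.second});
  return Combined;
}
```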
>>> >> >> >
>>> >> >> > c. ThinLTO backend support:
>>> >> >> >
>>> >> >> > Support for a phase-3 backend invocation (including importing) on a
>>> >> >> > module should be added to the ‘opt’ tool under the new option. The
>>> >> >> > main change under the option is to instantiate a Linker object used
>>> >> >> > to manage the process of linking imported functions into the module,
>>> >> >> > to efficiently read the combined function map, and to enable the
>>> >> >> > ThinLTO import pass.
>>> >> >> >
>>> >> >> >
>>> >> >> > d. Function index/summary support:
>>> >> >> >
>>> >> >> > This includes infrastructure for writing and reading the function
>>> >> >> > index/summary section. As noted earlier, this will be encoded in a
>>> >> >> > special ELF section within the module, alongside the .llvmbc section
>>> >> >> > containing the bitcode. The thin archive generated by phase-2 of
>>> >> >> > ThinLTO simply contains all of the function index/summary sections
>>> >> >> > across the linked modules, organized for efficient function lookup.
>>> >> >> >
>>> >> >> > Each function available for importing from the module has an entry
>>> >> >> > in the module’s function index/summary section and in the resulting
>>> >> >> > combined function map. Each function entry contains that function’s
>>> >> >> > offset within the bitcode file, used to efficiently locate and
>>> >> >> > import just that function. The entry also contains summary
>>> >> >> > information (e.g. basic information determined during parsing, such
>>> >> >> > as the number of instructions in the function) that will be used to
>>> >> >> > help guide later import decisions. Because the contents of this
>>> >> >> > section will change frequently during ThinLTO tuning, it should also
>>> >> >> > be marked with a version id for backwards compatibility or version
>>> >> >> > checking.
>>> >> >> >
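The version id matters because a reader must refuse a layout it does not understand rather than misparse it. Below is a minimal C++ sketch of a versioned index section, assuming a made-up little-endian layout (version, entry count, then name, bitcode offset and instruction count per function); the actual encoding is not specified in this proposal, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical section layout (all integers little-endian):
//   u32 version, u32 entry count, then for each function:
//   u32 name length, name bytes, u64 bitcode offset, u32 instruction count.
constexpr uint32_t kIndexVersion = 1; // bump whenever the layout changes

struct IndexEntry {
  std::string Name;
  uint64_t BitcodeOffset;
  uint32_t NumInsts;
};

static void putLE(std::vector<uint8_t> &Out, uint64_t V, int Bytes) {
  for (int I = 0; I < Bytes; ++I)
    Out.push_back(static_cast<uint8_t>(V >> (8 * I)));
}

static bool getLE(const std::vector<uint8_t> &In, std::size_t &Pos, int Bytes,
                  uint64_t &V) {
  if (Pos + Bytes > In.size())
    return false; // truncated input
  V = 0;
  for (int I = 0; I < Bytes; ++I)
    V |= static_cast<uint64_t>(In[Pos + I]) << (8 * I);
  Pos += Bytes;
  return true;
}

std::vector<uint8_t> writeIndex(const std::vector<IndexEntry> &Entries) {
  std::vector<uint8_t> Out;
  putLE(Out, kIndexVersion, 4);
  putLE(Out, Entries.size(), 4);
  for (const IndexEntry &E : Entries) {
    putLE(Out, E.Name.size(), 4);
    Out.insert(Out.end(), E.Name.begin(), E.Name.end());
    putLE(Out, E.BitcodeOffset, 8);
    putLE(Out, E.NumInsts, 4);
  }
  return Out;
}

// Returns false on a version mismatch or truncated input, so a section
// written by a different version is rejected instead of misparsed.
bool readIndex(const std::vector<uint8_t> &In, std::vector<IndexEntry> &Out) {
  std::size_t Pos = 0;
  uint64_t Version = 0, Count = 0;
  if (!getLE(In, Pos, 4, Version) || Version != kIndexVersion)
    return false;
  if (!getLE(In, Pos, 4, Count))
    return false;
  Out.clear();
  for (uint64_t I = 0; I < Count; ++I) {
    uint64_t Len = 0, Off = 0, Insts = 0;
    if (!getLE(In, Pos, 4, Len) || Pos + Len > In.size())
      return false;
    std::string Name(reinterpret_cast<const char *>(In.data()) + Pos, Len);
    Pos += Len;
    if (!getLE(In, Pos, 8, Off) || !getLE(In, Pos, 4, Insts))
      return false;
    Out.push_back({Name, Off, static_cast<uint32_t>(Insts)});
  }
  return true;
}
```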
>>> >> >> >
>>> >> >> > e. ThinLTO importing support:
>>> >> >> >
>>> >> >> > This covers the mechanics of importing functions from other modules,
>>> >> >> > which can go in gradually as a set of patches since it will be off
>>> >> >> > by default. Separate patches can include:
>>> >> >> >
>>> >> >> > - BitcodeReader changes to use the function index to
>>> >> >> > import/deserialize a single function of interest (small changes,
>>> >> >> > leveraging existing lazy streamer support).
>>> >> >> >
>>> >> >> > - Minor LTOModule changes to pass the ThinLTO function to import, and
>>> >> >> > its index, into the bitcode reader.
>>> >> >> >
>>> >> >> > - Marking of imported functions (for use in ThinLTO-specific symbol
>>> >> >> > linking and global DCE, for example). This can be in-memory
>>> >> >> > initially, but IR support may be required in order to support
>>> >> >> > streaming bitcode out and back in again after importing.
>>> >> >> >
>>> >> >> > - ModuleLinker changes to do ThinLTO-specific symbol linking and
>>> >> >> > static promotion when necessary. The linkage type of imported
>>> >> >> > functions changes to AvailableExternallyLinkage, for example.
>>> >> >> > Statics must be promoted in certain cases, and renamed in consistent
>>> >> >> > ways.
>>> >> >> >
>>> >> >> > - GlobalDCE changes to support removing imported functions that were
>>> >> >> > not inlined (very small changes to existing pass logic).
>>> >> >> >
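A simplified C++ model of the symbol handling listed above: imported definitions become available_externally, statics referenced from exported code are promoted and renamed, and imported functions that were not inlined are dropped. The `.llvm.<module id>` suffix and all names here are hypothetical; the only property that matters is that the defining module and every importer derive the same new name.

```cpp
#include <cassert>
#include <string>
#include <vector>

// A small subset of LLVM-style linkage kinds, modeled as a plain enum.
enum class Linkage { External, Internal, AvailableExternally };

struct GlobalSym {
  std::string Name;
  Linkage L;
  bool Imported;   // set when the definition came from another module
  bool Referenced; // still used after inlining (computed elsewhere)
};

// An imported definition stays usable for inlining, but with
// available_externally linkage no code is emitted for it in this module.
void markImported(GlobalSym &F) {
  F.L = Linkage::AvailableExternally;
  F.Imported = true;
}

// A static referenced from exported code must become externally visible.
// ModuleId identifies the defining module, so the exporter and every
// importer derive the same new name. The ".llvm." suffix is hypothetical.
void promoteLocal(GlobalSym &S, const std::string &ModuleId) {
  if (S.L != Linkage::Internal)
    return;
  S.Name += ".llvm." + ModuleId;
  S.L = Linkage::External;
}

// GlobalDCE-style cleanup: drop imported functions that were not inlined,
// i.e. that are no longer referenced; everything else is kept.
void dropUnusedImports(std::vector<GlobalSym> &Syms) {
  std::vector<GlobalSym> Kept;
  for (const GlobalSym &S : Syms)
    if (!S.Imported || S.Referenced)
      Kept.push_back(S);
  Syms.swap(Kept);
}
```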
>>> >> >> >
>>> >> >> > f. ThinLTO Import Driver SCC pass:
>>> >> >> >
>>> >> >> > Adds Transforms/IPO/ThinLTO.cpp with a framework for doing ThinLTO
>>> >> >> > via an SCC pass, enabled only under the -fthinlto options. The pass
>>> >> >> > includes utilizing the thin archive (global function index/summary),
>>> >> >> > import decision heuristics, invocation of the LTOModule/ModuleLinker
>>> >> >> > routines that perform the import, and any necessary callgraph
>>> >> >> > updates and verification.
>>> >> >> >
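A sketch of the simplest possible import decision, assuming the combined map supplies an instruction count per function and using a hypothetical size threshold: import a callee only if it is not defined locally, has a summary, and is small. `selectImports` and `Summary` are illustrative names. A real pass would also revisit newly imported bodies, since their callees may in turn be worth importing.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical summary entry from the combined function map.
struct Summary {
  std::string ModulePath;
  unsigned NumInsts;
};

// Decide which callees to import into the current module: a callee qualifies
// if it is not defined locally, appears in the combined map, and its summary
// says it is small enough to be a likely inlining candidate.
std::vector<std::string>
selectImports(const std::vector<std::string> &Callees,
              const std::set<std::string> &LocalDefs,
              const std::map<std::string, Summary> &Combined,
              unsigned InstThreshold) {
  std::vector<std::string> ToImport;
  std::set<std::string> Seen;
  for (const std::string &Callee : Callees) {
    if (LocalDefs.count(Callee) || !Seen.insert(Callee).second)
      continue; // defined here already, or already considered
    auto It = Combined.find(Callee);
    if (It == Combined.end())
      continue; // no summary, so not importable
    if (It->second.NumInsts <= InstThreshold)
      ToImport.push_back(Callee);
  }
  return ToImport;
}
```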
>>> >> >> >
>>> >> >> > g. Backend Driver:
>>> >> >> >
>>> >> >> > For a single-node build, the gold plugin can simply write a makefile
>>> >> >> > and fork the parallel backend instances directly via parallel make.
>>> >> >> >
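A sketch of the makefile the plugin might write, assuming one rule per module that maps an IR file to its real object file path. `BackendCmd` stands in for the phase-3 invocation, whose exact arguments are left open here, and `writeBackendMakefile` is a hypothetical helper. The same file fits the distributed-build case in Stage 3 (d), where the plugin writes it and exits instead of running make.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One phase-3 backend job: compile an IR file into a native object.
struct BackendJob {
  std::string IRFile;
  std::string ObjFile;
};

// Emit a makefile with one rule per module so `make -jN` runs the backends
// in parallel. BackendCmd is a placeholder for the phase-3 command.
std::string writeBackendMakefile(const std::vector<BackendJob> &Jobs,
                                 const std::string &BackendCmd) {
  std::string MF = "all:";
  for (const BackendJob &J : Jobs)
    MF += " " + J.ObjFile;
  MF += "\n\n";
  for (const BackendJob &J : Jobs) {
    MF += J.ObjFile + ": " + J.IRFile + "\n";
    // Recipe lines in make must start with a tab character.
    MF += "\t" + BackendCmd + " " + J.IRFile + " -o " + J.ObjFile + "\n\n";
  }
  return MF;
}
```

Running `make -j8 -f <file>` on the result would then execute up to eight backend jobs at once.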
>>> >> >> >
>>> >> >> > 3. Stage 3: ThinLTO Tuning and Enhancements
>>> >> >> > ----------------------------------------------------------------
>>> >> >> >
>>> >> >> > This stage covers patches that are not required for ThinLTO to work,
>>> >> >> > but that instead improve compile time, memory usage, run-time
>>> >> >> > performance and usability.
>>> >> >> >
>>> >> >> >
>>> >> >> > a. Lazy Debug Metadata Linking:
>>> >> >> >
>>> >> >> > The prototype implementation included lazy importing of module-level
>>> >> >> > metadata during the ThinLTO pass finalization (i.e. after all
>>> >> >> > function importing is complete). This actually applies to all
>>> >> >> > module-level metadata, not just debug metadata, although debug
>>> >> >> > metadata is the largest. This can be added as a separate set of
>>> >> >> > patches, with changes to BitcodeReader, ValueMapper, and
>>> >> >> > ModuleLinker.
>>> >> >> >
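A toy C++ model of the lazy scheme, assuming metadata nodes can be identified by integer ids: during importing, only the ids referenced by each imported function are recorded, and the union is linked once at finalization, so a node shared by many imported functions is mapped once. `LazyMetadataLinker` is a hypothetical name, and real metadata linking must also follow references between nodes, which this sketch ignores.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical model: each imported function references some module-level
// metadata nodes, identified here by integer ids.
struct ImportedFunction {
  std::vector<unsigned> MetadataRefs;
};

// Instead of mapping metadata once per imported function, record the ids
// needed during import and hand back the de-duplicated set at finalization.
class LazyMetadataLinker {
  std::set<unsigned> Pending;

public:
  void noteImport(const ImportedFunction &F) {
    Pending.insert(F.MetadataRefs.begin(), F.MetadataRefs.end());
  }
  // Returns the ids to materialize, each exactly once, and resets the state.
  std::vector<unsigned> finalize() {
    std::vector<unsigned> Out(Pending.begin(), Pending.end());
    Pending.clear();
    return Out;
  }
};
```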
>>> >> >> >
>>> >> >> > b. Import Tuning:
>>> >> >> >
>>> >> >> > Tuning the import strategy will be an iterative process that will
>>> >> >> > continue to be refined over time. It involves several different
>>> >> >> > types of changes: adding support for recording additional metrics in
>>> >> >> > the function summary, such as profile data and optional
>>> >> >> > heavier-weight IPA analyses, and tuning the import heuristics based
>>> >> >> > on the summary and callsite context.
>>> >> >> >
>>> >> >> >
>>> >> >> > c. Combined Function Map Pruning:
>>> >> >> >
>>> >> >> > The combined function map can be pruned of functions that are
>>> >> >> > unlikely to benefit from being imported. For example, during the
>>> >> >> > phase-2 thin archive plugin step we can safely omit large and (with
>>> >> >> > profile data) cold functions, which are unlikely to benefit from
>>> >> >> > being inlined. Additionally, all but one copy of comdat functions
>>> >> >> > can be suppressed.
>>> >> >> >
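A sketch of these pruning rules, assuming each candidate carries an instruction count and an optional profile entry count. The size threshold and treating an entry count of zero as cold are hypothetical choices, and candidates sharing a name are assumed to be comdat copies, of which only the first is kept.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical candidate entry for the combined function map.
struct Candidate {
  std::string Name;
  std::string ModulePath;
  unsigned NumInsts;   // size metric from the function summary
  bool HasProfile;     // whether profile data covers this function
  uint64_t EntryCount; // profile entry count, meaningful only if HasProfile
};

// Keep only entries likely to be worth importing. MaxInsts and the
// "entry count of zero means cold" rule are illustrative choices.
std::map<std::string, Candidate>
pruneCombinedMap(const std::vector<Candidate> &All, unsigned MaxInsts) {
  std::map<std::string, Candidate> Kept;
  for (const Candidate &C : All) {
    if (C.NumInsts > MaxInsts)
      continue; // too large to be a likely inlining candidate
    if (C.HasProfile && C.EntryCount == 0)
      continue; // cold according to the profile
    // Same-named entries are assumed to be comdat copies; emplace does not
    // overwrite, so only the first copy survives.
    Kept.emplace(C.Name, C);
  }
  return Kept;
}
```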
>>> >> >> >
>>> >> >> > d. Distributed Build System Integration:
>>> >> >> >
>>> >> >> > For a distributed build system, the gold plugin should write the
>>> >> >> > parallel backend invocations into a makefile, including the mapping
>>> >> >> > from the IR file to the real object file path, and exit. Additional
>>> >> >> > work needs to be done in the distributed build system itself to
>>> >> >> > distribute and dispatch the parallel backend jobs to the build
>>> >> >> > cluster.
>>> >> >> >
>>> >> >> >
>>> >> >> > e. Dependence Tracking and Incremental Compiles:
>>> >> >> >
>>> >> >> > In order to support build systems that stage from local disks or
>>> >> >> > network storage, the plugin will optionally support computation of
>>> >> >> > dependent sets of IR files that each module may import from. This
>>> >> >> > can be computed from profile data, if it exists, or from the symbol
>>> >> >> > table and heuristics if not. These dependence sets also enable
>>> >> >> > support for incremental backend compiles.
>>> >> >> >
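A sketch of computing dependence sets from the symbol table alone (no profile data), assuming each IR file lists the functions it defines and the ones it references but does not define. `needsRebuild` then shows the incremental rule: a backend reruns when its module, or any module it may import from, has changed. Both helpers are hypothetical, and the sets here hold direct dependences only; if imports can chain through imported bodies, they would need to be closed transitively.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical symbol-table view of one IR file.
struct ModuleSymbols {
  std::string Path;
  std::set<std::string> Defined;    // functions defined in this file
  std::set<std::string> Referenced; // functions called but not defined here
};

// For each module, the set of other IR files it may import from: any file
// defining a function the module references.
std::map<std::string, std::set<std::string>>
computeDependenceSets(const std::vector<ModuleSymbols> &Mods) {
  std::map<std::string, std::string> DefinedIn; // function -> defining file
  for (const ModuleSymbols &M : Mods)
    for (const std::string &F : M.Defined)
      DefinedIn.emplace(F, M.Path);
  std::map<std::string, std::set<std::string>> Deps;
  for (const ModuleSymbols &M : Mods) {
    std::set<std::string> &D = Deps[M.Path]; // entry exists even if empty
    for (const std::string &F : M.Referenced) {
      auto It = DefinedIn.find(F);
      if (It != DefinedIn.end() && It->second != M.Path)
        D.insert(It->second);
    }
  }
  return Deps;
}

// Incremental rule: rerun a module's backend if it, or any file in its
// dependence set, changed.
bool needsRebuild(const std::string &Path,
                  const std::map<std::string, std::set<std::string>> &Deps,
                  const std::set<std::string> &Changed) {
  if (Changed.count(Path))
    return true;
  auto It = Deps.find(Path);
  if (It == Deps.end())
    return false;
  for (const std::string &Dep : It->second)
    if (Changed.count(Dep))
      return true;
  return false;
}
```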
>>> >> >> >
>>> >> >> >
>>> >> >> > --
>>> >> >> > Teresa Johnson | Software Engineer | [hidden email] |
>>> >> >> > 408-460-2413
>>> >> >> >
>>> >> >> > _______________________________________________
>>> >> >> > LLVM Developers mailing list
>>> >> >> > [hidden email]         http://llvm.cs.uiuc.edu
>>> >> >> > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>> >> >>
>>> >> >
>>> >> >
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Teresa Johnson | Software Engineer | [hidden email] |
>>> >> 408-460-2413
>>> >>
>>>
>>>
>>>
>>> --
>>> Teresa Johnson | Software Engineer | [hidden email] | 408-460-2413
>>
>>
>
>


Re: RFC: ThinLTO Implementation Plan

David Blaikie


On Thu, May 14, 2015 at 12:53 PM, Eric Christopher <[hidden email]> wrote:


On Thu, May 14, 2015 at 11:34 AM Daniel Berlin <[hidden email]> wrote:
On Thu, May 14, 2015 at 11:14 AM, Eric Christopher <[hidden email]> wrote:
> I'm not sure this is a particularly great assumption to make.

Which part?

The binutils part :)
 

>  We have to
> support a lot of different build systems and tools and concentrating on
> something that just binutils uses isn't particularly friendly here.
I think you may have misunderstood
His point was exactly that they want to be transparent to *all of* these tools.
You are saying "we should be friendly to everyone". He is saying the same thing.
We should be friendly to everyone. The friendly way to do this is to
not require plugins for all of these tools to handle bitcode.

Hence, elf-wrapped bitcode.

Oh, I understood. I just don't know that I agree. Doing anything with the tools will require some knowledge of bitcode anyhow, or will need the plugin. I'm saying that as a baseline we should look at how to do this using the tools we've got, rather than wrapping things for no real gain.

That doesn't seem strictly true; consider the ar situation (which I'm led to believe is in use in our build system and, one would assume, others). With the symbol table included as proposed, ar can be used without any knowledge of the bitcode or need for a plugin.

It'd be helpful to have the scenarios we're trying to support with these tools & then weigh up the alternatives.
 
I've talked to Teresa a bit offline and we're going to talk more later (and discuss on the list), but there are some discussions about how to make this work either with just bitcode/llvm tools and so not requiring integration on all platforms. The latter is what I consider particularly friendly :)

-eric
 


> I also
> can't imagine how it's necessary for any of the lto aspects as currently
> written in the proposal.
>
> -eric
>
> On Thu, May 14, 2015 at 9:26 AM Xinliang David Li <[hidden email]>
> wrote:
>>
>> The design objective is to make thinLTO mostly transparent to binutils
>> tools to enable easy integration with any build system in the wild.
>> 'Pass-through' mode with 'ld -r' instead of the partial LTO mode is another
>> reason.
>>
>> David
>>
>> On Thu, May 14, 2015 at 7:30 AM, Teresa Johnson <[hidden email]>
>> wrote:
>>>
>>> On Thu, May 14, 2015 at 7:22 AM, Eric Christopher <[hidden email]>
>>> wrote:
>>> > So, what Alex is saying is that we have these tools as well and they
>>> > understand bitcode just fine, as well as every object format - not just
>>> > ELF.
>>> > :)
>>>
>>> Right, there are also LLVM specific versions (llvm-ar, llvm-nm) that
>>> handle bitcode similarly to the way the standard tool + plugin does.
>>> But the goal we are trying to achieve is to allow the standard system
>>> versions of the tools to handle these files without requiring a
>>> plugin. I know the LLVM tool handles other object formats, but I'm not
>>> sure how that helps here? We're not planning to replace those tools,
>>> just allow the standard system versions to handle the intermediate
>>> objects produced by ThinLTO.
>>>
>>> Thanks,
>>> Teresa
>>>
>>> >
>>> > -eric
>>> >
>>> >
>>> > On Thu, May 14, 2015, 6:55 AM Teresa Johnson <[hidden email]>
>>> > wrote:
>>> >>
>>> >> On Wed, May 13, 2015 at 11:23 PM, Xinliang David Li
>>> >> <[hidden email]> wrote:
>>> >> >
>>> >> >
>>> >> > On Wed, May 13, 2015 at 10:46 PM, Alex Rosenberg
>>> >> > <[hidden email]>
>>> >> > wrote:
>>> >> >>
>>> >> >> "ELF-wrapped bitcode" seems potentially controversial to me.
>>> >> >>
>>> >> >> What about ar, nm, and various ld implementations adds this
>>> >> >> requirement?
>>> >> >> What about the LLVM implementations of these tools is lacking?
>>> >> >
>>> >> >
>>> >> > Sorry, I cannot parse your questions properly. Can you make them
>>> >> > clearer?
>>> >>
>>> >> Alex is asking what the issue is with ar, nm, ld -r and regular
>>> >> bitcode that makes using elf-wrapped bitcode easier.
>>> >>
>>> >> The issue is that generally you need to provide a plugin to these
>>> >> tools in order for them to understand and handle bitcode files. We'd
>>> >> like standard tools to work without requiring a plugin as much as
>>> >> possible. And in some cases we want them to be handled differently from
>>> >> the way bitcode files are handled with the plugin.
>>> >>
>>> >> nm: Without a plugin, normal bitcode files are inscrutable. When
>>> >> provided the gold plugin it can emit the symbols.
>>> >>
>>> >> ar: Without a plugin, it will create an archive of bitcode files, but
>>> >> without an index, so it can't be handled by the linker even with a
>>> >> plugin on an -flto link. When ar is provided the gold plugin it does
>>> >> create an index, so the linker + gold plugin handle it appropriately
>>> >> on an -flto link.
>>> >>
>>> >> ld -r: Without a plugin, it fails when given bitcode inputs. When
>>> >> provided the gold plugin, it handles them but compiles them all the
>>> >> way through to ELF executable instructions via a partial LTO link.
>>> >> This is where we would like to differ in behavior (while also not
>>> >> requiring a plugin) with ELF-wrapped bitcode: we would like the ld -r
>>> >> output file to still contain ELF-wrapped bitcode, delaying the LTO
>>> >> until the full link step.
>>> >>
>>> >> Let me know if that helps address your concerns.
>>> >>
>>> >> Thanks,
>>> >> Teresa
>>> >>
>>> >> >
>>> >> > David
>>> >> >
>>> >> >>
>>> >> >>
>>> >> >> Alex
>>> >> >>
>>> >> >> > On May 13, 2015, at 7:44 PM, Teresa Johnson
>>> >> >> > <[hidden email]>
>>> >> >> > wrote:
>>> >> >> >
>>> >> >> > I've included below an RFC for implementing ThinLTO in LLVM,
>>> >> >> > looking
>>> >> >> > forward to feedback and questions.
>>> >> >> > Thanks!
>>> >> >> > Teresa
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > RFC to discuss plans for implementing ThinLTO upstream.
>>> >> >> > Background
>>> >> >> > can
>>> >> >> > be found in slides from EuroLLVM 2015:
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > https://drive.google.com/open?id=0B036uwnWM6RWWER1ZEl5SUNENjQ&authuser=0)
>>> >> >> > As described in the talk, we have a prototype implementation, and
>>> >> >> > would like to start staging patches upstream. This RFC describes
>>> >> >> > a
>>> >> >> > breakdown of the major pieces. We would like to commit upstream
>>> >> >> > gradually in several stages, with all functionality off by
>>> >> >> > default.
>>> >> >> > The core ThinLTO importing support and tuning will require
>>> >> >> > frequent
>>> >> >> > change and iteration during testing and tuning, and for that part
>>> >> >> > we
>>> >> >> > would like to commit rapidly (off by default). See the proposed
>>> >> >> > staged
>>> >> >> > implementation described in the Implementation Plan section.
>>> >> >> >
>>> >> >> >
>>> >> >> > ThinLTO Overview
>>> >> >> > ==============
>>> >> >> >
>>> >> >> > See the talk slides linked above for more details. The following
>>> >> >> > is a
>>> >> >> > high-level overview of the motivation.
>>> >> >> >
>>> >> >> > Cross Module Optimization (CMO) is an effective means for
>>> >> >> > improving
>>> >> >> > runtime performance, by extending the scope of optimizations
>>> >> >> > across
>>> >> >> > source module boundaries. Without CMO, the compiler is limited to
>>> >> >> > optimizing within the scope of single source modules. Two
>>> >> >> > solutions
>>> >> >> > for enabling CMO are Link-Time Optimization (LTO), which is
>>> >> >> > currently
>>> >> >> > supported in LLVM and GCC, and Lightweight-Interprocedural
>>> >> >> > Optimization (LIPO). However, each of these solutions has
>>> >> >> > limitations
>>> >> >> > that prevent it from being enabled by default. ThinLTO is a new
>>> >> >> > approach that attempts to address these limitations, with a goal
>>> >> >> > of
>>> >> >> > being enabled more broadly. ThinLTO is designed with many of the
>>> >> >> > same
>>> >> >> > principals as LIPO, and therefore its advantages, without any of
>>> >> >> > its
>>> >> >> > inherent weakness. Unlike in LIPO where the module group decision
>>> >> >> > is
>>> >> >> > made at profile training runtime, ThinLTO makes the decision at
>>> >> >> > compile time, but in a lazy mode that facilitates large scale
>>> >> >> > parallelism. The serial linker plugin phase is designed to be
>>> >> >> > razor
>>> >> >> > thin and blazingly fast. By default this step only does minimal
>>> >> >> > preparation work to enable the parallel lazy importing performed
>>> >> >> > later. ThinLTO aims to be scalable like a regular O2 build,
>>> >> >> > enabling
>>> >> >> > CMO on machines without large memory configurations, while also
>>> >> >> > integrating well with distributed build systems. Results from
>>> >> >> > early
>>> >> >> > prototyping on SPEC cpu2006 C++ benchmarks are in line with
>>> >> >> > expectations that ThinLTO can scale like O2 while enabling much
>>> >> >> > of
>>> >> >> > the
>>> >> >> > CMO performed during a full LTO build.
>>> >> >> >
>>> >> >> >
>>> >> >> > A ThinLTO build is divided into 3 phases, which are referred to
>>> >> >> > in
>>> >> >> > the
>>> >> >> > following implementation plan:
>>> >> >> >
>>> >> >> > phase-1: IR and Function Summary Generation (-c compile)
>>> >> >> > phase-2: Thin Linker Plugin Layer (thin archive linker step)
>>> >> >> > phase-3: Parallel Backend with Demand-Driven Importing
>>> >> >> >
>>> >> >> >
>>> >> >> > Implementation Plan
>>> >> >> > ================
>>> >> >> >
>>> >> >> > This section gives a high-level breakdown of the ThinLTO support
>>> >> >> > that
>>> >> >> > will be added, in roughly the order that the patches would be
>>> >> >> > staged.
>>> >> >> > The patches are divided into three stages. The first stage
>>> >> >> > contains a
>>> >> >> > minimal amount of preparation work that is not ThinLTO-specific.
>>> >> >> > The
>>> >> >> > second stage contains most of the infrastructure for ThinLTO,
>>> >> >> > which
>>> >> >> > will be off by default. The third stage includes
>>> >> >> > enhancements/improvements/tunings that can be performed after the
>>> >> >> > main
>>> >> >> > ThinLTO infrastructure is in.
>>> >> >> >
>>> >> >> > The second and third implementation stages will initially be very
>>> >> >> > volatile, requiring a lot of iterations and tuning with large
>>> >> >> > apps to
>>> >> >> > get stabilized. Therefore it will be important to do fast commits
>>> >> >> > for
>>> >> >> > these implementation stages.
>>> >> >> >
>>> >> >> >
>>> >> >> > 1. Stage 1: Preparation
>>> >> >> > -------------------------------
>>> >> >> >
>>> >> >> > The first planned sets of patches are enablers for ThinLTO work:
>>> >> >> >
>>> >> >> >
>>> >> >> > a. LTO directory structure:
>>> >> >> >
>>> >> >> > Restructure the LTO directory to remove circular dependence when
>>> >> >> > ThinLTO pass added. Because ThinLTO is being implemented as a SCC
>>> >> >> > pass
>>> >> >> > within Transforms/IPO, and leverages the LTOModule class for
>>> >> >> > linking
>>> >> >> > in functions from modules, IPO then requires the LTO library.
>>> >> >> > This
>>> >> >> > creates a circular dependence between LTO and IPO. To break that,
>>> >> >> > we
>>> >> >> > need to split the lib/LTO directory/library into lib/LTO/CodeGen
>>> >> >> > and
>>> >> >> > lib/LTO/Module, containing LTOCodeGenerator and LTOModule,
>>> >> >> > respectively. Only LTOCodeGenerator has a dependence on IPO,
>>> >> >> > removing
>>> >> >> > the circular dependence.
>>> >> >> >
>>> >> >> >
>>> >> >> > b. ELF wrapper generation support:
>>> >> >> >
>>> >> >> > Implement ELF wrapped bitcode writer. In order to more easily
>>> >> >> > interact
>>> >> >> > with tools such as $AR, $NM, and “$LD -r” we plan to emit the
>>> >> >> > phase-1
>>> >> >> > bitcode wrapped in ELF via the .llvmbc section, along with a
>>> >> >> > symbol
>>> >> >> > table. The goal is both to interact with these tools without
>>> >> >> > requiring
>>> >> >> > a plugin, and also to avoid doing partial LTO/ThinLTO across
>>> >> >> > files
>>> >> >> > linked with “$LD -r” (i.e. the resulting object file should still
>>> >> >> > contain ELF-wrapped bitcode to enable ThinLTO at the full link
>>> >> >> > step).
>>> >> >> > I will send a separate design document for these changes, but the
>>> >> >> > following is a high-level overview.
>>> >> >> >
>>> >> >> > Support was added to LLVM for reading ELF-wrapped bitcode
>>> >> >> > (http://reviews.llvm.org/rL218078), but there does not yet exist
>>> >> >> > support in LLVM/Clang for emitting bitcode wrapped in ELF. I plan
>>> >> >> > to
>>> >> >> > add support for optionally generating bitcode in an ELF file
>>> >> >> > containing a single .llvmbc section holding the bitcode.
>>> >> >> > Specifically,
>>> >> >> > the patch would add new options “emit-llvm-bc-elf” (object file)
>>> >> >> > and
>>> >> >> > corresponding “emit-llvm-elf” (textual assembly code equivalent).
>>> >> >> > Eventually these would be automatically triggered under
>>> >> >> > “-fthinlto
>>> >> >> > -c”
>>> >> >> > and “-fthinlto -S”, respectively.
>>> >> >> >
>>> >> >> > Additionally, a symbol table will be generated in the ELF file,
>>> >> >> > holding the function symbols within the bitcode. This facilitates
>>> >> >> > handling archives of the ELF-wrapped bitcode created with $AR,
>>> >> >> > since
>>> >> >> > the archive will have a symbol table as well. The archive symbol
>>> >> >> > table
>>> >> >> > enables gold to extract and pass to the plugin the constituent
>>> >> >> > ELF-wrapped bitcode files. To support the concatenated llvmbc
>>> >> >> > section
>>> >> >> > generated by “$LD -r”, some handling needs to be added to gold
>>> >> >> > and to
>>> >> >> > the backend driver to process each original module’s bitcode.
>>> >> >> >
>>> >> >> > The function index/summary will later be added as a special ELF
>>> >> >> > section alongside the .llvmbc sections.
>>> >> >> >
>>> >> >> >
>>> >> >> > 2. Stage 2: ThinLTO Infrastructure
>>> >> >> > ----------------------------------------------
>>> >> >> >
>>> >> >> > The next set of patches adds the base implementation of the
>>> >> >> > ThinLTO
>>> >> >> > infrastructure, specifically those required to make ThinLTO
>>> >> >> > functional
>>> >> >> > and generate correct but not necessarily high-performing
>>> >> >> > binaries. It
>>> >> >> > also does not include support to make debug support under -g
>>> >> >> > efficient
>>> >> >> > with ThinLTO.
>>> >> >> >
>>> >> >> >
>>> >> >> > a. Clang/LLVM/gold linker options:
>>> >> >> >
>>> >> >> > An early set of clang/llvm patches is needed to provide options
>>> >> >> > to
>>> >> >> > enable ThinLTO (off by default), so that the rest of the
>>> >> >> > implementation can be disabled by default as it is added.
>>> >> >> > Specifically, clang options -fthinlto (used instead of -flto)
>>> >> >> > will
>>> >> >> > cause clang to invoke the phase-1 emission of LLVM bitcode and
>>> >> >> > function summary/index on a compile step, and pass the
>>> >> >> > appropriate
>>> >> >> > option to the gold plugin on a link step. The -thinlto option
>>> >> >> > will be
>>> >> >> > added to the gold plugin and llvm-lto tool to launch the phase-2
>>> >> >> > thin
>>> >> >> > archive step. The -thinlto option will also be added to the ‘opt’
>>> >> >> > tool
>>> >> >> > to invoke it as a phase-3 parallel backend instance.
>>> >> >> >
>>> >> >> >
>>> >> >> > b. Thin-archive linking support in Gold plugin and llvm-lto:
>>> >> >> >
>>> >> >> > Under the new plugin option (see above), the plugin needs to
>>> >> >> > perform
>>> >> >> > the phase-2 (thin archive) link which simply emits a combined
>>> >> >> > function
>>> >> >> > map from the linked modules, without actually performing the
>>> >> >> > normal
>>> >> >> > link. Corresponding support should be added to the standalone
>>> >> >> > llvm-lto
>>> >> >> > tool to enable testing/debugging without involving the linker and
>>> >> >> > plugin.
>>> >> >> >
>>> >> >> >
>>> >> >> > c. ThinLTO backend support:
>>> >> >> >
>>> >> >> > Support for invoking a phase-3 backend invocation (including
>>> >> >> > importing) on a module should be added to the ‘opt’ tool under
>>> >> >> > the
>>> >> >> > new
>>> >> >> > option. The main change under the option is to instantiate a
>>> >> >> > Linker
>>> >> >> > object used to manage the process of linking imported functions
>>> >> >> > into
>>> >> >> > the module, efficient read of the combined function map, and
>>> >> >> > enable
>>> >> >> > the ThinLTO import pass.
>>> >> >> >
>>> >> >> >
>>> >> >> > d. Function index/summary support:
>>> >> >> >
>>> >> >> > This includes infrastructure for writing and reading the function
>>> >> >> > index/summary section. As noted earlier this will be encoded in a
>>> >> >> > special ELF section within the module, alongside the .llvmbc
>>> >> >> > section
>>> >> >> > containing the bitcode. The thin archive generated by phase-2 of
>>> >> >> > ThinLTO simply contains all of the function index/summary
>>> >> >> > sections
>>> >> >> > across the linked modules, organized for efficient function
>>> >> >> > lookup.
>>> >> >> >
>>> >> >> > Each function available for importing from the module contains an
>>> >> >> > entry in the module’s function index/summary section and in the
>>> >> >> > resulting combined function map. Each function entry contains
>>> >> >> > that
>>> >> >> > function’s offset within the bitcode file, used to efficiently
>>> >> >> > locate
>>> >> >> > and quickly import just that function. The entry also contains
>>> >> >> > summary
>>> >> >> > information (e.g. basic information determined during parsing
>>> >> >> > such as
>>> >> >> > the number of instructions in the function), that will be used to
>>> >> >> > help
>>> >> >> > guide later import decisions. Because the contents of this
>>> >> >> > section
>>> >> >> > will change frequently during ThinLTO tuning, it should also be
>>> >> >> > marked
>>> >> >> > with a version id for backwards compatibility or version
>>> >> >> > checking.
>>> >> >> >
>>> >> >> >
>>> >> >> > e. ThinLTO importing support:
>>> >> >> >
>>> >> >> > Support for the mechanics of importing functions from other modules,
>>> >> >> > which can go in gradually as a set of patches since it will be off by
>>> >> >> > default. Separate patches can include:
>>> >> >> >
>>> >> >> > - BitcodeReader changes to use the function index to import/deserialize
>>> >> >> > the single function of interest (small changes, leverages existing lazy
>>> >> >> > streamer support).
>>> >> >> >
>>> >> >> > - Minor LTOModule changes to pass the ThinLTO function to import and
>>> >> >> > its index into the bitcode reader.
>>> >> >> >
>>> >> >> > - Marking of imported functions (for use in ThinLTO-specific symbol
>>> >> >> > linking and global DCE, for example). This can be in-memory initially,
>>> >> >> > but IR support may be required in order to support streaming bitcode
>>> >> >> > out and back in again after importing.
>>> >> >> >
>>> >> >> > - ModuleLinker changes to do ThinLTO-specific symbol linking and
>>> >> >> > static promotion when necessary. The linkage type of imported
>>> >> >> > functions changes to AvailableExternallyLinkage, for example. Statics
>>> >> >> > must be promoted in certain cases, and renamed in consistent ways.
>>> >> >> >
>>> >> >> > - GlobalDCE changes to support removing imported functions that were
>>> >> >> > not inlined (very small changes to existing pass logic).
>>> >> >> >
>>> >> >> >
>>> >> >> > f. ThinLTO Import Driver SCC pass:
>>> >> >> >
>>> >> >> > Adds Transforms/IPO/ThinLTO.cpp with a framework for doing ThinLTO via
>>> >> >> > an SCC pass, enabled only under -fthinlto options. The pass includes
>>> >> >> > utilizing the thin archive (global function index/summary), import
>>> >> >> > decision heuristics, invocation of LTOModule/ModuleLinker routines
>>> >> >> > that perform the import, and any necessary callgraph updates and
>>> >> >> > verification.
>>> >> >> >
>>> >> >> >
>>> >> >> > g. Backend Driver:
>>> >> >> >
>>> >> >> > For a single-node build, the gold plugin can simply write a makefile
>>> >> >> > and fork the parallel backend instances directly via parallel make.
>>> >> >> >
>>> >> >> >
>>> >> >> > 3. Stage 3: ThinLTO Tuning and Enhancements
>>> >> >> > ----------------------------------------------------------------
>>> >> >> >
>>> >> >> > This refers to the patches that are not required for ThinLTO to work,
>>> >> >> > but rather improve compile time, memory usage, run-time performance,
>>> >> >> > and usability.
>>> >> >> >
>>> >> >> >
>>> >> >> > a. Lazy Debug Metadata Linking:
>>> >> >> >
>>> >> >> > The prototype implementation included lazy importing of module-level
>>> >> >> > metadata during the ThinLTO pass finalization (i.e., after all function
>>> >> >> > importing is complete). This actually applies to all module-level
>>> >> >> > metadata, not just debug metadata, although debug metadata is the
>>> >> >> > largest. This can be added as a separate set of patches, involving
>>> >> >> > changes to BitcodeReader, ValueMapper, and ModuleLinker.
>>> >> >> >
>>> >> >> >
>>> >> >> > b. Import Tuning:
>>> >> >> >
>>> >> >> > Tuning the import strategy will be an iterative process that will
>>> >> >> > continue to be refined over time. It involves several different types
>>> >> >> > of changes: adding support for recording additional metrics in the
>>> >> >> > function summary, such as profile data and optional heavier-weight IPA
>>> >> >> > analyses, and tuning the import heuristics based on the summary and
>>> >> >> > callsite context.
>>> >> >> >
>>> >> >> >
>>> >> >> > c. Combined Function Map Pruning:
>>> >> >> >
>>> >> >> > The combined function map can be pruned of functions that are unlikely
>>> >> >> > to benefit from being imported. For example, during the phase-2 thin
>>> >> >> > archive plugin step we can safely omit large and (with profile data)
>>> >> >> > cold functions, which are unlikely to benefit from being inlined.
>>> >> >> > Additionally, all but one copy of comdat functions can be suppressed.
>>> >> >> >
>>> >> >> >
>>> >> >> > d. Distributed Build System Integration:
>>> >> >> >
>>> >> >> > For a distributed build system, the gold plugin should write the
>>> >> >> > parallel backend invocations into a makefile, including the mapping
>>> >> >> > from the IR file to the real object file path, and exit. Additional
>>> >> >> > work needs to be done in the distributed build system itself to
>>> >> >> > distribute and dispatch the parallel backend jobs to the build
>>> >> >> > cluster.
>>> >> >> >
>>> >> >> >
>>> >> >> > e. Dependence Tracking and Incremental Compiles:
>>> >> >> >
>>> >> >> > In order to support build systems that stage from local disks or
>>> >> >> > network storage, the plugin will optionally support computation of
>>> >> >> > dependent sets of IR files that each module may import from. This can
>>> >> >> > be computed from profile data, if it exists, or from the symbol table
>>> >> >> > and heuristics if not. These dependence sets also enable support for
>>> >> >> > incremental backend compiles.
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > --
>>> >> >> > Teresa Johnson | Software Engineer | [hidden email] | 408-460-2413
>>> >> >> >
>>> >> >> > _______________________________________________
>>> >> >> > LLVM Developers mailing list
>>> >> >> > [hidden email]         http://llvm.cs.uiuc.edu
>>> >> >> > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
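The function index/summary section described in the quoted RFC above can be pictured as a map from function name to a small record. The C++ below is only an illustrative sketch under stated assumptions: the type names (`FunctionSummary`, `CombinedFunctionMap`), the fields, and the single-integer version check are hypothetical and are not LLVM's API or the prototype's on-disk format. It shows how an entry's bitcode offset lets a backend locate one function without parsing the whole module, and how a version id lets a reader refuse a map written in a layout it does not understand.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Hypothetical format version; bumped whenever the summary layout changes
// during tuning, so that stale maps are rejected instead of misread.
constexpr uint32_t kSummaryVersion = 1;

// One entry per importable function (hypothetical layout).
struct FunctionSummary {
  std::string moduleId;    // bitcode file that defines the function
  uint64_t bitcodeOffset;  // where the function body starts in that file
  unsigned instCount;      // cheap size metric recorded during parsing
};

// Combined function map written by the phase-2 thin-archive step.
class CombinedFunctionMap {
 public:
  explicit CombinedFunctionMap(uint32_t version) : version_(version) {}

  void add(const std::string &name, FunctionSummary summary) {
    entries_.emplace(name, std::move(summary));
  }

  // Returns the entry for `name`, or nothing if the function is absent or
  // the map was written with a version this reader does not understand.
  std::optional<FunctionSummary> lookup(const std::string &name) const {
    if (version_ != kSummaryVersion) return std::nullopt;
    auto it = entries_.find(name);
    if (it == entries_.end()) return std::nullopt;
    return it->second;
  }

 private:
  uint32_t version_;
  std::map<std::string, FunctionSummary> entries_;
};
```

A backend would then seek to `bitcodeOffset` in the defining module's bitcode to deserialize just that function; that step is outside this sketch.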

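The combined function map pruning in Stage 3c of the quoted RFC (dropping large functions, cold functions when profile data is available, and all but one copy of each comdat function) can be sketched as a single filter. This is a hypothetical C++ illustration: the `Candidate` record, the `maxInsts` size threshold, and the rule that a zero profile count means "cold" are assumptions made for the example, not values taken from the RFC.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <set>
#include <string>
#include <vector>

// Hypothetical view of one combined-map entry, as needed for pruning.
struct Candidate {
  std::string name;
  unsigned instCount;                    // size metric from the summary
  std::optional<uint64_t> profileCount;  // entry count, if profiled
  std::string comdatKey;                 // empty when not in a comdat
};

// Keeps only entries worth offering for import. Assumptions for this
// sketch: "large" means more than maxInsts instructions, and "cold" means
// profile data exists and the entry count is zero.
std::vector<Candidate> pruneCombinedMap(const std::vector<Candidate> &all,
                                        unsigned maxInsts) {
  std::vector<Candidate> kept;
  std::set<std::string> seenComdats;
  for (const Candidate &c : all) {
    if (c.instCount > maxInsts) continue;                  // too large
    if (c.profileCount && *c.profileCount == 0) continue;  // cold
    // Keep the first copy of each comdat function; drop the rest.
    if (!c.comdatKey.empty() && !seenComdats.insert(c.comdatKey).second)
      continue;
    kept.push_back(c);
  }
  return kept;
}
```

Keeping the first comdat copy encountered is an arbitrary but deterministic choice; since all copies of a comdat function are expected to be equivalent, any one of them could serve as the import source.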
Re: RFC: ThinLTO Impementation Plan

Xinliang David Li-2
In reply to this post by Robinson, Paul-3


On Thu, May 14, 2015 at 12:46 PM, Robinson, Paul <[hidden email]> wrote:

The friendliest tactic would be to support all object-file formats, not just ELF?


In general it should be wrapped in the native object format, with ELF as the starting point.

David
 

--paulr

 

From: [hidden email] [mailto:[hidden email]] On Behalf Of Xinliang David Li
Sent: Thursday, May 14, 2015 11:54 AM
To: Eric Christopher
Cc: <[hidden email]> List
Subject: Re: [LLVMdev] RFC: ThinLTO Impementation Plan

 

The end goal is to be able to turn on ThinLTO as easily as optimization levels like -O2 or -O3; we want friendliness, very much :)

 

David

 

 

On Thu, May 14, 2015 at 11:14 AM, Eric Christopher <[hidden email]> wrote:

I'm not sure this is a particularly great assumption to make. We have to support a lot of different build systems and tools and concentrating on something that just binutils uses isn't particularly friendly here. I also can't imagine how it's necessary for any of the lto aspects as currently written in the proposal.

 

-eric

 

On Thu, May 14, 2015 at 9:26 AM Xinliang David Li <[hidden email]> wrote:

The design objective is to make ThinLTO mostly transparent to binutils tools, enabling easy integration with any build system in the wild. Supporting a 'pass-through' mode for 'ld -r', instead of the partial LTO mode, is another reason.

 

David

 

On Thu, May 14, 2015 at 7:30 AM, Teresa Johnson <[hidden email]> wrote:

On Thu, May 14, 2015 at 7:22 AM, Eric Christopher <[hidden email]> wrote:
> So, what Alex is saying is that we have these tools as well and they
> understand bitcode just fine, as well as every object format - not just ELF.
> :)

Right, there are also LLVM specific versions (llvm-ar, llvm-nm) that
handle bitcode similarly to the way the standard tool + plugin does.
But the goal we are trying to achieve is to allow the standard system
versions of the tools to handle these files without requiring a
plugin. I know the LLVM tools handle other object formats, but I'm not
sure how that helps here? We're not planning to replace those tools,
just allow the standard system versions to handle the intermediate
objects produced by ThinLTO.

Thanks,
Teresa


>
> -eric
>
>
> On Thu, May 14, 2015, 6:55 AM Teresa Johnson <[hidden email]> wrote:
>>
>> On Wed, May 13, 2015 at 11:23 PM, Xinliang David Li
>> <[hidden email]> wrote:
>> >
>> >
>> > On Wed, May 13, 2015 at 10:46 PM, Alex Rosenberg <[hidden email]>
>> > wrote:
>> >>
>> >> "ELF-wrapped bitcode" seems potentially controversial to me.
>> >>
>> >> What about ar, nm, and various ld implementations adds this
>> >> requirement?
>> >> What about the LLVM implementations of these tools is lacking?
>> >
>> >
>> > Sorry, I cannot parse your questions properly. Can you make them clearer?
>>
>> Alex is asking what the issue is with ar, nm, ld -r and regular
>> bitcode that makes using elf-wrapped bitcode easier.
>>
>> The issue is that generally you need to provide a plugin to these
>> tools in order for them to understand and handle bitcode files. We'd
>> like standard tools to work without requiring a plugin as much as
>> possible. And in some cases we want them to be handled differently from
>> the way bitcode files are handled with the plugin.
>>
>> nm: Without a plugin, normal bitcode files are inscrutable. When
>> provided the gold plugin it can emit the symbols.
>>
>> ar: Without a plugin, it will create an archive of bitcode files, but
>> without an index, so it can't be handled by the linker even with a
>> plugin on an -flto link. When ar is provided the gold plugin it does
>> create an index, so the linker + gold plugin handle it appropriately
>> on an -flto link.
>>
>> ld -r: Without a plugin, it fails when provided bitcode inputs. When
>> provided the gold plugin, it handles them but compiles them all the
>> way through to ELF executable instructions via a partial LTO link.
>> This is where we would like to differ in behavior (while also not
>> requiring a plugin) with ELF-wrapped bitcode: we would like the ld -r
>> output file to still contain ELF-wrapped bitcode, delaying the LTO
>> until the full link step.
>>
>> Let me know if that helps address your concerns.
>>
>> Thanks,
>> Teresa
>>
>> >
>> > David
>> >
>> >>
>> >>
>> >> Alex
>> >>
>> >> > On May 13, 2015, at 7:44 PM, Teresa Johnson <[hidden email]>
>> >> > wrote:
>> >> >

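Part of the disagreement in this thread is about what a standard, plugin-less tool sees when it opens each kind of file. As a small illustration, the C++ sketch below classifies a buffer by its leading magic bytes: raw LLVM bitcode starts with 'B', 'C', 0xC0, 0xDE; LLVM's separate bitcode wrapper header starts with the value 0x0B17C0DE stored little-endian; and an ELF object starts with 0x7F 'E' 'L' 'F'. ELF-wrapped bitcode (bitcode in a .llvmbc section, as the RFC proposes) carries the ELF magic, which is why unmodified binutils can at least parse its ELF structure and symbol table. The enum and function names are hypothetical, and real tools inspect far more than four bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum class FileKind { RawBitcode, BitcodeWrapper, ELF, Unknown };

// Classifies a file by its first four bytes only (a simplification).
FileKind classify(const std::vector<uint8_t> &bytes) {
  if (bytes.size() < 4) return FileKind::Unknown;
  auto startsWith = [&bytes](uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3) {
    return bytes[0] == b0 && bytes[1] == b1 && bytes[2] == b2 &&
           bytes[3] == b3;
  };
  if (startsWith('B', 'C', 0xC0, 0xDE)) return FileKind::RawBitcode;
  // 0x0B17C0DE stored little-endian.
  if (startsWith(0xDE, 0xC0, 0x17, 0x0B)) return FileKind::BitcodeWrapper;
  if (startsWith(0x7F, 'E', 'L', 'F')) return FileKind::ELF;
  return FileKind::Unknown;
}
```

A tool that only understands ELF would take the third branch for ELF-wrapped bitcode and the fallback branch for raw bitcode, which matches the "inscrutable without a plugin" behavior described for nm above.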
Re: RFC: ThinLTO Impementation Plan

Eric Christopher
In reply to this post by David Blaikie


On Thu, May 14, 2015 at 1:11 PM David Blaikie <[hidden email]> wrote:
On Thu, May 14, 2015 at 12:53 PM, Eric Christopher <[hidden email]> wrote:


On Thu, May 14, 2015 at 11:34 AM Daniel Berlin <[hidden email]> wrote:
On Thu, May 14, 2015 at 11:14 AM, Eric Christopher <[hidden email]> wrote:
> I'm not sure this is a particularly great assumption to make.

Which part?

The binutils part :)
 

>  We have to
> support a lot of different build systems and tools and concentrating on
> something that just binutils uses isn't particularly friendly here.
I think you may have misunderstood.
His point was exactly that they want to be transparent to *all of* these tools.
You are saying "we should be friendly to everyone". He is saying the same thing.
We should be friendly to everyone. The friendly way to do this is to
not require all of these tools to build plugins to handle bitcode.

Hence, elf-wrapped bitcode.

Oh, I understood. I just don't know that I agree. Doing anything with the tools will require some knowledge of bitcode anyhow, or the plugin. I'm saying that as a baseline we should start by looking at how to do this using the tools we've got, rather than wrapping things for no real gain.

That doesn't seem strictly true; consider the ar situation (which I'm led to believe is in use in our build system and, one would assume, in others). With the symbol table included as proposed, ar can be used without any knowledge of the bitcode or need for a plugin.


For some bits, sure. Optimizing for ar seems a bit silly; why not 'ld -r'? ;)
 
It'd be helpful to have the scenarios we're trying to support with these tools & then weigh up the alternatives.
 

Agreed. The ar situation is interesting because one thing we discussed after you wandered off was just adding a table of contents (ToC) section to bitcode as it is, and then having the tools handle that. That would seem to accomplish at least the goals as I've seen them up to this point, without worrying too much.

At any rate, I think this aspect of the proposal needs a bit of discussion and some mapping out of the pros and cons here.

-eric
 
I've talked to Teresa a bit offline and we're going to talk more later (and discuss on the list), but there are some discussions about how to make this work with just bitcode and the LLVM tools, and so without requiring integration on all platforms. The latter is what I consider particularly friendly :)

-eric
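The ar scenario debated above hinges on the archive index: a map from each defined symbol to the member that defines it, which a linker consults to decide which members to pull in. The C++ below is a simplified, hypothetical model of that lookup, not how GNU ar, gold, or any real linker is implemented. It assumes each member exposes its defined and referenced symbols, which is the information the proposed ELF symbol table would make available for ELF-wrapped bitcode without a plugin.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical archive member: the symbols its symbol table defines, and
// the symbols it references without defining.
struct Member {
  std::string name;
  std::vector<std::string> defines;
  std::vector<std::string> references;
};

// Archive index: symbol name -> position of the defining member.
std::map<std::string, std::size_t> buildIndex(
    const std::vector<Member> &members) {
  std::map<std::string, std::size_t> index;
  for (std::size_t i = 0; i < members.size(); ++i)
    for (const std::string &sym : members[i].defines)
      index.emplace(sym, i);  // emplace keeps the first definition seen
  return index;
}

// Simplified extraction loop: pull in each member that the index says
// defines a still-undefined symbol, then follow that member's references.
std::set<std::string> membersToExtract(const std::vector<Member> &members,
                                       std::vector<std::string> undefined) {
  const std::map<std::string, std::size_t> index = buildIndex(members);
  std::set<std::string> visited;    // symbols already looked up
  std::set<std::string> extracted;  // names of members pulled in
  while (!undefined.empty()) {
    std::string sym = undefined.back();
    undefined.pop_back();
    if (!visited.insert(sym).second) continue;
    auto it = index.find(sym);
    if (it == index.end()) continue;  // not defined in this archive
    const Member &m = members[it->second];
    if (extracted.insert(m.name).second)
      undefined.insert(undefined.end(), m.references.begin(),
                       m.references.end());
  }
  return extracted;
}
```

For example, with members a.o (defines foo, references bar), b.o (defines bar), and c.o (defines baz), starting from an undefined foo extracts a.o and then b.o, but never c.o. Without an index, the linker has no cheap way to know which member defines foo.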
 


> I also
> can't imagine how it's necessary for any of the lto aspects as currently
> written in the proposal.
>
> -eric
>
> On Thu, May 14, 2015 at 9:26 AM Xinliang David Li <[hidden email]>
> wrote:
>>
>> The design objective is to make ThinLTO mostly transparent to binutils
>> tools, enabling easy integration with any build system in the wild.
>> Supporting a 'pass-through' mode for 'ld -r', instead of the partial LTO
>> mode, is another reason.
>>
>> David
>>
>> On Thu, May 14, 2015 at 7:30 AM, Teresa Johnson <[hidden email]>
>> wrote:
>>>
>>> On Thu, May 14, 2015 at 7:22 AM, Eric Christopher <[hidden email]>
>>> wrote:
>>> > So, what Alex is saying is that we have these tools as well and they
>>> > understand bitcode just fine, as well as every object format - not just
>>> > ELF.
>>> > :)
>>>
>>> Right, there are also LLVM specific versions (llvm-ar, llvm-nm) that
>>> handle bitcode similarly to the way the standard tool + plugin does.
>>> But the goal we are trying to achieve is to allow the standard system
>>> versions of the tools to handle these files without requiring a
>>> plugin. I know the LLVM tools handle other object formats, but I'm not
>>> sure how that helps here? We're not planning to replace those tools,
>>> just allow the standard system versions to handle the intermediate
>>> objects produced by ThinLTO.
>>>
>>> Thanks,
>>> Teresa
>>>
>>> >
>>> > -eric
>>> >
>>> >
>>> > On Thu, May 14, 2015, 6:55 AM Teresa Johnson <[hidden email]>
>>> > wrote:
>>> >>
>>> >> On Wed, May 13, 2015 at 11:23 PM, Xinliang David Li
>>> >> <[hidden email]> wrote:
>>> >> >
>>> >> >
>>> >> > On Wed, May 13, 2015 at 10:46 PM, Alex Rosenberg
>>> >> > <[hidden email]>
>>> >> > wrote:
>>> >> >>
>>> >> >> "ELF-wrapped bitcode" seems potentially controversial to me.
>>> >> >>
>>> >> >> What about ar, nm, and various ld implementations adds this
>>> >> >> requirement?
>>> >> >> What about the LLVM implementations of these tools is lacking?
>>> >> >
>>> >> >
>>> >> > Sorry, I cannot parse your questions properly. Can you make them
>>> >> > clearer?
>>> >>
>>> >> Alex is asking what the issue is with ar, nm, ld -r and regular
>>> >> bitcode that makes using elf-wrapped bitcode easier.
>>> >>
>>> >> The issue is that generally you need to provide a plugin to these
>>> >> tools in order for them to understand and handle bitcode files. We'd
>>> >> like standard tools to work without requiring a plugin as much as
>>> >> possible. And in some cases we want them to be handled differently than
>>> >> the way bitcode files are handled with the plugin.
>>> >>
>>> >> nm: Without a plugin, normal bitcode files are inscrutable. When
>>> >> provided the gold plugin, it can emit the symbols.
>>> >>
>>> >> ar: Without a plugin, it will create an archive of bitcode files, but
>>> >> without an index, so it can't be handled by the linker even with a
>>> >> plugin on an -flto link. When ar is provided the gold plugin, it does
>>> >> create an index, so the linker + gold plugin handle it appropriately
>>> >> on an -flto link.
>>> >>
>>> >> ld -r: Without a plugin, it fails when provided bitcode inputs. When
>>> >> provided the gold plugin, it handles them but compiles them all the
>>> >> way through to ELF executable instructions via a partial LTO link.
>>> >> This is where we would like to differ in behavior (while also not
>>> >> requiring a plugin) with ELF-wrapped bitcode: we would like the ld -r
>>> >> output file to still contain ELF-wrapped bitcode, delaying the LTO
>>> >> until the full link step.
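A toy model makes the ar point above concrete. The Python sketch below is purely illustrative and assumes a simplified archive in which each member either exposes a symbol table that a plugin-less tool can read (a native object, or ELF-wrapped bitcode carrying a symbol table) or does not (raw bitcode with no plugin loaded). Member, build_index and resolve are hypothetical names, not part of any ar or gold API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Member:
    """Toy archive member (hypothetical model, not a real ar format)."""
    name: str
    defined: List[str] = field(default_factory=list)
    # False models raw bitcode given to ar without a plugin: the tool
    # cannot see which symbols the member defines.
    readable_symtab: bool = True


def build_index(members: List[Member]) -> Dict[str, str]:
    """Map each visible defined symbol to the member that defines it."""
    index: Dict[str, str] = {}
    for m in members:
        if not m.readable_symtab:
            continue  # nothing can be recorded for an opaque member
        for sym in m.defined:
            index.setdefault(sym, m.name)  # first definition wins
    return index


def resolve(undefined: Set[str], index: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Report which member a linker would extract for each undefined symbol."""
    return {sym: index.get(sym) for sym in sorted(undefined)}


raw = [Member("a.o", ["foo"], readable_symtab=False)]
wrapped = [Member("a.o", ["foo"], readable_symtab=True)]
print(resolve({"foo"}, build_index(raw)))      # {'foo': None}
print(resolve({"foo"}, build_index(wrapped)))  # {'foo': 'a.o'}
```

The only difference between the two runs is whether the member's symbols are visible when the index is built, which is the property the wrapper's symbol table is meant to provide without a plugin.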
>>> >>
>>> >> Let me know if that helps address your concerns.
>>> >>
>>> >> Thanks,
>>> >> Teresa
>>> >>
>>> >> >
>>> >> > David
>>> >> >
>>> >> >>
>>> >> >>
>>> >> >> Alex
>>> >> >>
>>> >> >>
>>> >> >> _______________________________________________
>>> >> >> LLVM Developers mailing list
>>> >> >> [hidden email]         http://llvm.cs.uiuc.edu
>>> >> >> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>> >> >
>>> >> >
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Teresa Johnson | Software Engineer | [hidden email] |
>>> >> 408-460-2413
>>> >>
>>>
>>>
>>>
>>> --
>>> Teresa Johnson | Software Engineer | [hidden email] | 408-460-2413
>>
>>
>

Re: RFC: ThinLTO Impementation Plan

Teresa Johnson
On Thu, May 14, 2015 at 1:14 PM, Xinliang David Li <[hidden email]> wrote:

>
>
> On Thu, May 14, 2015 at 12:46 PM, Robinson, Paul
> <[hidden email]> wrote:
>>
>> The friendliest tactic would be to support all object-file formats, not
>> just ELF?
>
>
> In general, it should be wrapped in the native object format -- and ELF will be a
> starting point.

Yes, sorry, I should have generalized this to the Native Object File
Wrapper format, à la
http://llvm.org/docs/BitCodeFormat.html#native-object-file-wrapper-format.
I was prototyping with ELF, but the writer support should be similar
for other formats supported by LLVM.
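As a rough illustration of the containers being discussed, the Python sketch below guesses a file's format from its first bytes. The magic values for raw bitcode ('BC' followed by 0xC0DE) and for the bitcode wrapper header (the little-endian 32-bit value 0x0B17C0DE) are the ones given on the Bitcode File Format page linked above; the ELF and 64-bit little-endian Mach-O values are those formats' standard magic numbers. The native object file wrapper is a different mechanism from the bitcode wrapper header: it stores the bitcode in a section of a normal object file (.llvmbc on ELF), so such a file classifies as ELF here. The function classify is hypothetical; LLVM's own tools use their identify_magic helper for this kind of detection.

```python
import struct

RAW_BITCODE_MAGIC = b"BC\xc0\xde"       # 'B', 'C', 0xC0, 0xDE
WRAPPER_MAGIC = 0x0B17C0DE              # bitcode wrapper header, little-endian u32
ELF_MAGIC = b"\x7fELF"
MACHO64_LE_MAGIC = b"\xcf\xfa\xed\xfe"  # MH_MAGIC_64 (0xFEEDFACF) stored little-endian


def classify(data: bytes) -> str:
    """Guess the container format of a file from its leading bytes."""
    if data.startswith(RAW_BITCODE_MAGIC):
        return "raw bitcode"  # opaque to nm/ar unless they load a plugin
    if len(data) >= 4 and struct.unpack_from("<I", data)[0] == WRAPPER_MAGIC:
        return "bitcode wrapper"
    if data.startswith(ELF_MAGIC):
        return "ELF"  # with the native wrapper, bitcode sits in .llvmbc
    if data.startswith(MACHO64_LE_MAGIC):
        return "Mach-O (64-bit)"
    return "unknown"


for sample in (b"BC\xc0\xde\x35\x14", struct.pack("<I", WRAPPER_MAGIC), b"\x7fELF\x02\x01"):
    print(classify(sample))
```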

Thanks,
Teresa

>
> David
>
>>
>> --paulr
>>
>>
>>
>> From: [hidden email] [mailto:[hidden email]] On
>> Behalf Of Xinliang David Li
>> Sent: Thursday, May 14, 2015 11:54 AM
>> To: Eric Christopher
>> Cc: <[hidden email]> List
>> Subject: Re: [LLVMdev] RFC: ThinLTO Impementation Plan
>>
>>
>>
>> The end goal is the ability to turn on thin-lto as easily as turning on
>> optimizations like -O2 or -O3 -- we want friendliness, very much :)
>>
>>
>>
>> David
>>
>>
>>
>>
>>
>> On Thu, May 14, 2015 at 11:14 AM, Eric Christopher <[hidden email]>
>> wrote:
>>
>> I'm not sure this is a particularly great assumption to make. We have to
>> support a lot of different build systems and tools and concentrating on
>> something that just binutils uses isn't particularly friendly here. I also
>> can't imagine how it's necessary for any of the lto aspects as currently
>> written in the proposal.
>>
>>
>>
>> -eric
>>
>>
>>
>> On Thu, May 14, 2015 at 9:26 AM Xinliang David Li <[hidden email]>
>> wrote:
>>
>> The design objective is to make thinLTO mostly transparent to binutils
>> tools to enable easy integration with any build system in the wild.
>> 'Pass-through' mode with 'ld -r' instead of the partial LTO mode is another
>> reason.
>>
>>
>>
>> David
>>
>>
>>
>> On Thu, May 14, 2015 at 7:30 AM, Teresa Johnson <[hidden email]>
>> wrote:
>>
>> On Thu, May 14, 2015 at 7:22 AM, Eric Christopher <[hidden email]>
>> wrote:
>> > So, what Alex is saying is that we have these tools as well and they
>> > understand bitcode just fine, as well as every object format - not just
>> > ELF.
>> > :)
>>
>> Right, there are also LLVM specific versions (llvm-ar, llvm-nm) that
>> handle bitcode similarly to the way the standard tool + plugin does.
>> But the goal we are trying to achieve is to allow the standard system
>> versions of the tools to handle these files without requiring a
>> plugin. I know the LLVM tool handles other object formats, but I'm not
>> sure how that helps here? We're not planning to replace those tools,
>> just allow the standard system versions to handle the intermediate
>> objects produced by ThinLTO.
>>
>> Thanks,
>> Teresa
>>
>>
>> >
>> > -eric
>> >
>> >
>> > On Thu, May 14, 2015, 6:55 AM Teresa Johnson <[hidden email]>
>> > wrote:
>> >>
>> >> On Wed, May 13, 2015 at 11:23 PM, Xinliang David Li
>> >> <[hidden email]> wrote:
>> >> >
>> >> >
>> >> > On Wed, May 13, 2015 at 10:46 PM, Alex Rosenberg
>> >> > <[hidden email]>
>> >> > wrote:
>> >> >>
>> >> >> "ELF-wrapped bitcode" seems potentially controversial to me.
>> >> >>
>> >> >> What about ar, nm, and various ld implementations adds this
>> >> >> requirement?
>> >> >> What about the LLVM implementations of these tools is lacking?
>> >> >
>> >> >
>> >> > Sorry, I cannot parse your questions properly. Can you make them
>> >> > clearer?
>> >>
>> >> Alex is asking what the issue is with ar, nm, ld -r and regular
>> >> bitcode that makes using elf-wrapped bitcode easier.
>> >>
>> >> The issue is that generally you need to provide a plugin to these
>> >> tools in order for them to understand and handle bitcode files. We'd
>> >> like standard tools to work without requiring a plugin as much as
>> >> possible. And in some cases we want them to be handled differently than
>> >> the way bitcode files are handled with the plugin.
>> >>
>> >> nm: Without a plugin, normal bitcode files are inscrutable. When
>> >> provided the gold plugin, it can emit the symbols.
>> >>
>> >> ar: Without a plugin, it will create an archive of bitcode files, but
>> >> without an index, so it can't be handled by the linker even with a
>> >> plugin on an -flto link. When ar is provided the gold plugin, it does
>> >> create an index, so the linker + gold plugin handle it appropriately
>> >> on an -flto link.
>> >>
>> >> ld -r: Without a plugin, it fails when provided bitcode inputs. When
>> >> provided the gold plugin, it handles them but compiles them all the
>> >> way through to ELF executable instructions via a partial LTO link.
>> >> This is where we would like to differ in behavior (while also not
>> >> requiring a plugin) with ELF-wrapped bitcode: we would like the ld -r
>> >> output file to still contain ELF-wrapped bitcode, delaying the LTO
>> >> until the full link step.
>> >>
>> >> Let me know if that helps address your concerns.
>> >>
>> >> Thanks,
>> >> Teresa
>> >>
>> >> >
>> >> > David
>> >> >
>> >> >>
>> >> >>
>> >> >> Alex
>> >> >>
>> >> >> > On May 13, 2015, at 7:44 PM, Teresa Johnson <[hidden email]>
>> >> >> > wrote:
>> >> >> >
>> >> >> > I've included below an RFC for implementing ThinLTO in LLVM, looking
>> >> >> > forward to feedback and questions.
>> >> >> > Thanks!
>> >> >> > Teresa
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > RFC to discuss plans for implementing ThinLTO upstream. Background can
>> >> >> > be found in slides from EuroLLVM 2015:
>> >> >> >    https://drive.google.com/open?id=0B036uwnWM6RWWER1ZEl5SUNENjQ&authuser=0
>> >> >> > As described in the talk, we have a prototype implementation, and
>> >> >> > would like to start staging patches upstream. This RFC describes a
>> >> >> > breakdown of the major pieces. We would like to commit upstream
>> >> >> > gradually in several stages, with all functionality off by default.
>> >> >> > The core ThinLTO importing support and tuning will require frequent
>> >> >> > change and iteration during testing and tuning, and for that part we
>> >> >> > would like to commit rapidly (off by default). See the proposed staged
>> >> >> > implementation described in the Implementation Plan section.
>> >> >> >
>> >> >> >
>> >> >> > ThinLTO Overview
>> >> >> > ==============
>> >> >> >
>> >> >> > See the talk slides linked above for more details. The following is a
>> >> >> > high-level overview of the motivation.
>> >> >> >
>> >> >> > Cross Module Optimization (CMO) is an effective means for improving
>> >> >> > runtime performance, by extending the scope of optimizations across
>> >> >> > source module boundaries. Without CMO, the compiler is limited to
>> >> >> > optimizing within the scope of single source modules. Two solutions
>> >> >> > for enabling CMO are Link-Time Optimization (LTO), which is currently
>> >> >> > supported in LLVM and GCC, and Lightweight-Interprocedural
>> >> >> > Optimization (LIPO). However, each of these solutions has limitations
>> >> >> > that prevent it from being enabled by default. ThinLTO is a new
>> >> >> > approach that attempts to address these limitations, with a goal of
>> >> >> > being enabled more broadly. ThinLTO is designed with many of the same
>> >> >> > principles as LIPO, and therefore shares its advantages, without its
>> >> >> > inherent weaknesses. Unlike LIPO, where the module group decision is
>> >> >> > made at profile training runtime, ThinLTO makes the decision at
>> >> >> > compile time, but in a lazy mode that facilitates large-scale
>> >> >> > parallelism. The serial linker plugin phase is designed to be razor
>> >> >> > thin and blazingly fast. By default this step only does minimal
>> >> >> > preparation work to enable the parallel lazy importing performed
>> >> >> > later. ThinLTO aims to be scalable like a regular O2 build, enabling
>> >> >> > CMO on machines without large memory configurations, while also
>> >> >> > integrating well with distributed build systems. Results from early
>> >> >> > prototyping on SPEC cpu2006 C++ benchmarks are in line with
>> >> >> > expectations that ThinLTO can scale like O2 while enabling much of the
>> >> >> > CMO performed during a full LTO build.
>> >> >> >
>> >> >> >
>> >> >> > A ThinLTO build is divided into 3 phases, which are referred to in the
>> >> >> > following implementation plan:
>> >> >> >
>> >> >> > phase-1: IR and Function Summary Generation (-c compile)
>> >> >> > phase-2: Thin Linker Plugin Layer (thin archive linker step)
>> >> >> > phase-3: Parallel Backend with Demand-Driven Importing
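The data flow between the three phases can be sketched as a toy Python model. This is not the prototype's code, and every detail below is an assumption made for illustration: a module is a dict from function name to (instruction count, callees), the phase-1 summary records only the instruction count, phase-2 merges the summaries into a combined function map, and phase-3 runs one backend per module in a thread pool, importing an external callee only when its summary says it is small. phase1_compile, phase2_thin_link, phase3_backend and the IMPORT_LIMIT threshold are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Module = Dict[str, Tuple[int, List[str]]]  # function -> (instruction count, callees)
Summary = Dict[str, Tuple[str, int]]       # function -> (defining module, instruction count)
IMPORT_LIMIT = 10                          # hypothetical import size threshold


def phase1_compile(name: str, module: Module) -> Summary:
    """Phase-1 (-c compile): emit a per-module function summary next to the IR."""
    return {fn: (name, size) for fn, (size, _callees) in module.items()}


def phase2_thin_link(summaries: List[Summary]) -> Summary:
    """Phase-2 (serial and thin): merge summaries into a combined function map."""
    combined: Summary = {}
    for summary in summaries:
        for fn, entry in summary.items():
            combined.setdefault(fn, entry)  # keep the first copy of duplicates
    return combined


def phase3_backend(module: Module, combined: Summary) -> List[str]:
    """Phase-3 (one parallel task per module): decide which callees to import."""
    imports = set()
    for _size, callees in module.values():
        for callee in callees:
            if callee in module or callee not in combined:
                continue  # defined locally, or not available for import
            source, size = combined[callee]
            if size <= IMPORT_LIMIT:
                imports.add(f"{source}:{callee}")
    return sorted(imports)


modules: Dict[str, Module] = {
    "a": {"main": (20, ["helper", "big"])},
    "b": {"helper": (3, []), "big": (500, [])},
}
combined = phase2_thin_link([phase1_compile(n, m) for n, m in modules.items()])
with ThreadPoolExecutor() as pool:
    decisions = pool.map(lambda m: phase3_backend(m, combined), modules.values())
    results = dict(zip(modules, decisions))
print(results)  # {'a': ['b:helper'], 'b': []}
```

In this model only phase-2 sees every module, and it touches nothing but summaries, which is what keeps the serial step thin; each phase-3 task needs only its own module plus the combined map.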
>> >> >> >
>> >> >> >
>> >> >> > Implementation Plan
>> >> >> > ================
>> >> >> >
>> >> >> > This section gives a high-level breakdown of the ThinLTO support that
>> >> >> > will be added, in roughly the order that the patches would be staged.
>> >> >> > The patches are divided into three stages. The first stage contains a
>> >> >> > minimal amount of preparation work that is not ThinLTO-specific. The
>> >> >> > second stage contains most of the infrastructure for ThinLTO, which
>> >> >> > will be off by default. The third stage includes
>> >> >> > enhancements/improvements/tunings that can be performed after the main
>> >> >> > ThinLTO infrastructure is in.
>> >> >> >
>> >> >> > The second and third implementation stages will initially be very
>> >> >> > volatile, requiring a lot of iterations and tuning with large apps to
>> >> >> > get stabilized. Therefore it will be important to do fast commits for
>> >> >> > these implementation stages.
>> >> >> >
>> >> >> >
>> >> >> > 1. Stage 1: Preparation
>> >> >> > -------------------------------
>> >> >> >
>> >> >> > The first planned sets of patches are enablers for ThinLTO work:
>> >> >> >
>> >> >> >
>> >> >> > a. LTO directory structure:
>> >> >> >
>> >> >> > Restructure the LTO directory to remove the circular dependence that
>> >> >> > arises when the ThinLTO pass is added. Because ThinLTO is being
>> >> >> > implemented as an SCC pass within Transforms/IPO, and leverages the
>> >> >> > LTOModule class for linking in functions from modules, IPO then
>> >> >> > requires the LTO library. This creates a circular dependence between
>> >> >> > LTO and IPO. To break that, we need to split the lib/LTO
>> >> >> > directory/library into lib/LTO/CodeGen and lib/LTO/Module, containing
>> >> >> > LTOCodeGenerator and LTOModule, respectively. Only LTOCodeGenerator
>> >> >> > has a dependence on IPO, removing
>> >> >> > the circular dependence.
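The layering problem can be checked mechanically with a depth-first search over a library dependency graph. In the Python sketch below, the edges come from the description above (IPO needs LTOModule, and LTOCodeGenerator needs IPO); the two graphs and the find_cycle helper are illustrative and are not derived from LLVM's actual build files.

```python
from typing import Dict, List, Optional


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    color = {node: WHITE for node in graph}
    stack: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GREY
        stack.append(node)
        for dep in graph.get(node, []):
            if color.get(dep, WHITE) == GREY:
                return stack[stack.index(dep):] + [dep]  # back edge closes a cycle
            if color.get(dep, WHITE) == WHITE:
                found = visit(dep)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Single LTO library: LTOCodeGenerator -> IPO, and IPO -> LTOModule.
before = {"IPO": ["LTO"], "LTO": ["IPO"]}
# Proposed split: only LTO/CodeGen depends on IPO.
after = {"IPO": ["LTO/Module"], "LTO/CodeGen": ["IPO"], "LTO/Module": []}
print(find_cycle(before))  # ['IPO', 'LTO', 'IPO']
print(find_cycle(after))   # None
```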
>> >> >> >
>> >> >> >
>> >> >> > b. ELF wrapper generation support:
>> >> >> >
>> >> >> > Implement an ELF-wrapped bitcode writer. In order to interact more easily with tools such as $AR, $NM, and “$LD -r”, we plan to emit the phase-1 bitcode wrapped in ELF via the .llvmbc section, along with a symbol table. The goal is both to interact with these tools without requiring a plugin, and also to avoid doing partial LTO/ThinLTO across files linked with “$LD -r” (i.e. the resulting object file should still contain ELF-wrapped bitcode to enable ThinLTO at the full link step). I will send a separate design document for these changes, but the following is a high-level overview.
>> >> >> >
>> >> >> > Support was added to LLVM for reading ELF-wrapped bitcode (http://reviews.llvm.org/rL218078), but LLVM/Clang does not yet support emitting bitcode wrapped in ELF. I plan to add support for optionally generating bitcode in an ELF file containing a single .llvmbc section holding the bitcode. Specifically, the patch would add new options “emit-llvm-bc-elf” (object file) and a corresponding “emit-llvm-elf” (textual assembly equivalent). Eventually these would be triggered automatically under “-fthinlto -c” and “-fthinlto -S”, respectively.
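The practical difference between raw and ELF-wrapped bitcode shows up in a file's first bytes: the system ar, nm, and ld recognize the ELF magic number, while raw bitcode begins with LLVM's own magic number, which those tools do not understand without a plugin. A minimal C++ sketch of that check, assuming the whole file has already been read into memory; the enum and function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

enum class InputKind { RawBitcode, ELF, Unknown };

// True if Buf begins with the bytes in Magic.
static bool startsWith(const std::vector<uint8_t> &Buf,
                       const std::vector<uint8_t> &Magic) {
  return Buf.size() >= Magic.size() &&
         std::equal(Magic.begin(), Magic.end(), Buf.begin());
}

// Classifies an input file by its leading magic number.
InputKind classify(const std::vector<uint8_t> &Buf) {
  // ELF objects begin with 0x7F 'E' 'L' 'F'; standard binutils know this.
  if (startsWith(Buf, {0x7F, 'E', 'L', 'F'}))
    return InputKind::ELF;
  // Raw LLVM bitcode begins with 'B' 'C' 0xC0 0xDE.
  if (startsWith(Buf, {'B', 'C', 0xC0, 0xDE}))
    return InputKind::RawBitcode;
  return InputKind::Unknown;
}
```

An ELF-wrapped file classifies as ELF here; locating its .llvmbc section would additionally require walking the ELF section header table, which this sketch leaves out.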
>> >> >> >
>> >> >> > Additionally, a symbol table will be generated in the ELF file, holding the function symbols within the bitcode. This facilitates handling archives of the ELF-wrapped bitcode created with $AR, since the archive will have a symbol table as well. The archive symbol table enables gold to extract the constituent ELF-wrapped bitcode files and pass them to the plugin. To support the concatenated .llvmbc section generated by “$LD -r”, some handling needs to be added to gold and to the backend driver to process each original module’s bitcode.
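The archive symbol table matters because a linker extracts archive members on demand: each undefined symbol is looked up in the index, the defining member is pulled in, and that member's own undefined symbols are resolved in turn. A simplified C++ model of that loop, assuming the per-member symbol lists come from the ELF wrapper's symbol table; the types and functions are hypothetical and do not describe gold's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One archive member as the index sees it. With ELF-wrapped bitcode these
// symbol lists could come from the wrapper's symbol table, without a plugin.
struct Member {
  std::string Name;
  std::vector<std::string> Defined;
  std::vector<std::string> Undefined;
};

// Archive symbol table: symbol name -> index of the defining member.
// emplace keeps the first definition (a simplifying assumption).
std::map<std::string, size_t> buildIndex(const std::vector<Member> &Members) {
  std::map<std::string, size_t> Index;
  for (size_t I = 0; I < Members.size(); ++I)
    for (const std::string &Sym : Members[I].Defined)
      Index.emplace(Sym, I);
  return Index;
}

// Extracts members on demand: each unresolved symbol found in the index pulls
// in its member, whose own undefined symbols are then resolved in turn.
std::vector<std::string> extractMembers(const std::vector<Member> &Members,
                                        std::vector<std::string> Unresolved) {
  const std::map<std::string, size_t> Index = buildIndex(Members);
  std::vector<bool> Extracted(Members.size(), false);
  std::vector<std::string> Order;
  while (!Unresolved.empty()) {
    std::string Sym = Unresolved.back();
    Unresolved.pop_back();
    auto It = Index.find(Sym);
    if (It == Index.end() || Extracted[It->second])
      continue;
    Extracted[It->second] = true;
    const Member &M = Members[It->second];
    Order.push_back(M.Name);
    for (const std::string &U : M.Undefined)
      Unresolved.push_back(U);
  }
  return Order;
}
```

Without an index (the plain-ar case discussed later in this thread), the lookup table is simply missing, so the linker has no way to decide which members to extract.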
>> >> >> >
>> >> >> > The function index/summary will later be added as a special ELF
>> >> >> > section alongside the .llvmbc sections.
>> >> >> >
>> >> >> >
>> >> >> > 2. Stage 2: ThinLTO Infrastructure
>> >> >> > ----------------------------------------------
>> >> >> >
>> >> >> > The next set of patches adds the base implementation of the ThinLTO infrastructure, specifically the pieces required to make ThinLTO functional and generate correct, but not necessarily high-performing, binaries. It also does not include support for making debug info under -g efficient with ThinLTO.
>> >> >> >
>> >> >> >
>> >> >> > a. Clang/LLVM/gold linker options:
>> >> >> >
>> >> >> > An early set of clang/llvm patches is needed to provide options to enable ThinLTO (off by default), so that the rest of the implementation can be disabled by default as it is added. Specifically, the clang option -fthinlto (used instead of -flto) will cause clang to invoke the phase-1 emission of LLVM bitcode and function summary/index on a compile step, and to pass the appropriate option to the gold plugin on a link step. The -thinlto option will be added to the gold plugin and the llvm-lto tool to launch the phase-2 thin archive step. The -thinlto option will also be added to the ‘opt’ tool to invoke it as a phase-3 parallel backend instance.
>> >> >> >
>> >> >> >
>> >> >> > b. Thin-archive linking support in Gold plugin and llvm-lto:
>> >> >> >
>> >> >> > Under the new plugin option (see above), the plugin needs to perform the phase-2 (thin archive) link, which simply emits a combined function map from the linked modules without actually performing the normal link. Corresponding support should be added to the standalone llvm-lto tool to enable testing/debugging without involving the linker and plugin.
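As a rough illustration of the phase-2 output, the C++ sketch below merges per-module function indexes into one combined map without touching any IR. It assumes each index maps a function name to its bitcode offset and an instruction count; the struct names are hypothetical, and keeping the first definition of a duplicated name is a simplification chosen for the sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Per-function data from a module's index/summary section (hypothetical layout).
struct FunctionInfo {
  uint64_t BitcodeOffset; // where the function's body starts in its module
  unsigned InstCount;     // summary info used later by import heuristics
};

struct ModuleIndex {
  std::string Path; // IR file the index was read from
  std::map<std::string, FunctionInfo> Functions;
};

struct CombinedEntry {
  std::string ModulePath; // which file to import the function from
  FunctionInfo Info;
};

using CombinedMap = std::map<std::string, CombinedEntry>;

// Phase-2 "thin archive" step: merge the per-module indexes without linking
// any IR. When a name appears in several modules (e.g. a comdat function),
// emplace keeps the first entry seen.
CombinedMap combineIndexes(const std::vector<ModuleIndex> &Modules) {
  CombinedMap Map;
  for (const ModuleIndex &M : Modules)
    for (const auto &F : M.Functions)
      Map.emplace(F.first, CombinedEntry{M.Path, F.second});
  return Map;
}
```

Because only index entries are copied, the cost of this step grows with the number of exported functions rather than with the size of the IR, which is what keeps the serial phase thin.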
>> >> >> >
>> >> >> >
>> >> >> > c. ThinLTO backend support:
>> >> >> >
>> >> >> > Support for a phase-3 backend invocation (including importing) on a module should be added to the ‘opt’ tool under the new option. The main changes under the option are to instantiate a Linker object used to manage the process of linking imported functions into the module, to efficiently read the combined function map, and to enable the ThinLTO import pass.
>> >> >> >
>> >> >> >
>> >> >> > d. Function index/summary support:
>> >> >> >
>> >> >> > This includes infrastructure for writing and reading the function index/summary section. As noted earlier, this will be encoded in a special ELF section within the module, alongside the .llvmbc section containing the bitcode. The thin archive generated by phase-2 of ThinLTO simply contains all of the function index/summary sections across the linked modules, organized for efficient function lookup.
>> >> >> >
>> >> >> > Each function available for importing from the module has an entry in the module’s function index/summary section and in the resulting combined function map. Each function entry contains that function’s offset within the bitcode file, used to efficiently locate and import just that function. The entry also contains summary information (e.g. basic information determined during parsing, such as the number of instructions in the function) that will be used to help guide later import decisions. Because the contents of this section will change frequently during ThinLTO tuning, it should also be marked with a version id for backwards compatibility or version checking.
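To make the entry contents and the version check concrete, here is a C++ sketch of one possible encoding: a version header precedes the entries, and the reader rejects a section whose version it does not recognize, as well as truncated input. The byte layout and the version constant are hypothetical; the RFC does not specify an encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical version id; bump whenever the entry layout changes.
constexpr uint32_t kSummaryVersion = 1;

struct SummaryEntry {
  std::string Name;
  uint64_t BitcodeOffset; // lets the reader seek straight to the function
  uint32_t InstCount;     // summary info for import decisions
};

// Appends V as `Bytes` little-endian bytes.
static void putLE(std::vector<uint8_t> &Out, uint64_t V, int Bytes) {
  for (int I = 0; I < Bytes; ++I)
    Out.push_back(static_cast<uint8_t>(V >> (8 * I)));
}

std::vector<uint8_t> writeSummary(const std::vector<SummaryEntry> &Entries) {
  std::vector<uint8_t> Out;
  putLE(Out, kSummaryVersion, 4); // version header comes first
  putLE(Out, Entries.size(), 4);
  for (const SummaryEntry &E : Entries) {
    putLE(Out, E.Name.size(), 4);
    Out.insert(Out.end(), E.Name.begin(), E.Name.end());
    putLE(Out, E.BitcodeOffset, 8);
    putLE(Out, E.InstCount, 4);
  }
  return Out;
}

// Bounds-checked little-endian cursor over a byte buffer.
struct Cursor {
  const std::vector<uint8_t> &Buf;
  size_t Pos = 0;

  bool get(uint64_t &V, size_t Bytes) {
    if (Buf.size() - Pos < Bytes) return false;
    V = 0;
    for (size_t I = 0; I < Bytes; ++I)
      V |= uint64_t(Buf[Pos++]) << (8 * I);
    return true;
  }
  bool getString(std::string &S, uint64_t Len) {
    if (Buf.size() - Pos < Len) return false;
    S.assign(Buf.begin() + Pos, Buf.begin() + Pos + Len);
    Pos += Len;
    return true;
  }
};

// Returns std::nullopt on a version mismatch or a truncated section.
std::optional<std::vector<SummaryEntry>>
readSummary(const std::vector<uint8_t> &Buf) {
  Cursor C{Buf};
  uint64_t Version = 0, Count = 0;
  if (!C.get(Version, 4) || Version != kSummaryVersion) return std::nullopt;
  if (!C.get(Count, 4)) return std::nullopt;
  std::vector<SummaryEntry> Entries;
  for (uint64_t I = 0; I < Count; ++I) {
    SummaryEntry E{};
    uint64_t Len = 0, Offset = 0, Insts = 0;
    if (!C.get(Len, 4) || !C.getString(E.Name, Len) || !C.get(Offset, 8) ||
        !C.get(Insts, 4))
      return std::nullopt;
    E.BitcodeOffset = Offset;
    E.InstCount = static_cast<uint32_t>(Insts);
    Entries.push_back(E);
  }
  return Entries;
}
```

Rejecting unknown versions outright is the simplest policy; a real reader could instead keep decoders for older versions to provide the backwards compatibility the paragraph mentions.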
>> >> >> >
>> >> >> >
>> >> >> > e. ThinLTO importing support:
>> >> >> >
>> >> >> > Support for the mechanics of importing functions from other modules can go in gradually as a set of patches, since it will be off by default. Separate patches can include:
>> >> >> >
>> >> >> > - BitcodeReader changes to use the function index to import/deserialize a single function of interest (small changes, leveraging existing lazy streamer support).
>> >> >> >
>> >> >> > - Minor LTOModule changes to pass the ThinLTO function to import and its index into the bitcode reader.
>> >> >> >
>> >> >> > - Marking of imported functions (for use in ThinLTO-specific symbol linking and global DCE, for example). This can be in-memory initially, but IR support may be required in order to support streaming bitcode out and back in again after importing.
>> >> >> >
>> >> >> > - ModuleLinker changes to do ThinLTO-specific symbol linking and static promotion when necessary. The linkage type of imported functions changes to AvailableExternallyLinkage, for example. Statics must be promoted in certain cases, and renamed in consistent ways.
>> >> >> >
>> >> >> > - GlobalDCE changes to support removing imported functions that were not inlined (very small changes to existing pass logic).
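To illustrate what "renamed in consistent ways" requires, here is a C++ sketch assuming a hypothetical scheme in which a promoted static is named <name>.llvm.<hash of the defining module's id>. Because the suffix depends only on the source module, the exporting module and every importing module derive the same name independently, in separate backend processes. The naming scheme and helper names are assumptions, not the ModuleLinker's actual behavior.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

enum class Linkage { External, Internal, AvailableExternally };

struct GlobalSym {
  std::string Name;
  Linkage L;
};

// 64-bit FNV-1a: a stable hash, so every backend process derives the same
// suffix for the same module without coordination.
static uint64_t fnv1a(const std::string &S) {
  uint64_t H = 14695981039346656037ull;
  for (unsigned char C : S) {
    H ^= C;
    H *= 1099511628211ull;
  }
  return H;
}

// Hypothetical naming scheme: <name>.llvm.<16 hex digits of module hash>.
std::string promotedName(const std::string &Name, const std::string &ModuleId) {
  char Hex[17];
  std::snprintf(Hex, sizeof Hex, "%016llx",
                static_cast<unsigned long long>(fnv1a(ModuleId)));
  return Name + ".llvm." + Hex;
}

// Exporting module: a static (internal) symbol that other modules may now
// reference is promoted to external linkage under its unique name.
GlobalSym promoteInExporter(const GlobalSym &G, const std::string &ModuleId) {
  if (G.L != Linkage::Internal)
    return G;
  return {promotedName(G.Name, ModuleId), Linkage::External};
}

// Importing module: the imported copy uses the same (possibly promoted) name
// and available_externally linkage, so it can be inlined but is not emitted.
GlobalSym importedCopy(const GlobalSym &G, const std::string &SourceModuleId) {
  std::string Name = G.L == Linkage::Internal
                         ? promotedName(G.Name, SourceModuleId)
                         : G.Name;
  return {Name, Linkage::AvailableExternally};
}
```

Hashing the module id, rather than using a counter, avoids any ordering dependence between the parallel backends; the trade-off is that the module id itself must be stable across the build.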
>> >> >> >
>> >> >> >
>> >> >> > f. ThinLTO Import Driver SCC pass:
>> >> >> >
>> >> >> > Adds Transforms/IPO/ThinLTO.cpp with a framework for doing ThinLTO via an SCC pass, enabled only under the -fthinlto options. The pass includes use of the thin archive (global function index/summary), import decision heuristics, invocation of the LTOModule/ModuleLinker routines that perform the import, and any necessary callgraph updates and verification.
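The import decision heuristics are left open here. As one hypothetical example, the C++ sketch below imports a callee when its summary's instruction count is under a threshold and then considers that callee's callees as well. Recording call edges in the summary is an assumption of this sketch (the summary described above only mentions instruction counts), as are all of the names.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical combined-map entry: the instruction count comes from the
// summary; the callee list is an extra assumption for this sketch.
struct Summary {
  std::string ModulePath;
  unsigned InstCount;
  std::vector<std::string> Callees;
};

// Decides which external functions a module should import. Starting from the
// module's own call sites, import any callee that is small enough, then
// consider that callee's callees too (they may be inlined along with it).
std::set<std::string>
chooseImports(const std::map<std::string, Summary> &Combined,
              const std::set<std::string> &LocalDefs,
              const std::vector<std::string> &LocalCallees,
              unsigned InstThreshold) {
  std::set<std::string> Imports;
  std::vector<std::string> Worklist(LocalCallees);
  while (!Worklist.empty()) {
    std::string Callee = Worklist.back();
    Worklist.pop_back();
    if (LocalDefs.count(Callee) || Imports.count(Callee))
      continue; // already available in this module
    auto It = Combined.find(Callee);
    if (It == Combined.end() || It->second.InstCount > InstThreshold)
      continue; // not importable (e.g. a library function) or too large
    Imports.insert(Callee);
    for (const std::string &Next : It->second.Callees)
      Worklist.push_back(Next);
  }
  return Imports;
}
```

A fixed threshold is only a starting point; profile data or callsite context could replace it, which is the kind of tuning Stage 3 describes.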
>> >> >> >
>> >> >> >
>> >> >> > g. Backend Driver:
>> >> >> >
>> >> >> > For a single-node build, the gold plugin can simply write a makefile and fork the parallel backend instances directly via parallel make.
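A minimal C++ sketch of the makefile the plugin might write: one rule per backend job, each depending on its IR file and on the combined function map, plus an "all" target so that parallel make (for example make -j16) can schedule the jobs. The backend command is a caller-supplied placeholder, since its exact invocation is not fixed here, and all names are hypothetical.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

struct BackendJob {
  std::string InputIR;   // phase-1 output for one module
  std::string OutputObj; // real object file path expected by the final link
};

// Emits one make rule per phase-3 backend job plus an "all" target, so that
// `make -jN` runs the backends in parallel. BackendCmd is a placeholder for
// whatever command runs a backend instance; it is not a real flag set.
std::string writeMakefile(const std::vector<BackendJob> &Jobs,
                          const std::string &CombinedMap,
                          const std::string &BackendCmd) {
  std::ostringstream MF;
  MF << "all:";
  for (const BackendJob &J : Jobs)
    MF << ' ' << J.OutputObj;
  MF << "\n\n";
  for (const BackendJob &J : Jobs) {
    // Each object depends on its own IR and on the combined function map.
    MF << J.OutputObj << ": " << J.InputIR << ' ' << CombinedMap << '\n'
       << '\t' << BackendCmd << ' ' << J.InputIR << " -o " << J.OutputObj
       << "\n\n";
  }
  return MF.str();
}
```

Listing the combined map as a prerequisite means make reruns every backend when the map changes, which is conservative but correct.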
>> >> >> >
>> >> >> >
>> >> >> > 3. Stage 3: ThinLTO Tuning and Enhancements
>> >> >> > ----------------------------------------------------------------
>> >> >> >
>> >> >> > This refers to the patches that are not required for ThinLTO to work, but rather improve compile time, memory usage, run-time performance, and usability.
>> >> >> >
>> >> >> >
>> >> >> > a. Lazy Debug Metadata Linking:
>> >> >> >
>> >> >> > The prototype implementation included lazy importing of module-level metadata during the ThinLTO pass finalization (i.e. after all function importing is complete). This actually applies to all module-level metadata, not just debug metadata, although debug metadata is the largest. This can be added as a separate set of patches, with changes to BitcodeReader, ValueMapper, and ModuleLinker.
>> >> >> >
>> >> >> >
>> >> >> > b. Import Tuning:
>> >> >> >
>> >> >> > Tuning the import strategy will be an iterative process that will continue to be refined over time. It involves several different types of changes: adding support for recording additional metrics in the function summary, such as profile data and optional heavier-weight IPA analyses, and tuning the import heuristics based on the summary and callsite context.
>> >> >> >
>> >> >> >
>> >> >> > c. Combined Function Map Pruning:
>> >> >> >
>> >> >> > The combined function map can be pruned of functions that are unlikely to benefit from being imported. For example, during the phase-2 thin archive plugin step we can safely omit large and (with profile data) cold functions, which are unlikely to benefit from being inlined. Additionally, all but one copy of comdat functions can be suppressed.
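The pruning rules above can be sketched in C++ as a single filtering pass over the combined map. The size threshold, the cold flag (assumed to be derived from profile data), and the field names are hypothetical.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

struct MapEntry {
  std::string Name;
  std::string ModulePath;
  unsigned InstCount;
  bool KnownCold;     // only available when profile data exists
  std::string Comdat; // comdat group name, empty if none
};

// Drops entries unlikely to pay off when imported: functions over the size
// limit, functions profile data marks as cold, and every copy of a comdat
// function after the first one kept.
std::vector<MapEntry> pruneCombinedMap(const std::vector<MapEntry> &Entries,
                                       unsigned MaxInsts) {
  std::vector<MapEntry> Kept;
  std::set<std::string> SeenComdats;
  for (const MapEntry &E : Entries) {
    if (E.InstCount > MaxInsts || E.KnownCold)
      continue;
    if (!E.Comdat.empty() && !SeenComdats.insert(E.Comdat).second)
      continue; // another copy of this comdat group is already kept
    Kept.push_back(E);
  }
  return Kept;
}
```

Keeping any one comdat copy is safe because the copies are interchangeable by definition; which copy survives only affects which file a backend reads it from.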
>> >> >> >
>> >> >> >
>> >> >> > d. Distributed Build System Integration:
>> >> >> >
>> >> >> > For a distributed build system, the gold plugin should write the parallel backend invocations into a makefile, including the mapping from each IR file to the real object file path, and exit. Additional work needs to be done in the distributed build system itself to distribute and dispatch the parallel backend jobs to the build cluster.
>> >> >> >
>> >> >> >
>> >> >> > e. Dependence Tracking and Incremental Compiles:
>> >> >> >
>> >> >> > In order to support build systems that stage from local disks or network storage, the plugin will optionally support computing the dependent set of IR files that each module may import from. This can be computed from profile data, if it exists, or from the symbol table and heuristics if not. These dependence sets also enable support for incremental backend compiles.
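A C++ sketch of the symbol-table variant of dependence tracking, assuming each module's defined and referenced symbols are known: a module may import from any other module that defines a symbol it references, and its backend needs rerunning if it or any of those modules changed. The structure and function names are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

struct ModuleSyms {
  std::string Path;
  std::set<std::string> Defined;    // from the module's symbol table
  std::set<std::string> Referenced; // undefined references
};

using DepMap = std::map<std::string, std::set<std::string>>;

// For each module, the set of other IR files it might import from: every
// module that defines a symbol it references. This is the conservative
// symbol-table variant; profile data could shrink these sets.
DepMap computeImportDeps(const std::vector<ModuleSyms> &Mods) {
  std::map<std::string, std::string> Definer;
  for (const ModuleSyms &M : Mods)
    for (const std::string &S : M.Defined)
      Definer.emplace(S, M.Path);
  DepMap Deps;
  for (const ModuleSyms &M : Mods) {
    std::set<std::string> &D = Deps[M.Path]; // every module gets an entry
    for (const std::string &S : M.Referenced) {
      auto It = Definer.find(S);
      if (It != Definer.end() && It->second != M.Path)
        D.insert(It->second);
    }
  }
  return Deps;
}

// Incremental backend compiles: a module's backend must rerun if the module
// itself or any module it may import from has changed.
std::set<std::string> modulesToRebuild(const DepMap &Deps,
                                       const std::set<std::string> &Changed) {
  std::set<std::string> Out;
  for (const auto &Entry : Deps) {
    bool Stale = Changed.count(Entry.first) > 0;
    for (const std::string &Dep : Entry.second)
      Stale = Stale || Changed.count(Dep) > 0;
    if (Stale)
      Out.insert(Entry.first);
  }
  return Out;
}
```

The same dependence sets tell a staging build system which IR files to copy to the machine running each backend job.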
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > --
>> >> >> > Teresa Johnson | Software Engineer | [hidden email] |
>> >> >> > 408-460-2413
>> >> >> >
>> >> >> > _______________________________________________
>> >> >> > LLVM Developers mailing list
>> >> >> > [hidden email]         http://llvm.cs.uiuc.edu
>> >> >> > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>> >> --
>> >> Teresa Johnson | Software Engineer | [hidden email] |
>> >> 408-460-2413
>> --
>> Teresa Johnson | Software Engineer | [hidden email] | 408-460-2413



--
Teresa Johnson | Software Engineer | [hidden email] | 408-460-2413


Re: RFC: ThinLTO Implementation Plan

Teresa Johnson
In reply to this post by Eric Christopher
On Thu, May 14, 2015 at 1:18 PM, Eric Christopher <[hidden email]> wrote:

>
>
> On Thu, May 14, 2015 at 1:11 PM David Blaikie <[hidden email]> wrote:
>>
>> On Thu, May 14, 2015 at 12:53 PM, Eric Christopher <[hidden email]> wrote:
>>>
>>>
>>>
>>> On Thu, May 14, 2015 at 11:34 AM Daniel Berlin <[hidden email]> wrote:
>>>>
>>>> On Thu, May 14, 2015 at 11:14 AM, Eric Christopher <[hidden email]> wrote:
>>>> > I'm not sure this is a particularly great assumption to make.
>>>>
>>>> Which part?
>>>
>>>
>>> The binutils part :)
>>>
>>>>
>>>>
>>>> > We have to support a lot of different build systems and tools, and concentrating on something that just binutils uses isn't particularly friendly here.
>>>> I think you may have misunderstood.
>>>> His point was exactly that they want to be transparent to *all of* these tools.
>>>> You are saying "we should be friendly to everyone". He is saying the same thing.
>>>> We should be friendly to everyone. The friendly way to do this is to not require all of these tools to build plugins to handle bitcode.
>>>>
>>>> Hence, ELF-wrapped bitcode.
>>>
>>>
>>> Oh, I understood. I just don't know that I agree. To do anything with the
>>> tools will require some knowledge of bitcode anyhow or need the plugin. I'm
>>> saying that as a baseline start we should look at how to do this using the
>>> tools we've got rather than wrapping things for no real gain.
>>
>>
>> That doesn't seem strictly true: consider the ar situation (which I'm led to believe is in use in our build system and, one would assume, in others). With the symbol table included as proposed, ar can be used without any knowledge of the bitcode or need for a plugin.
>>
>
> For some bits, sure. Optimizing for ar seems a bit silly, why not 'ld -r'?

But as mentioned, ld -r can work on native object wrapped bitcode
without a plugin as well.

> ;)
>
>>
>> It'd be helpful to have the scenarios we're trying to support with these
>> tools & then weigh up the alternatives.
>>
>
>
> Agreed. The ar situation is interesting because one thing we discussed after
> you wandered off was just adding a ToC section to bitcode as it is and then
> having the tools handle that. Would seem to accomplish at least the goals as
> I've seen them up to this point without worrying too much.

The ToC section is a way we can encode the function index/summary into
bitcode, but won't help integrate with existing tools. The main issue
we are trying to solve is integrating transparently with existing
binutils tools in use in our build system and probably elsewhere.

>
> At any rate, I think this aspect of the proposal needs a bit of discussion
> and some mapping out of the pros and cons here.

Sure, we can continue to discuss and I will try to lay out the pros/cons.

Teresa

>
> -eric
>
>>>
>>> I've talked to Teresa a bit offline and we're going to talk more later
>>> (and discuss on the list), but there are some discussions about how to make
>>> this work either with just bitcode/llvm tools and so not requiring
>>> integration on all platforms. The latter is what I consider as particularly
>>> friendly :)
>>>
>>> -eric
>>>
>>>>
>>>>
>>>>
>>>> > I also can't imagine how it's necessary for any of the LTO aspects as currently written in the proposal.
>>>> >
>>>> > -eric
>>>> >
>>>> > On Thu, May 14, 2015 at 9:26 AM Xinliang David Li <[hidden email]> wrote:
>>>> >>
>>>> >> The design objective is to make ThinLTO mostly transparent to binutils tools, to enable easy integration with any build system in the wild. 'Pass-through' mode with 'ld -r', instead of the partial LTO mode, is another reason.
>>>> >>
>>>> >> David
>>>> >>
>>>> >> On Thu, May 14, 2015 at 7:30 AM, Teresa Johnson <[hidden email]> wrote:
>>>> >>>
>>>> >>> On Thu, May 14, 2015 at 7:22 AM, Eric Christopher <[hidden email]> wrote:
>>>> >>> > So, what Alex is saying is that we have these tools as well and they understand bitcode just fine, as well as every object format, not just ELF. :)
>>>> >>>
>>>> >>> Right, there are also LLVM-specific versions (llvm-ar, llvm-nm) that handle bitcode similarly to the way the standard tool + plugin does. But the goal we are trying to achieve is to allow the standard system versions of the tools to handle these files without requiring a plugin. I know the LLVM tools handle other object formats, but I'm not sure how that helps here. We're not planning to replace those tools, just allow the standard system versions to handle the intermediate objects produced by ThinLTO.
>>>> >>>
>>>> >>> Thanks,
>>>> >>> Teresa
>>>> >>>
>>>> >>> >
>>>> >>> > -eric
>>>> >>> >
>>>> >>> >
>>>> >>> > On Thu, May 14, 2015, 6:55 AM Teresa Johnson <[hidden email]> wrote:
>>>> >>> >>
>>>> >>> >> On Wed, May 13, 2015 at 11:23 PM, Xinliang David Li <[hidden email]> wrote:
>>>> >>> >> >
>>>> >>> >> >
>>>> >>> >> > On Wed, May 13, 2015 at 10:46 PM, Alex Rosenberg <[hidden email]> wrote:
>>>> >>> >> >>
>>>> >>> >> >> "ELF-wrapped bitcode" seems potentially controversial to me.
>>>> >>> >> >>
>>>> >>> >> >> What about ar, nm, and various ld implementations adds this
>>>> >>> >> >> requirement?
>>>> >>> >> >> What about the LLVM implementations of these tools is lacking?
>>>> >>> >> >
>>>> >>> >> >
>>>> >>> >> > Sorry, I cannot parse your questions properly. Can you make them clearer?
>>>> >>> >>
>>>> >>> >> Alex is asking what the issue is with ar, nm, ld -r and regular
>>>> >>> >> bitcode that makes using elf-wrapped bitcode easier.
>>>> >>> >>
>>>> >>> >> The issue is that generally you need to provide a plugin to these tools in order for them to understand and handle bitcode files. We'd like standard tools to work without requiring a plugin as much as possible. And in some cases we want them to be handled differently than the way bitcode files are handled with the plugin.
>>>> >>> >>
>>>> >>> >> nm: Without a plugin, normal bitcode files are inscrutable. When
>>>> >>> >> provided the gold plugin it can emit the symbols.
>>>> >>> >>
>>>> >>> >> ar: Without a plugin, it will create an archive of bitcode files, but without an index, so it can't be handled by the linker even with a plugin on an -flto link. When ar is provided the gold plugin it does create an index, so the linker + gold plugin handle it appropriately on an -flto link.
>>>> >>> >>
>>>> >>> >> ld -r: Without a plugin, it fails when provided bitcode inputs. When provided the gold plugin, it handles them but compiles them all the way through to ELF executable instructions via a partial LTO link. This is where we would like to differ in behavior (while also not requiring a plugin) with ELF-wrapped bitcode: we would like the ld -r output file to still contain ELF-wrapped bitcode, delaying the LTO until the full link step.
>>>> >>> >>
>>>> >>> >> Let me know if that helps address your concerns.
>>>> >>> >>
>>>> >>> >> Thanks,
>>>> >>> >> Teresa
>>>> >>> >>
>>>> >>> >> >
>>>> >>> >> > David
>>>> >>> >> >
>>>> >>> >> >>
>>>> >>> >> >>
>>>> >>> >> >> Alex
>>>> >>> >> >>
>>>> >>> >> >> > On May 13, 2015, at 7:44 PM, Teresa Johnson <[hidden email]> wrote:
>>>> >>> >> >> >
>>>> >>> >> >> > I've included below an RFC for implementing ThinLTO in LLVM, looking forward to feedback and questions.
>>>> >>> >> >> > Thanks!
>>>> >>> >> >> > Teresa
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > RFC to discuss plans for implementing ThinLTO upstream. Background can be found in slides from EuroLLVM 2015:
>>>> >>> >> >> > https://drive.google.com/open?id=0B036uwnWM6RWWER1ZEl5SUNENjQ&authuser=0)
>>>> >>> >> >> > As described in the talk, we have a prototype implementation, and would like to start staging patches upstream. This RFC describes a breakdown of the major pieces. We would like to commit upstream gradually in several stages, with all functionality off by default. The core ThinLTO importing support and tuning will require frequent change and iteration during testing and tuning, and for that part we would like to commit rapidly (off by default). See the proposed staged implementation described in the Implementation Plan section.
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > ThinLTO Overview
>>>> >>> >> >> > ==============
>>>> >>> >> >> >
>>>> >>> >> >> > See the talk slides linked above for more details. The following is a high-level overview of the motivation.
>>>> >>> >> >> >
>>>> >>> >> >> > Cross Module Optimization (CMO) is an effective means for improving runtime performance, by extending the scope of optimizations across source module boundaries. Without CMO, the compiler is limited to optimizing within the scope of single source modules. Two solutions for enabling CMO are Link-Time Optimization (LTO), which is currently supported in LLVM and GCC, and Lightweight-Interprocedural Optimization (LIPO). However, each of these solutions has limitations that prevent it from being enabled by default. ThinLTO is a new approach that attempts to address these limitations, with a goal of being enabled more broadly. ThinLTO is designed with many of the same principles as LIPO, and therefore shares its advantages, without any of its inherent weaknesses. Unlike LIPO, where the module group decision is made at profile training runtime, ThinLTO makes the decision at compile time, but in a lazy mode that facilitates large-scale parallelism. The serial linker plugin phase is designed to be razor thin and blazingly fast. By default this step only does minimal preparation work to enable the parallel lazy importing performed later. ThinLTO aims to be scalable like a regular O2 build, enabling CMO on machines without large memory configurations, while also integrating well with distributed build systems. Results from early prototyping on SPEC cpu2006 C++ benchmarks are in line with expectations that ThinLTO can scale like O2 while enabling much of the CMO performed during a full LTO build.
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > A ThinLTO build is divided into 3 phases, which are referred to in the following implementation plan:
>>>> >>> >> >> >
>>>> >>> >> >> > phase-1: IR and Function Summary Generation (-c compile)
>>>> >>> >> >> > phase-2: Thin Linker Plugin Layer (thin archive linker step)
>>>> >>> >> >> > phase-3: Parallel Backend with Demand-Driven Importing
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > Implementation Plan
>>>> >>> >> >> > ================
>>>> >>> >> >> >
>>>> >>> >> >> > This section gives a high-level breakdown of the ThinLTO support that will be added, in roughly the order that the patches would be staged. The patches are divided into three stages. The first stage contains a minimal amount of preparation work that is not ThinLTO-specific. The second stage contains most of the infrastructure for ThinLTO, which will be off by default. The third stage includes enhancements, improvements, and tuning that can be performed after the main ThinLTO infrastructure is in.
>>>> >>> >> >> >
>>>> >>> >> >> > The second and third implementation stages will initially be very volatile, requiring many iterations and much tuning with large apps to stabilize. Therefore it will be important to commit quickly during these implementation stages.
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > 1. Stage 1: Preparation
>>>> >>> >> >> > -------------------------------
>>>> >>> >> >> >
>>>> >>> >> >> > The first planned sets of patches are enablers for ThinLTO work:
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > a. LTO directory structure:
>>>> >>> >> >> >
>>>> >>> >> >> > Restructure the LTO directory to remove a circular dependence when the ThinLTO pass is added. Because ThinLTO is being implemented as an SCC pass within Transforms/IPO, and leverages the LTOModule class for linking in functions from modules, IPO then requires the LTO library. This creates a circular dependence between LTO and IPO. To break it, we need to split the lib/LTO directory/library into lib/LTO/CodeGen and lib/LTO/Module, containing LTOCodeGenerator and LTOModule, respectively. Only LTOCodeGenerator depends on IPO, which removes the circular dependence.
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > b. ELF wrapper generation support:
>>>> >>> >> >> >
>>>> >>> >> >> > Implement an ELF-wrapped bitcode writer. In order to interact more easily with tools such as $AR, $NM, and “$LD -r”, we plan to emit the phase-1 bitcode wrapped in ELF via the .llvmbc section, along with a symbol table. The goal is both to interact with these tools without requiring a plugin, and also to avoid doing partial LTO/ThinLTO across files linked with “$LD -r” (i.e. the resulting object file should still contain ELF-wrapped bitcode to enable ThinLTO at the full link step). I will send a separate design document for these changes, but the following is a high-level overview.
>>>> >>> >> >> >
>>>> >>> >> >> > Support was added to LLVM for reading ELF-wrapped bitcode (http://reviews.llvm.org/rL218078), but LLVM/Clang does not yet support emitting bitcode wrapped in ELF. I plan to add support for optionally generating bitcode in an ELF file containing a single .llvmbc section holding the bitcode. Specifically, the patch would add new options “emit-llvm-bc-elf” (object file) and a corresponding “emit-llvm-elf” (textual assembly equivalent). Eventually these would be triggered automatically under “-fthinlto -c” and “-fthinlto -S”, respectively.
>>>> >>> >> >> >
>>>> >>> >> >> > Additionally, a symbol table will be generated in the ELF file, holding the function symbols within the bitcode. This facilitates handling archives of the ELF-wrapped bitcode created with $AR, since the archive will have a symbol table as well. The archive symbol table enables gold to extract the constituent ELF-wrapped bitcode files and pass them to the plugin. To support the concatenated .llvmbc section generated by “$LD -r”, some handling needs to be added to gold and to the backend driver to process each original module’s bitcode.
>>>> >>> >> >> >
>>>> >>> >> >> > The function index/summary will later be added as a special ELF section alongside the .llvmbc sections.
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > 2. Stage 2: ThinLTO Infrastructure
>>>> >>> >> >> > ----------------------------------------------
>>>> >>> >> >> >
>>>> >>> >> >> > The next set of patches adds the base implementation of the ThinLTO infrastructure, specifically the pieces required to make ThinLTO functional and generate correct, but not necessarily high-performing, binaries. It also does not include support for making debug info under -g efficient with ThinLTO.
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > a. Clang/LLVM/gold linker options:
>>>> >>> >> >> >
>>>> >>> >> >> > An early set of clang/llvm patches is needed to provide options to enable ThinLTO (off by default), so that the rest of the implementation can be disabled by default as it is added. Specifically, the clang option -fthinlto (used instead of -flto) will cause clang to invoke the phase-1 emission of LLVM bitcode and function summary/index on a compile step, and to pass the appropriate option to the gold plugin on a link step. The -thinlto option will be added to the gold plugin and the llvm-lto tool to launch the phase-2 thin archive step. The -thinlto option will also be added to the ‘opt’ tool to invoke it as a phase-3 parallel backend instance.
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > b. Thin-archive linking support in Gold plugin and llvm-lto:
>>>> >>> >> >> >
>>>> >>> >> >> > Under the new plugin option (see above), the plugin needs to
>>>> >>> >> >> > perform the phase-2 (thin archive) link, which simply emits a
>>>> >>> >> >> > combined function map from the linked modules without actually
>>>> >>> >> >> > performing the normal link. Corresponding support should be
>>>> >>> >> >> > added to the standalone llvm-lto tool to enable
>>>> >>> >> >> > testing/debugging without involving the linker and plugin.
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > c. ThinLTO backend support:
>>>> >>> >> >> >
>>>> >>> >> >> > Support for a phase-3 backend invocation (including importing)
>>>> >>> >> >> > on a module should be added to the ‘opt’ tool under the new
>>>> >>> >> >> > option. The main changes under the option are to instantiate a
>>>> >>> >> >> > Linker object used to manage the process of linking imported
>>>> >>> >> >> > functions into the module, to efficiently read the combined
>>>> >>> >> >> > function map, and to enable the ThinLTO import pass.
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > d. Function index/summary support:
>>>> >>> >> >> >
>>>> >>> >> >> > This includes infrastructure for writing and reading the
>>>> >>> >> >> > function index/summary section. As noted earlier, this will be
>>>> >>> >> >> > encoded in a special ELF section within the module, alongside
>>>> >>> >> >> > the .llvmbc section containing the bitcode. The thin archive
>>>> >>> >> >> > generated by phase-2 of ThinLTO simply contains all of the
>>>> >>> >> >> > function index/summary sections across the linked modules,
>>>> >>> >> >> > organized for efficient function lookup.
>>>> >>> >> >> >
>>>> >>> >> >> > Each function available for importing from the module has an
>>>> >>> >> >> > entry in the module’s function index/summary section and in the
>>>> >>> >> >> > resulting combined function map. Each function entry contains
>>>> >>> >> >> > that function’s offset within the bitcode file, used to
>>>> >>> >> >> > efficiently locate and import just that function. The entry
>>>> >>> >> >> > also contains summary information (e.g. basic information
>>>> >>> >> >> > determined during parsing, such as the number of instructions in
>>>> >>> >> >> > the function) that will be used to help guide later import
>>>> >>> >> >> > decisions. Because the contents of this section will change
>>>> >>> >> >> > frequently during ThinLTO tuning, it should also be marked with
>>>> >>> >> >> > a version id for backwards compatibility or version checking.
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > e. ThinLTO importing support:
>>>> >>> >> >> >
>>>> >>> >> >> > Support for the mechanics of importing functions from other
>>>> >>> >> >> > modules, which can go in gradually as a set of patches since it
>>>> >>> >> >> > will be off by default. Separate patches can include:
>>>> >>> >> >> >
>>>> >>> >> >> > - BitcodeReader changes to use the function index to
>>>> >>> >> >> > import/deserialize a single function of interest (small
>>>> >>> >> >> > changes, leverages existing lazy streamer support).
>>>> >>> >> >> >
>>>> >>> >> >> > - Minor LTOModule changes to pass the ThinLTO function to import
>>>> >>> >> >> > and its index into the bitcode reader.
>>>> >>> >> >> >
>>>> >>> >> >> > - Marking of imported functions (for use in ThinLTO-specific
>>>> >>> >> >> > symbol linking and global DCE, for example). This can be
>>>> >>> >> >> > in-memory initially, but IR support may be required in order to
>>>> >>> >> >> > support streaming bitcode out and back in again after importing.
>>>> >>> >> >> >
>>>> >>> >> >> > - ModuleLinker changes to do ThinLTO-specific symbol linking and
>>>> >>> >> >> > static promotion when necessary. The linkage type of imported
>>>> >>> >> >> > functions changes to AvailableExternallyLinkage, for example.
>>>> >>> >> >> > Statics must be promoted in certain cases, and renamed in
>>>> >>> >> >> > consistent ways.
>>>> >>> >> >> >
>>>> >>> >> >> > - GlobalDCE changes to support removing imported functions that
>>>> >>> >> >> > were not inlined (very small changes to existing pass logic).
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > f. ThinLTO Import Driver SCC pass:
>>>> >>> >> >> >
>>>> >>> >> >> > Adds Transforms/IPO/ThinLTO.cpp with a framework for doing
>>>> >>> >> >> > ThinLTO via an SCC pass, enabled only under the -fthinlto
>>>> >>> >> >> > option. The pass covers using the thin archive (global function
>>>> >>> >> >> > index/summary), applying import decision heuristics, invoking
>>>> >>> >> >> > the LTOModule/ModuleLinker routines that perform the import, and
>>>> >>> >> >> > any necessary callgraph updates and verification.
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > g. Backend Driver:
>>>> >>> >> >> >
>>>> >>> >> >> > For a single-node build, the gold plugin can simply write a
>>>> >>> >> >> > makefile and fork the parallel backend instances directly via
>>>> >>> >> >> > parallel make.
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > 3. Stage 3: ThinLTO Tuning and Enhancements
>>>> >>> >> >> > ----------------------------------------------------------------
>>>> >>> >> >> >
>>>> >>> >> >> > This refers to the patches that are not required for ThinLTO to
>>>> >>> >> >> > work, but rather improve compile time, memory usage, run-time
>>>> >>> >> >> > performance, and usability.
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > a. Lazy Debug Metadata Linking:
>>>> >>> >> >> >
>>>> >>> >> >> > The prototype implementation included lazy importing of
>>>> >>> >> >> > module-level metadata during the ThinLTO pass finalization (i.e.
>>>> >>> >> >> > after all function importing is complete). This actually
>>>> >>> >> >> > applies to all module-level metadata, not just debug metadata,
>>>> >>> >> >> > although debug metadata is the largest. This can be added as a
>>>> >>> >> >> > separate set of patches, with changes to BitcodeReader,
>>>> >>> >> >> > ValueMapper, and ModuleLinker.
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > b. Import Tuning:
>>>> >>> >> >> >
>>>> >>> >> >> > Tuning the import strategy will be an iterative process that
>>>> >>> >> >> > will continue to be refined over time. It involves several
>>>> >>> >> >> > different types of changes: adding support for recording
>>>> >>> >> >> > additional metrics in the function summary, such as profile
>>>> >>> >> >> > data and optional heavier-weight IPA analyses, and tuning the
>>>> >>> >> >> > import heuristics based on the summary and callsite context.
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > c. Combined Function Map Pruning:
>>>> >>> >> >> >
>>>> >>> >> >> > The combined function map can be pruned of functions that are
>>>> >>> >> >> > unlikely to benefit from being imported. For example, during
>>>> >>> >> >> > the phase-2 thin archive plugin step we can safely omit large
>>>> >>> >> >> > and (with profile data) cold functions, which are unlikely to
>>>> >>> >> >> > benefit from being inlined. Additionally, all but one copy of
>>>> >>> >> >> > comdat functions can be suppressed.
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > d. Distributed Build System Integration:
>>>> >>> >> >> >
>>>> >>> >> >> > For a distributed build system, the gold plugin should write the
>>>> >>> >> >> > parallel backend invocations into a makefile, including the
>>>> >>> >> >> > mapping from the IR file to the real object file path, and
>>>> >>> >> >> > exit. Additional work needs to be done in the distributed build
>>>> >>> >> >> > system itself to distribute and dispatch the parallel backend
>>>> >>> >> >> > jobs to the build cluster.
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > e. Dependence Tracking and Incremental Compiles:
>>>> >>> >> >> >
>>>> >>> >> >> > In order to support build systems that stage from local disks or
>>>> >>> >> >> > network storage, the plugin will optionally support computation
>>>> >>> >> >> > of dependent sets of IR files that each module may import from.
>>>> >>> >> >> > These sets can be computed from profile data, if it exists, or
>>>> >>> >> >> > from the symbol table and heuristics if not. These dependence
>>>> >>> >> >> > sets also enable support for incremental backend compiles.
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > --
>>>> >>> >> >> > Teresa Johnson | Software Engineer | [hidden email] | 408-460-2413
>>>> >>> >> >> >
>>>> >>> >> >>
>>>> >>> >> >
>>>> >>> >> >
>>>> >>> >>
>>>> >>> >>
>>>> >>> >>
>>>> >>> >> --
>>>> >>> >> Teresa Johnson | Software Engineer | [hidden email] | 408-460-2413
>>>> >>> >>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> --
>>>> >>> Teresa Johnson | Software Engineer | [hidden email] | 408-460-2413
>>>> >>
>>>> >>
>>>> >
>>>> >
>>>
>>>
>>>
>
>



--
Teresa Johnson | Software Engineer | [hidden email] | 408-460-2413

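Stage 2d and Stage 3c of the RFC quoted above describe a per-function entry holding the function's bitcode offset plus summary data such as its instruction count, a version id on the summary section, and pruning of the combined map (large or cold functions dropped, only one comdat copy kept). The following is a minimal C++ sketch of that idea, not the prototype's code: `FunctionSummary`, `SummarySection`, `CombinedFunctionMap`, `kSummaryVersion`, and the instruction-count threshold are hypothetical names and values, and nothing here reflects LLVM's actual on-disk format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical version id stamped on each summary section so that a reader
// can reject sections written by an incompatible producer.
constexpr uint32_t kSummaryVersion = 1;

// One entry per importable function (illustrative fields only).
struct FunctionSummary {
  std::string Name;        // function symbol name
  std::string ModulePath;  // IR file that defines the function
  uint64_t BitcodeOffset;  // where the function body starts in that file
  unsigned InstCount;      // summary info gathered during parsing
  bool IsCold;             // only meaningful when profile data exists
  bool IsComdat;           // comdat copies may appear in several modules
};

// A module's function index/summary section.
struct SummarySection {
  uint32_t Version;
  std::vector<FunctionSummary> Entries;
};

// Phase-2 step: merge per-module sections into one map keyed by function
// name, pruning entries that are unlikely to be worth importing.
class CombinedFunctionMap {
public:
  explicit CombinedFunctionMap(unsigned MaxInstCount)
      : MaxInsts(MaxInstCount) {}

  // Returns false, and adds nothing, if the section's version is unknown.
  bool addSection(const SummarySection &S) {
    if (S.Version != kSummaryVersion)
      return false;
    for (const FunctionSummary &F : S.Entries) {
      if (F.InstCount > MaxInsts || F.IsCold)
        continue;  // large or cold: unlikely to benefit from inlining
      if (F.IsComdat && Map.count(F.Name))
        continue;  // keep only the first copy of a comdat function
      Map.emplace(F.Name, F);
    }
    return true;
  }

  // Phase-3 lookup: where to find a function's body, or nullptr if the
  // function is not available for import.
  const FunctionSummary *lookup(const std::string &Name) const {
    auto It = Map.find(Name);
    return It == Map.end() ? nullptr : &It->second;
  }

  std::size_t size() const { return Map.size(); }

private:
  unsigned MaxInsts;
  std::unordered_map<std::string, FunctionSummary> Map;
};
```

For example, a map built with a threshold of 100 instructions from two modules keeps a small function and the first copy of a comdat function, while `lookup` returns `nullptr` for a function pruned as too large or cold.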
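Stage 2e of the RFC quoted in Teresa's message above says statics must be promoted in certain cases and renamed in consistent ways. One reason consistency matters: the module that defines a static and every module that imports a function referencing it are compiled by separate phase-3 backend processes, so each has to derive the same global name on its own. A minimal C++ sketch, assuming a hypothetical `<name>.llvm.<hash of defining module path>` scheme built on a 64-bit FNV-1a hash; the RFC does not specify a scheme, and `fnv1a64` and `promotedName` are illustrative helpers rather than LLVM APIs.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// 64-bit FNV-1a. Unlike std::hash, which is only required to be consistent
// within a single program run, this gives the same value in every process.
uint64_t fnv1a64(const std::string &S) {
  uint64_t H = 14695981039346656037ULL;  // FNV-64 offset basis
  for (unsigned char C : S) {
    H ^= C;
    H *= 1099511628211ULL;               // FNV-64 prime
  }
  return H;
}

// Hypothetical name for a module-local (static) symbol after promotion to
// global linkage. The defining module and each importing module call this
// with the same inputs, so they agree on the name without coordinating.
std::string promotedName(const std::string &LocalName,
                         const std::string &DefiningModulePath) {
  char Hex[17];  // 16 hex digits plus the terminating NUL
  std::snprintf(Hex, sizeof(Hex), "%016llx",
                static_cast<unsigned long long>(fnv1a64(DefiningModulePath)));
  return LocalName + ".llvm." + Hex;
}
```

With this scheme, `promotedName("helper", "lib/a.bc")` yields `helper.llvm.` followed by 16 hex digits, and every backend that evaluates it gets the same string.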

Re: RFC: ThinLTO Impementation Plan

Daniel Berlin
In reply to this post by Eric Christopher
On Thu, May 14, 2015 at 12:53 PM, Eric Christopher <[hidden email]> wrote:

>
>
> On Thu, May 14, 2015 at 11:34 AM Daniel Berlin <[hidden email]> wrote:
>>
>> On Thu, May 14, 2015 at 11:14 AM, Eric Christopher <[hidden email]> wrote:
>> > I'm not sure this is a particularly great assumption to make.
>>
>> Which part?
>
>
> The binutils part :)

I took it as the more general: "we want to simply work with native
toolchains", not as something specific to binutils.

>
>>
>>
>> > We have to support a lot of different build systems and tools and
>> > concentrating on something that just binutils uses isn't particularly
>> > friendly here.
>> I think you may have misunderstood.
>> His point was exactly that they want to be transparent to *all of* these
>> tools.
>> You are saying "we should be friendly to everyone". He is saying the same
>> thing.
>> We should be friendly to everyone. The friendly way to do this is to
>> not require all of these tools to have plugins to handle bitcode.
>>
>> Hence, elf-wrapped bitcode.
>
>
> Oh, I understood. I just don't know that I agree.

Fair enough. I just wanted to make sure there wasn't a misunderstanding here :)

> To do anything with the
> tools will require some knowledge of bitcode anyhow or need the plugin.

This is certainly true, but that's part of the point - the ability to
pass through native tools without them breaking, or worrying about
the bitcode there.

>  I'm
> saying that as a baseline start we should look at how to do this using the
> tools we've got rather than wrapping things for no real gain.

The gain is precisely: "People on different platforms do not have to
use all-llvm tools to have this build mode work".


>
> I've talked to Teresa a bit offline and we're going to talk more later (and
> discuss on the list), but there are some discussions about how to make this
> work either with just bitcode/llvm tools and so not requiring integration on
> all platforms. The latter is what I consider as particularly friendly :)

Sure, if you have a way to make this work that doesn't require
everyone in the world to replace ar with llvm-ar and ld with llvm-ld,
sounds awesome :)

(I actually have no real dog in this fight, just trying to make sure
everyone is on the same page ;P)
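The disagreement above turns on whether native tools such as ar, nm, and ld -r should see ordinary ELF objects or plain bitcode. As a small illustration of why plain bitcode is opaque to a tool without a plugin: each format is recognized by its leading magic bytes, and a tool that only understands ELF rejects anything else. A minimal C++ sketch (the `InputKind` enum, `startsWith`, and `classify` are hypothetical helpers) that checks the ELF magic `0x7F 'E' 'L' 'F'`, the raw LLVM bitcode magic `'B' 'C' 0xC0 0xDE`, and LLVM's bitcode wrapper header magic `0x0B17C0DE`, which is stored little-endian.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <vector>

enum class InputKind { ELF, RawBitcode, WrappedBitcode, Unknown };

// Return true if Buf begins with the given magic bytes.
bool startsWith(const std::vector<uint8_t> &Buf,
                std::initializer_list<uint8_t> Magic) {
  if (Buf.size() < Magic.size())
    return false;
  std::size_t I = 0;
  for (uint8_t B : Magic)
    if (Buf[I++] != B)
      return false;
  return true;
}

// Classify an input file by its leading magic bytes, roughly the first check
// a tool performs before deciding whether it can handle the file.
InputKind classify(const std::vector<uint8_t> &Buf) {
  if (startsWith(Buf, {0x7F, 'E', 'L', 'F'}))
    return InputKind::ELF;             // native object (possibly wrapping bitcode)
  if (startsWith(Buf, {'B', 'C', 0xC0, 0xDE}))
    return InputKind::RawBitcode;      // opaque to tools without bitcode support
  if (startsWith(Buf, {0xDE, 0xC0, 0x17, 0x0B}))
    return InputKind::WrappedBitcode;  // 0x0B17C0DE bitcode wrapper header
  return InputKind::Unknown;
}
```

An ELF-wrapped bitcode file is reported as `ELF` here, which is exactly what lets native tools accept it; finding the bitcode inside would additionally require locating the `.llvmbc` section, which this sketch does not attempt.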
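Stage 2g of the RFC, as quoted in Teresa's message earlier in this thread, proposes that for a single-node build the gold plugin write a makefile and run the phase-3 backends with parallel make; Stage 3d has it write the same invocations, including the IR-file-to-object mapping, for a distributed build system. A minimal C++ sketch of generating such a makefile follows. `BackendJob` and `writeBackendMakefile` are hypothetical, and the backend command is deliberately left as a make variable because the RFC does not give its exact command line.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// One phase-3 job: compile one module's IR (importing as needed) into the
// object file that the final link expects.
struct BackendJob {
  std::string IRFile;      // phase-1 output for one module
  std::string ObjectFile;  // real object file path used by the final link
};

// Emit a makefile with one rule per module. `make -jN` then runs the
// backends in parallel; a distributed build system could instead read the
// same rules and dispatch each job to a build cluster. The backend command
// is left as the make variable BACKEND because its real command line is not
// specified here.
std::string writeBackendMakefile(const std::vector<BackendJob> &Jobs,
                                 const std::string &CombinedMapPath) {
  std::ostringstream OS;
  OS << "COMBINED_MAP := " << CombinedMapPath << "\n";
  OS << "# Placeholder: set BACKEND to the phase-3 backend command.\n";
  OS << "BACKEND ?= false\n\n";
  OS << ".PHONY: all\nall:";
  for (const BackendJob &J : Jobs)
    OS << " " << J.ObjectFile;
  OS << "\n\n";
  for (const BackendJob &J : Jobs) {
    // Each object depends on its own IR and on the combined function map.
    OS << J.ObjectFile << ": " << J.IRFile << " $(COMBINED_MAP)\n";
    OS << "\t$(BACKEND) " << J.IRFile << " $(COMBINED_MAP) "
       << J.ObjectFile << "\n\n";
  }
  return OS.str();
}
```

Making every object depend on the whole combined map is simple but conservative: any change to the map reruns every backend. The per-module dependence sets described in Stage 3e would let each rule list only the IR files that module may import from, which is what incremental backend compiles need.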

Re: RFC: ThinLTO Impementation Plan

Eric Christopher
In reply to this post by Teresa Johnson


On Thu, May 14, 2015 at 1:35 PM Teresa Johnson <[hidden email]> wrote:
On Thu, May 14, 2015 at 1:18 PM, Eric Christopher <[hidden email]> wrote:
>
>
> On Thu, May 14, 2015 at 1:11 PM David Blaikie <[hidden email]> wrote:
>>
>> On Thu, May 14, 2015 at 12:53 PM, Eric Christopher <[hidden email]> wrote:
>>>
>>>
>>>
>>> On Thu, May 14, 2015 at 11:34 AM Daniel Berlin <[hidden email]> wrote:
>>>>
>>>> On Thu, May 14, 2015 at 11:14 AM, Eric Christopher <[hidden email]> wrote:
>>>> > I'm not sure this is a particularly great assumption to make.
>>>>
>>>> Which part?
>>>
>>>
>>> The binutils part :)
>>>
>>>>
>>>>
>>>> > We have to support a lot of different build systems and tools and
>>>> > concentrating on something that just binutils uses isn't
>>>> > particularly friendly here.
>>>> I think you may have misunderstood.
>>>> His point was exactly that they want to be transparent to *all of*
>>>> these tools.
>>>> You are saying "we should be friendly to everyone". He is saying the
>>>> same thing.
>>>> We should be friendly to everyone. The friendly way to do this is to
>>>> not require all of these tools to have plugins to handle bitcode.
>>>>
>>>> Hence, elf-wrapped bitcode.
>>>
>>>
>>> Oh, I understood. I just don't know that I agree. To do anything with the
>>> tools will require some knowledge of bitcode anyhow or need the plugin. I'm
>>> saying that as a baseline start we should look at how to do this using the
>>> tools we've got rather than wrapping things for no real gain.
>>
>>
>> That doesn't seem strictly true - the ar situation (which I'm led to
>> believe is in use in our build system & others, one would assume). With the
>> symbol table included as proposed, ar can be used without any knowledge of
>> the bitcode or need for a plugin.
>>
>
> For some bits, sure. Optimizing for ar seems a bit silly; why not 'ld -r'?

But as mentioned, ld -r can work on native object wrapped bitcode
without a plugin as well.


How? It's not like any partial linking is going to go on inside the bitcode if the linker doesn't understand bitcode.
 
> Agreed. The ar situation is interesting because one thing we discussed after
> you wandered off was just adding a ToC section to bitcode as it is and then
> having the tools handle that. Would seem to accomplish at least the goals as
> I've seen them up to this point without worrying too much.

The ToC section is a way we can encode the function index/summary into
bitcode, but won't help integrate with existing tools. The main issue
we are trying to solve is integrating transparently with existing
binutils tools in use in our build system and probably elsewhere.


Right. I'm not entirely sure what use we're going to see in the existing tools that we want to encompass here. There's some of it for convenience (e.g. nm for developers), but they can use a tool that understands bitcode, and we can make the existing llvm tools suffice for these needs.
 
I think the way of looking at this is that we can:

a) go with wrapping things in native object formats, which means
 - some tools continue to work at the cost of additional I/O and space at compile/link time
 - we still have to update some tools to work at all
 
b) we extend those tools/our own tools and have them be drop in replacements to the existing tools. They'll understand the bitcode format natively, they'll be smaller, and we'll be able to push the state of the art in tooling/analysis a bit more in the future without having to rework thin lto.

It's basically a set of trade-offs and for llvm we've historically gone the b direction.

>
> At any rate, I think this aspect of the proposal needs a bit of discussion
> and some mapping out of the pros and cons here.

Sure, we can continue to discuss and I will try to lay out the pros/cons.

Excellent.

-eric
 

Teresa

>
> -eric
>
>>>
>>> I've talked to Teresa a bit offline and we're going to talk more later
>>> (and discuss on the list), but there are some discussions about how to make
>>> this work either with just bitcode/llvm tools and so not requiring
>>> integration on all platforms. The latter is what I consider as particularly
>>> friendly :)
>>>
>>> -eric
>>>
>>>>
>>>>
>>>>
>>>> > I also can't imagine how it's necessary for any of the lto aspects
>>>> > as currently written in the proposal.
>>>> >
>>>> > -eric
>>>> >
>>>> > On Thu, May 14, 2015 at 9:26 AM Xinliang David Li <[hidden email]> wrote:
>>>> >>
>>>> >> The design objective is to make ThinLTO mostly transparent to binutils
>>>> >> tools to enable easy integration with any build system in the wild.
>>>> >> 'Pass-through' mode with 'ld -r' instead of the partial LTO mode is
>>>> >> another reason.
>>>> >>
>>>> >> David
>>>> >>
>>>> >> On Thu, May 14, 2015 at 7:30 AM, Teresa Johnson <[hidden email]> wrote:
>>>> >>>
>>>> >>> On Thu, May 14, 2015 at 7:22 AM, Eric Christopher <[hidden email]> wrote:
>>>> >>> > So, what Alex is saying is that we have these tools as well and they
>>>> >>> > understand bitcode just fine, as well as every object format - not
>>>> >>> > just ELF. :)
>>>> >>>
>>>> >>> Right, there are also LLVM-specific versions (llvm-ar, llvm-nm) that
>>>> >>> handle bitcode similarly to the way the standard tool + plugin does.
>>>> >>> But the goal we are trying to achieve is to allow the standard system
>>>> >>> versions of the tools to handle these files without requiring a
>>>> >>> plugin. I know the LLVM tools handle other object formats, but I'm not
>>>> >>> sure how that helps here. We're not planning to replace those tools,
>>>> >>> just to allow the standard system versions to handle the intermediate
>>>> >>> objects produced by ThinLTO.
>>>> >>>
>>>> >>> Thanks,
>>>> >>> Teresa
>>>> >>>
>>>> >>> >
>>>> >>> > -eric
>>>> >>> >
>>>> >>> >
>>>> >>> > On Thu, May 14, 2015, 6:55 AM Teresa Johnson <[hidden email]> wrote:
>>>> >>> >>
>>>> >>> >> On Wed, May 13, 2015 at 11:23 PM, Xinliang David Li <[hidden email]> wrote:
>>>> >>> >> >
>>>> >>> >> >
>>>> >>> >> > On Wed, May 13, 2015 at 10:46 PM, Alex Rosenberg <[hidden email]> wrote:
>>>> >>> >> >>
>>>> >>> >> >> "ELF-wrapped bitcode" seems potentially controversial to me.
>>>> >>> >> >>
>>>> >>> >> >> What about ar, nm, and various ld implementations adds this requirement?
>>>> >>> >> >> What about the LLVM implementations of these tools is lacking?
>>>> >>> >> >
>>>> >>> >> >
>>>> >>> >> > Sorry, I cannot parse your questions properly. Can you make them
>>>> >>> >> > clearer?
>>>> >>> >>
>>>> >>> >> Alex is asking what the issue is with ar, nm, ld -r and regular
>>>> >>> >> bitcode that makes using elf-wrapped bitcode easier.
>>>> >>> >>
>>>> >>> >> The issue is that generally you need to provide a plugin to these
>>>> >>> >> tools in order for them to understand and handle bitcode files. We'd
>>>> >>> >> like standard tools to work without requiring a plugin as much as
>>>> >>> >> possible. And in some cases we want them to be handled differently
>>>> >>> >> from the way bitcode files are handled with the plugin.
>>>> >>> >>
>>>> >>> >> nm: Without a plugin, normal bitcode files are inscrutable. When
>>>> >>> >> provided the gold plugin, it can emit the symbols.
>>>> >>> >>
>>>> >>> >> ar: Without a plugin, it will create an archive of bitcode files,
>>>> >>> >> but without an index, so it can't be handled by the linker even with
>>>> >>> >> a plugin on an -flto link. When ar is provided the gold plugin, it
>>>> >>> >> does create an index, so the linker + gold plugin handle it
>>>> >>> >> appropriately on an -flto link.
>>>> >>> >>
>>>> >>> >> ld -r: Without a plugin, fails when provided bitcode inputs. When
>>>> >>> >> provided the gold plugin, it handles them but compiles them all the
>>>> >>> >> way through to ELF executable instructions via a partial LTO link.
>>>> >>> >> This is where we would like to differ in behavior (while also not
>>>> >>> >> requiring a plugin) with ELF-wrapped bitcode: we would like the
>>>> >>> >> ld -r output file to still contain ELF-wrapped bitcode, delaying
>>>> >>> >> the LTO until the full link step.
>>>> >>> >>
>>>> >>> >> Let me know if that helps address your concerns.
>>>> >>> >>
>>>> >>> >> Thanks,
>>>> >>> >> Teresa
>>>> >>> >>
>>>> >>> >> >
>>>> >>> >> > David
>>>> >>> >> >
>>>> >>> >> >>
>>>> >>> >> >>
>>>> >>> >> >> Alex
>>>> >>> >> >>
>>>> >>> >> >> > On May 13, 2015, at 7:44 PM, Teresa Johnson <[hidden email]> wrote:
>>>> >>> >> >> >
>>>> >>> >> >> > I've included below an RFC for implementing ThinLTO in LLVM,
>>>> >>> >> >> > looking forward to feedback and questions.
>>>> >>> >> >> > Thanks!
>>>> >>> >> >> > Teresa
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > RFC to discuss plans for implementing ThinLTO upstream.
>>>> >>> >> >> > Background can be found in slides from EuroLLVM 2015:
>>>> >>> >> >> >    https://drive.google.com/open?id=0B036uwnWM6RWWER1ZEl5SUNENjQ&authuser=0)
>>>> >>> >> >> > As described in the talk, we have a prototype implementation,
>>>> >>> >> >> > and would like to start staging patches upstream. This RFC
>>>> >>> >> >> > describes a breakdown of the major pieces. We would like to
>>>> >>> >> >> > commit upstream gradually in several stages, with all
>>>> >>> >> >> > functionality off by default. The core ThinLTO importing
>>>> >>> >> >> > support and tuning will require frequent change and iteration
>>>> >>> >> >> > during testing and tuning, and for that part we would like to
>>>> >>> >> >> > commit rapidly (off by default). See the proposed staged
>>>> >>> >> >> > implementation described in the Implementation Plan section.
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > ThinLTO Overview
>>>> >>> >> >> > ==============
>>>> >>> >> >> >
>>>> >>> >> >> > See the talk slides linked above for more details. The
>>>> >>> >> >> > following is a high-level overview of the motivation.
>>>> >>> >> >> >
>>>> >>> >> >> > Cross Module Optimization (CMO) is an effective means for
>>>> >>> >> >> > improving runtime performance, by extending the scope of
>>>> >>> >> >> > optimizations across source module boundaries. Without CMO, the
>>>> >>> >> >> > compiler is limited to optimizing within the scope of single
>>>> >>> >> >> > source modules. Two solutions for enabling CMO are Link-Time
>>>> >>> >> >> > Optimization (LTO), which is currently supported in LLVM and
>>>> >>> >> >> > GCC, and Lightweight-Interprocedural Optimization (LIPO).
>>>> >>> >> >> > However, each of these solutions has limitations that prevent
>>>> >>> >> >> > it from being enabled by default. ThinLTO is a new approach that
>>>> >>> >> >> > attempts to address these limitations, with a goal of being
>>>> >>> >> >> > enabled more broadly. ThinLTO is designed with many of the same
>>>> >>> >> >> > principles as LIPO, and therefore shares its advantages, without
>>>> >>> >> >> > any of its inherent weaknesses. Unlike in LIPO, where the module
>>>> >>> >> >> > group decision is made at profile training runtime, ThinLTO
>>>> >>> >> >> > makes the decision at compile time, but in a lazy mode that
>>>> >>> >> >> > facilitates large scale parallelism. The serial linker plugin
>>>> >>> >> >> > phase is designed to be razor thin and blazingly fast. By
>>>> >>> >> >> > default this step only does minimal preparation work to enable
>>>> >>> >> >> > the parallel lazy importing performed later. ThinLTO aims to be
>>>> >>> >> >> > scalable like a regular O2 build, enabling CMO on machines
>>>> >>> >> >> > without large memory configurations, while also integrating
>>>> >>> >> >> > well with distributed build systems. Results from early
>>>> >>> >> >> > prototyping on SPEC cpu2006 C++ benchmarks are in line with
>>>> >>> >> >> > expectations that ThinLTO can scale like O2 while enabling much
>>>> >>> >> >> > of the CMO performed during a full LTO build.
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > A ThinLTO build is divided into 3 phases, which are referred to
>>>> >>> >> >> > in the following implementation plan:
>>>> >>> >> >> >
>>>> >>> >> >> > phase-1: IR and Function Summary Generation (-c compile)
>>>> >>> >> >> > phase-2: Thin Linker Plugin Layer (thin archive linker step)
>>>> >>> >> >> > phase-3: Parallel Backend with Demand-Driven Importing
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > Implementation Plan
>>>> >>> >> >> > ================
>>>> >>> >> >> >
>>>> >>> >> >> > This section gives a high-level breakdown of the ThinLTO
>>>> >>> >> >> > support that will be added, in roughly the order that the
>>>> >>> >> >> > patches would be staged. The patches are divided into three
>>>> >>> >> >> > stages. The first stage contains a minimal amount of
>>>> >>> >> >> > preparation work that is not ThinLTO-specific. The second stage
>>>> >>> >> >> > contains most of the infrastructure for ThinLTO, which will be
>>>> >>> >> >> > off by default. The third stage includes
>>>> >>> >> >> > enhancements/improvements/tunings that can be performed after
>>>> >>> >> >> > the main ThinLTO infrastructure is in.
>>>> >>> >> >> >
>>>> >>> >> >> > The second and third implementation stages will initially be
>>>> >>> >> >> > very volatile, requiring a lot of iterations and tuning with
>>>> >>> >> >> > large apps to get stabilized. Therefore, it will be important to
>>>> >>> >> >> > do fast commits for these implementation stages.
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > 1. Stage 1: Preparation
>>>> >>> >> >> > -------------------------------
>>>> >>> >> >> >
>>>> >>> >> >> > The first planned sets of patches are enablers for ThinLTO work:
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > a. LTO directory structure:
>>>> >>> >> >> >
>>>> >>> >> >> > Restructure the LTO directory to remove a circular dependence
>>>> >>> >> >> > once the ThinLTO pass is added. Because ThinLTO is being
>>>> >>> >> >> > implemented as an SCC pass within Transforms/IPO, and leverages
>>>> >>> >> >> > the LTOModule class for linking in functions from modules, IPO
>>>> >>> >> >> > then requires the LTO library. This creates a circular
>>>> >>> >> >> > dependence between LTO and IPO. To break that, we need to split
>>>> >>> >> >> > the lib/LTO directory/library into lib/LTO/CodeGen and
>>>> >>> >> >> > lib/LTO/Module, containing LTOCodeGenerator and LTOModule,
>>>> >>> >> >> > respectively. Only LTOCodeGenerator has a dependence on IPO,
>>>> >>> >> >> > removing the circular dependence.
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > b. ELF wrapper generation support:
>>>> >>> >> >> >
>>>> >>> >> >> > Implement an ELF-wrapped bitcode writer. In order to more easily
>>>> >>> >> >> > interact with tools such as $AR, $NM, and “$LD -r”, we plan to
>>>> >>> >> >> > emit the phase-1 bitcode wrapped in ELF via the .llvmbc section,
>>>> >>> >> >> > along with a symbol table. The goal is both to interact with
>>>> >>> >> >> > these tools without requiring a plugin, and also to avoid doing
>>>> >>> >> >> > partial LTO/ThinLTO across files linked with “$LD -r” (i.e. the
>>>> >>> >> >> > resulting object file should still contain ELF-wrapped bitcode
>>>> >>> >> >> > to enable ThinLTO at the full link step). I will send a separate
>>>> >>> >> >> > design document for these changes, but the following is a
>>>> >>> >> >> > high-level overview.
>>>> >>> >> >> >
>>>> >>> >> >> > Support was added to LLVM for reading ELF-wrapped bitcode
>>>> >>> >> >> > (http://reviews.llvm.org/rL218078), but LLVM/Clang does not yet
>>>> >>> >> >> > support emitting bitcode wrapped in ELF. I plan to add support
>>>> >>> >> >> > for optionally generating bitcode in an ELF file containing a
>>>> >>> >> >> > single .llvmbc section holding the bitcode. Specifically, the
>>>> >>> >> >> > patch would add new options “emit-llvm-bc-elf” (object file) and
>>>> >>> >> >> > the corresponding “emit-llvm-elf” (textual assembly code
>>>> >>> >> >> > equivalent). Eventually these would be automatically triggered
>>>> >>> >> >> > under “-fthinlto -c” and “-fthinlto -S”, respectively.
>>>> >>> >> >> >
>>>> >>> >> >> > Additionally, a symbol table will be generated in the ELF file,
>>>> >>> >> >> > holding the function symbols within the bitcode. This
>>>> >>> >> >> > facilitates handling archives of the ELF-wrapped bitcode created
>>>> >>> >> >> > with $AR, since the archive will have a symbol table as well.
>>>> >>> >> >> > The archive symbol table enables gold to extract the constituent
>>>> >>> >> >> > ELF-wrapped bitcode files and pass them to the plugin. To
>>>> >>> >> >> > support the concatenated .llvmbc section generated by “$LD -r”,
>>>> >>> >> >> > some handling needs to be added to gold and to the backend
>>>> >>> >> >> > driver to process each original module’s bitcode.
>>>> >>> >> >> >
>>>> >>> >> >> > The function index/summary will later be added as a special ELF
>>>> >>> >> >> > section alongside the .llvmbc sections.
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > 2. Stage 2: ThinLTO Infrastructure
>>>> >>> >> >> > ----------------------------------------------
>>>> >>> >> >> >
>>>> >>> >> >> > The next set of patches adds the base implementation of the
>>>> >>> >> >> > ThinLTO infrastructure, specifically the pieces required to make
>>>> >>> >> >> > ThinLTO functional and generate correct but not necessarily
>>>> >>> >> >> > high-performing binaries. It does not include support for
>>>> >>> >> >> > making debug info under -g efficient with ThinLTO.
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > a. Clang/LLVM/gold linker options:
>>>> >>> >> >> >
>>>> >>> >> >> > An early set of clang/llvm patches is needed to provide options
>>>> >>> >> >> > to enable ThinLTO (off by default), so that the rest of the
>>>> >>> >> >> > implementation can be disabled by default as it is added.
>>>> >>> >> >> > Specifically, the clang option -fthinlto (used instead of -flto)
>>>> >>> >> >> > will cause clang to invoke the phase-1 emission of LLVM bitcode
>>>> >>> >> >> > and the function summary/index on a compile step, and to pass
>>>> >>> >> >> > the appropriate option to the gold plugin on a link step. The
>>>> >>> >> >> > -thinlto option will be added to the gold plugin and the
>>>> >>> >> >> > llvm-lto tool to launch the phase-2 thin archive step. The
>>>> >>> >> >> > -thinlto option will also be added to the ‘opt’ tool to invoke
>>>> >>> >> >> > it as a phase-3 parallel backend instance.
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > b. Thin-archive linking support in Gold plugin and llvm-lto:
>>>> >>> >> >> >
>>>> >>> >> >> > Under the new plugin option (see above), the plugin needs to
>>>> >>> >> >> > perform the phase-2 (thin archive) link, which simply emits a
>>>> >>> >> >> > combined function map from the linked modules without actually
>>>> >>> >> >> > performing the normal link. Corresponding support should be
>>>> >>> >> >> > added to the standalone llvm-lto tool to enable testing/debugging
>>>> >>> >> >> > without involving the linker and plugin.
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > c. ThinLTO backend support:
>>>> >>> >> >> >
>>>> >>> >> >> > Support for invoking a phase-3 backend (including importing) on
>>>> >>> >> >> > a module should be added to the ‘opt’ tool under the new
>>>> >>> >> >> > option. The main changes under the option are to instantiate a
>>>> >>> >> >> > Linker object that manages linking imported functions into the
>>>> >>> >> >> > module, to efficiently read the combined function map, and to
>>>> >>> >> >> > enable the ThinLTO import pass.
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > d. Function index/summary support:
>>>> >>> >> >> >
>>>> >>> >> >> > This includes infrastructure for writing and reading the
>>>> >>> >> >> > function index/summary section. As noted earlier, this will be
>>>> >>> >> >> > encoded in a special ELF section within the module, alongside
>>>> >>> >> >> > the .llvmbc section containing the bitcode. The thin archive
>>>> >>> >> >> > generated by phase-2 of ThinLTO simply contains all of the
>>>> >>> >> >> > function index/summary sections across the linked modules,
>>>> >>> >> >> > organized for efficient function lookup.
>>>> >>> >> >> >
>>>> >>> >> >> > Each function available for importing from the module has an
>>>> >>> >> >> > entry in the module’s function index/summary section and in the
>>>> >>> >> >> > resulting combined function map. Each function entry contains
>>>> >>> >> >> > that function’s offset within the bitcode file, used to
>>>> >>> >> >> > efficiently locate and import just that function. The entry
>>>> >>> >> >> > also contains summary information (e.g. basic information
>>>> >>> >> >> > determined during parsing, such as the number of instructions in
>>>> >>> >> >> > the function) that will be used to help guide later import
>>>> >>> >> >> > decisions. Because the contents of this section will change
>>>> >>> >> >> > frequently during ThinLTO tuning, it should also be marked with
>>>> >>> >> >> > a version id for backwards compatibility or version checking.
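A minimal sketch of the data described above: a per-module index entry holding the function's bitcode offset plus one summary field (instruction count), and a combined function map built from all modules' indexes in phase 2, with the version id check. All names (FunctionSummary, ModuleIndex, CombinedFunctionMap, kIndexVersion) and the choice of a hash map keyed by function name are illustrative assumptions, not the prototype's actual format.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical version id; would be bumped whenever the summary layout changes.
constexpr uint32_t kIndexVersion = 1;

struct FunctionSummary {
  uint64_t BitcodeOffset; // where the function body starts in the bitcode
  uint32_t InstCount;     // example summary info gathered during parsing
};

// One module's function index/summary section.
struct ModuleIndex {
  uint32_t Version;
  std::string ModulePath;
  std::unordered_map<std::string, FunctionSummary> Functions;
};

// A combined-map entry also records which module defines the function.
struct CombinedEntry {
  std::string ModulePath;
  FunctionSummary Summary;
};

class CombinedFunctionMap {
public:
  // Merges one module's index; rejects an index written with another version.
  // emplace() does not overwrite, so the first definition seen is kept.
  bool addModule(const ModuleIndex &Index) {
    if (Index.Version != kIndexVersion)
      return false;
    for (const auto &[Name, Summary] : Index.Functions)
      Map.emplace(Name, CombinedEntry{Index.ModulePath, Summary});
    return true;
  }

  // Looks up an import candidate; empty if no module offers it.
  std::optional<CombinedEntry> lookup(const std::string &Name) const {
    auto It = Map.find(Name);
    if (It == Map.end())
      return std::nullopt;
    return It->second;
  }

private:
  std::unordered_map<std::string, CombinedEntry> Map;
};
```

The phase-3 backend would use the returned ModulePath and BitcodeOffset to seek directly to the one function it wants to import.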
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > e. ThinLTO importing support:
>>>> >>> >> >> >
>>>> >>> >> >> > Support for the mechanics of importing functions from other
>>>> >>> >> >> > modules. This can go in gradually as a set of patches, since it
>>>> >>> >> >> > will be off by default. Separate patches can include:
>>>> >>> >> >> >
>>>> >>> >> >> > - BitcodeReader changes to use the function index to
>>>> >>> >> >> > import/deserialize a single function of interest (small changes,
>>>> >>> >> >> > leverages existing lazy streamer support).
>>>> >>> >> >> >
>>>> >>> >> >> > - Minor LTOModule changes to pass the ThinLTO function to import
>>>> >>> >> >> > and its index into the bitcode reader.
>>>> >>> >> >> >
>>>> >>> >> >> > - Marking of imported functions (for use in ThinLTO-specific
>>>> >>> >> >> > symbol linking and global DCE, for example). This can be
>>>> >>> >> >> > in-memory initially, but IR support may be required in order to
>>>> >>> >> >> > support streaming bitcode out and back in again after importing.
>>>> >>> >> >> >
>>>> >>> >> >> > - ModuleLinker changes to do ThinLTO-specific symbol linking and
>>>> >>> >> >> > static promotion when necessary. The linkage type of imported
>>>> >>> >> >> > functions changes to AvailableExternallyLinkage, for example.
>>>> >>> >> >> > Statics must be promoted in certain cases, and renamed in
>>>> >>> >> >> > consistent ways.
>>>> >>> >> >> >
>>>> >>> >> >> > - GlobalDCE changes to support removing imported functions that
>>>> >>> >> >> > were not inlined (very small changes to existing pass logic).
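The "promoted and renamed in consistent ways" requirement can be sketched as follows: a static referenced across modules gets external linkage under a deterministic name in its defining module, and the importing module refers to that same name while giving its imported copy available_externally linkage. The `<name>.llvm.<moduleid>` scheme, the per-module id, and every helper name here are hypothetical, chosen only for illustration.

```cpp
#include <string>

enum class Linkage { External, Internal, AvailableExternally };

struct GlobalSym {
  std::string Name;
  Linkage Link;
};

// Hypothetical naming scheme: the same (name, module id) pair always yields
// the same promoted name, so defining and importing modules agree on it.
std::string promotedName(const std::string &LocalName,
                         const std::string &ModuleId) {
  return LocalName + ".llvm." + ModuleId;
}

// In the defining module. Assumes the caller has already decided that this
// static must be promoted (e.g. an importable function references it).
GlobalSym promoteInDefiningModule(const GlobalSym &S,
                                  const std::string &ModuleId) {
  if (S.Link != Linkage::Internal)
    return S; // only statics need promotion
  return {promotedName(S.Name, ModuleId), Linkage::External};
}

// In the importing module: the imported copy is available_externally, so it
// can be inlined but is never emitted; a static gets its promoted name.
GlobalSym importFunction(const GlobalSym &S, const std::string &SrcModuleId) {
  std::string Name = S.Link == Linkage::Internal
                         ? promotedName(S.Name, SrcModuleId)
                         : S.Name;
  return {Name, Linkage::AvailableExternally};
}
```

Because both sides derive the name from the same inputs, no extra communication between backend instances is needed to agree on it.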
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > f. ThinLTO Import Driver SCC pass:
>>>> >>> >> >> >
>>>> >>> >> >> > Adds Transforms/IPO/ThinLTO.cpp with a framework for doing ThinLTO
>>>> >>> >> >> > via an SCC pass, enabled only under -fthinlto options. The pass
>>>> >>> >> >> > is responsible for using the thin archive (global function
>>>> >>> >> >> > index/summary), applying the import decision heuristics,
>>>> >>> >> >> > invoking the LTOModule/ModuleLinker routines that perform the
>>>> >>> >> >> > import, and making any necessary callgraph updates and
>>>> >>> >> >> > verification.
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > g. Backend Driver:
>>>> >>> >> >> >
>>>> >>> >> >> > For a single-node build, the gold plugin can simply write a
>>>> >>> >> >> > makefile and fork the parallel backend instances directly via
>>>> >>> >> >> > parallel make.
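A minimal sketch of the single-node backend driver: emit one makefile rule per module, each running a phase-3 backend, with an `all` target depending on every output so that parallel make supplies the parallelism. The command prefix is passed in by the caller because the exact phase-3 arguments (such as how the combined function map is located) are not specified here; the struct and function names are hypothetical.

```cpp
#include <sstream>
#include <string>
#include <vector>

struct BackendJob {
  std::string InputBitcode; // module's IR file from phase 1
  std::string OutputObject; // output of the phase-3 backend
};

// Builds makefile text with one rule per backend job. BackendCmd is the
// command prefix, e.g. the proposed phase-3 'opt -thinlto' invocation plus
// whatever arguments it needs.
std::string writeBackendMakefile(const std::vector<BackendJob> &Jobs,
                                 const std::string &BackendCmd) {
  std::ostringstream OS;
  OS << "all:";
  for (const BackendJob &J : Jobs)
    OS << ' ' << J.OutputObject;
  OS << "\n\n";
  for (const BackendJob &J : Jobs) {
    OS << J.OutputObject << ": " << J.InputBitcode << '\n';
    // Recipe lines in a makefile must start with a tab.
    OS << '\t' << BackendCmd << ' ' << J.InputBitcode << " -o "
       << J.OutputObject << '\n';
  }
  return OS.str();
}
```

The plugin would write this text to a file and run something like `make -j8 -f <file> all`, then link the resulting outputs once make returns.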
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > 3. Stage 3: ThinLTO Tuning and Enhancements
>>>> >>> >> >> > ----------------------------------------------------------------
>>>> >>> >> >> >
>>>> >>> >> >> > This refers to the patches that are not required for ThinLTO to
>>>> >>> >> >> > work, but rather improve compile time, memory, run-time
>>>> >>> >> >> > performance, and usability.
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > a. Lazy Debug Metadata Linking:
>>>> >>> >> >> >
>>>> >>> >> >> > The prototype implementation included lazy importing of
>>>> >>> >> >> > module-level metadata during the ThinLTO pass finalization (i.e.
>>>> >>> >> >> > after all function importing is complete). This actually applies
>>>> >>> >> >> > to all module-level metadata, not just debug metadata, although
>>>> >>> >> >> > debug metadata is the largest. This can be added as a separate
>>>> >>> >> >> > set of patches, with changes to BitcodeReader, ValueMapper, and
>>>> >>> >> >> > ModuleLinker.
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > b. Import Tuning:
>>>> >>> >> >> >
>>>> >>> >> >> > Tuning the import strategy will be an iterative process that
>>>> >>> >> >> > will continue to be refined over time. It involves several
>>>> >>> >> >> > different types of changes: adding support for recording
>>>> >>> >> >> > additional metrics in the function summary, such as profile data
>>>> >>> >> >> > and optional heavier-weight IPA analyses, and tuning the import
>>>> >>> >> >> > heuristics based on the summary and callsite context.
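As a starting point for that tuning, an import decision can be sketched as a size threshold on the summary's instruction count, relaxed at hot call sites and suppressed at never-executed ones when profile data is available. The thresholds, the struct fields, and the function name are all hypothetical placeholders, not values from the prototype.

```cpp
#include <cstdint>
#include <optional>

struct Summary {
  uint32_t InstCount;
};

struct CallsiteContext {
  std::optional<uint64_t> ProfileCount; // empty when no profile is available
};

// Hypothetical tuning knobs.
constexpr uint32_t kBaseImportLimit = 100;
constexpr uint32_t kHotMultiplier = 4;
constexpr uint64_t kHotCallCount = 1000;

bool shouldImport(const Summary &S, const CallsiteContext &C) {
  uint32_t Limit = kBaseImportLimit;
  if (C.ProfileCount) {
    if (*C.ProfileCount == 0)
      return false; // never executed in training: not worth importing
    if (*C.ProfileCount >= kHotCallCount)
      Limit *= kHotMultiplier; // allow larger callees at hot call sites
  }
  return S.InstCount <= Limit;
}
```

Keeping the decision in one small function like this makes it straightforward to swap in richer summary fields as they are added.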
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > c. Combined Function Map Pruning:
>>>> >>> >> >> >
>>>> >>> >> >> > The combined function map can be pruned of functions that are
>>>> >>> >> >> > unlikely to benefit from being imported. For example, during the
>>>> >>> >> >> > phase-2 thin archive plugin step we can safely omit large and
>>>> >>> >> >> > (with profile data) cold functions, which are unlikely to benefit
>>>> >>> >> >> > from being inlined. Additionally, all but one copy of comdat
>>>> >>> >> >> > functions can be suppressed.
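The pruning rules above can be sketched as a filter over combined-map candidates: drop functions above a size limit, drop functions that profile data shows were never entered, and keep only the first copy of each comdat. The thresholds, the Candidate fields, and the function name are hypothetical.

```cpp
#include <cstdint>
#include <optional>
#include <set>
#include <string>
#include <vector>

struct Candidate {
  std::string Name;
  uint32_t InstCount;
  std::optional<uint64_t> EntryCount; // profile entry count, if available
  std::string Comdat;                 // empty if not in a comdat
};

// Hypothetical thresholds.
constexpr uint32_t kMaxImportSize = 500;
constexpr uint64_t kColdEntryCount = 1; // below this means never entered

std::vector<Candidate> pruneCombinedMap(const std::vector<Candidate> &In) {
  std::vector<Candidate> Out;
  std::set<std::string> SeenComdats;
  for (const Candidate &C : In) {
    if (C.InstCount > kMaxImportSize)
      continue; // too large to be a good inlining candidate
    if (C.EntryCount && *C.EntryCount < kColdEntryCount)
      continue; // cold according to profile data
    if (!C.Comdat.empty() && !SeenComdats.insert(C.Comdat).second)
      continue; // a copy of this comdat was already kept
    Out.push_back(C);
  }
  return Out;
}
```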
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > d. Distributed Build System Integration:
>>>> >>> >> >> >
>>>> >>> >> >> > For a distributed build system, the gold plugin should write the
>>>> >>> >> >> > parallel backend invocations into a makefile, including the
>>>> >>> >> >> > mapping from the IR file to the real object file path, and exit.
>>>> >>> >> >> > Additional work needs to be done in the distributed build system
>>>> >>> >> >> > itself to distribute and dispatch the parallel backend jobs to
>>>> >>> >> >> > the build cluster.
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > e. Dependence Tracking and Incremental Compiles:
>>>> >>> >> >> >
>>>> >>> >> >> > In order to support build systems that stage from local disks or
>>>> >>> >> >> > network storage, the plugin will optionally support computation
>>>> >>> >> >> > of dependent sets of IR files that each module may import from.
>>>> >>> >> >> > These sets can be computed from profile data, if it exists, or
>>>> >>> >> >> > from the symbol table and heuristics if not. These dependence
>>>> >>> >> >> > sets also enable support for incremental backend compiles.
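For the no-profile case, one way to derive a module's dependence set from symbol tables is to look up, for each external function the module references, the IR file that defines it in the combined function map. This is a minimal sketch under that assumption; the data layout and function name are hypothetical, and a real implementation would likely also apply the import heuristics to narrow the set.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Function name -> path of the IR file that defines it (from the combined map).
using DefiningModuleMap = std::map<std::string, std::string>;

// Returns the IR files that Module may import from, given the external
// functions it references. These are the files a build system would need to
// stage for this module's backend, and the ones whose changes should trigger
// an incremental recompile of it.
std::set<std::string>
dependenceSet(const std::string &Module,
              const std::vector<std::string> &ReferencedFunctions,
              const DefiningModuleMap &Defs) {
  std::set<std::string> Deps;
  for (const std::string &F : ReferencedFunctions) {
    auto It = Defs.find(F);
    // Skip functions with no IR definition (e.g. libc) and self-references.
    if (It != Defs.end() && It->second != Module)
      Deps.insert(It->second);
  }
  return Deps;
}
```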
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > --
>>>> >>> >> >> > Teresa Johnson | Software Engineer | [hidden email] | 408-460-2413
>>>> >>> >> >> >
>>>> >>> >> >> > _______________________________________________
>>>> >>> >> >> > LLVM Developers mailing list
>>>> >>> >> >> > [hidden email]         http://llvm.cs.uiuc.edu
>>>> >>> >> >> > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>>> >>> >> >>
>>>> >>> >> >
>>>> >>> >> >
>>>> >>> >>
>>>> >>> >>
>>>> >>> >>
>>>> >>> >> --
>>>> >>> >> Teresa Johnson | Software Engineer | [hidden email] | 408-460-2413
>>>> >>> >>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> --
>>>> >>> Teresa Johnson | Software Engineer | [hidden email] | 408-460-2413
>>>> >>
>>>> >>
>>>> >
>>>> >
>>>
>>>
>>>
>
>



--
Teresa Johnson | Software Engineer | [hidden email] | 408-460-2413
