[llvm-dev] [RFC] Abstract Parallel IR Optimizations


[llvm-dev] [RFC] Abstract Parallel IR Optimizations

Johannes Doerfert via llvm-dev
This is an RFC to add analyses and transformation passes into LLVM to
optimize programs based on an abstract notion of a parallel region.

  == this is _not_ a proposal to add a new encoding of parallelism ==

We currently perform poorly when it comes to optimizations for parallel
codes. In fact, parallelizing your loops might actually prevent various
optimizations that would have been applied otherwise. One solution to
this problem is to teach the compiler about the semantics of the used
parallel representation. While this sounds tedious at first, it turns
out that we can perform key optimizations with reasonable implementation
effort (and thereby also reasonable maintenance costs). However, we have
various parallel representations that are already in use (KMPC,
GOMP, CILK runtime, ...) or proposed (Tapir, IntelPIR, ...).

Our proposal seeks to introduce parallelism specific optimizations for
multiple representations while minimizing the implementation overhead.
This is done through an abstract notion of a parallel region which hides
the actual representation from the analysis and optimization passes. In
the schema below, our current five optimizations (described in detail
here [0]) are shown on the left, the abstract parallel IR interface is
in the middle, and the representation-specific implementations are on
the right.

         Optimization          (A)nalysis/(T)ransformation         Impl.
   ---------------------------------------------------------------------------
     CodePlacementOpt \  /---> ParallelRegionInfo (A) ---------|-> KMPCImpl (A)
       RegionExpander -\ |                                     |   GOMPImpl (A)
   AttributeAnnotator -|-|---> ParallelCommunicationInfo (A) --/   ...
   BarrierElimination -/ |
VariablePrivatization /  \---> ParallelIR/Builder (T) -----------> KMPCImpl (T)
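
To give a rough idea of how the interface layer in the middle column could
look, here is a minimal, hypothetical C++ sketch; all names and signatures
are invented for illustration and are not the code in the patch [1]:

  // Hypothetical sketch only, not the code in D47300.
  #include "llvm/ADT/DenseMap.h"
  #include "llvm/ADT/STLExtras.h"
  #include "llvm/ADT/SmallVector.h"
  #include "llvm/IR/BasicBlock.h"
  #include "llvm/IR/Instruction.h"
  #include "llvm/IR/Value.h"

  namespace llvm {

  // One abstract parallel region, independent of how it is encoded
  // (KMPC/GOMP runtime calls, fork-join instructions, ...).
  class ParallelRegion {
  public:
    virtual ~ParallelRegion() = default;
    // First/last instruction of the region as seen from the sequential code.
    virtual Instruction *getStart() const = 0;
    virtual Instruction *getEnd() const = 0;
    // Visit the blocks that execute as part of the region.
    virtual void visitBlocks(function_ref<void(BasicBlock &)> Visitor) const = 0;
    // Conservative query used, e.g., by barrier elimination.
    virtual bool mayContainBarrier() const = 0;
  };

  // Analysis (A): collects the regions found by a representation-specific
  // implementation (KMPCImpl, GOMPImpl, ...).
  struct ParallelRegionInfo {
    SmallVector<ParallelRegion *, 4> Regions;
  };

  // Analysis (A): maps values outside a region to their counterparts inside
  // it, e.g., a captured variable to the outlined function argument.
  struct ParallelCommunicationInfo {
    DenseMap<Value *, Value *> OutsideToInside;
  };

  } // namespace llvm

A KMPCImpl or GOMPImpl would then populate these analyses by pattern
matching its runtime calls; the optimizations on the left only ever see the
abstract classes.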


In our setting, a parallel region can be an outlined function called
through a runtime library, but also a fork-join/attach-reattach region
embedded in otherwise sequential code. The new passes will provide
parallelism-specific optimizations for all of them (where applicable).
There are various reasons why we believe this is a worthwhile effort that
belongs in the LLVM codebase, including:

  1) We improve the performance of parallel programs, today.
  2) It serves as a meaningful baseline for future discussions on
     (optimized) parallel representations.
  3) It allows us to determine the pros and cons of the different schemes
     when it comes to actual optimizations and inputs.
  4) It helps to identify problems that might arise once we start to
     transform parallel programs but _before_ we commit to a specific
     representation.

Our prototype for the OpenMP KMPC library (used by clang) already shows
significant speedups for various benchmarks [0]. It also exposed a (to
me) previously unknown problem involving restrict/noalias pointers and
(potential) barriers (see Section 3 in [0]).

We are currently in the process of cleaning up the code, extending the
support for OpenMP constructs, and adding a second implementation for
embedded parallel regions. However, a first horizontal prototype
implementation is already available for review [1].

Input of any kind is welcome, and reviewers are needed!

Cheers,
  Johannes


[0] http://compilers.cs.uni-saarland.de/people/doerfert/par_opt18.pdf
[1] https://reviews.llvm.org/D47300


P.S.
  Sorry if you received this message multiple times!

_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev


Re: [llvm-dev] [RFC] Abstract Parallel IR Optimizations

Vikram Adve via llvm-dev

Hi Johannes,

> It is not an IR (encoding) in the sense that there are no new
> instructions, intrinsics, or anything else that is materialized.

If the interface provides a set of functions that query information from
the underlying “implementation” (via ParallelRegionInfo,
ParallelCommunicationInfo) and modify something in the implementation (via
ParallelIR/Builder), that sounds like an IR to me.  It may be hiding the
implementation details, and different compilers may use different
implementations, but the 5 parallel analyses and optimizations don’t care
about that: they’re operating through the interface.

> I'm still confused about the hooks.

Please take a look at the Google Doc. They’re explained clearly there.
We’ll be circulating that soon to the list, once we’ve finished up the more
detailed parts.

—Vikram Adve

// Interim Head, Department of Computer Science
// Donald B. Gillies Professor of Computer Science
// University of Illinois at Urbana-Champaign
// Admin Assistant: Amanda Foley - [hidden email]
// Google Hangouts: [hidden email] || Skype: vikramsadve
// Research page: http://vikram.cs.illinois.edu


From: Johannes Doerfert <[hidden email]>
Date: Thursday, June 7, 2018 at 6:19 AM
To: "Adve, Vikram Sadanand" <[hidden email]>
Cc: LLVM-Dev <[hidden email]>, Hal Finkel <[hidden email]>, TB Schardl <[hidden email]>, "Tian, Xinmin" <[hidden email]>, "[hidden email]" <[hidden email]>
Subject: Re: FW: [RFC] Abstract Parallel IR Optimizations

 

Hi Vikram,

 

Thanks for the quick reply. I put some comments below.

 

On 06/07, Adve, Vikram Sadanand wrote:

Johannes,

I definitely agree that LLVM needs better support for optimizing parallel programs.  Comments inline:

 

> From: Johannes Doerfert <[hidden email]>

> Date: Wednesday, June 6, 2018 at 10:52 AM

> To: LLVM-Dev <[hidden email]>

> Cc: Hal Finkel <[hidden email]>, TB Schardl <[hidden email]>, "Adve, Vikram Sadanand" <[hidden email]>, "Tian, Xinmin" <[hidden email]>, "[hidden email]" <[hidden email]>

> Subject: [RFC] Abstract Parallel IR Optimizations

>

> This is an RFC to add analyses and transformation passes into LLVM to

> optimize programs based on an abstract notion of a parallel region.

>

>   == this is _not_ a proposal to add a new encoding of parallelism ==

Actually, I suspect that your three “abstract” interfaces below

(ParallelRegionInfo, ParallelCommunicationInfo, ParallelIR/Builder)

*are* a parallel IR definition, and what you call the

“implementations” essentially encapsulate *target-specific*

variations.

 

It is not an IR (encoding) in the sense that there are no new

instructions, intrinsics, or anything else that is materialized. The

interface provides a unified view of existing parallel representations

and it is as general as possible.

 

I am unsure what you mean by target-specific here but our

implementations are the actual parallel IR representations as we

discussed them several times on other occasions.

 

IOW, this looks like a perhaps-very-preliminary parallel IR, not some

generic abstract interface.  I can only suspect this because I don’t

have any information about what info or operations these interfaces

provide.

 

I disagree. Please check out the operations and info for the interface

here

and for the builder here

so we talk about the same thing.

 

In fact, I think it *is* a good idea to have a proper parallel IR to

support the kinds of optimizations you’re describing.  I just don’t

think you can define a universal parallel IR for all the parallel

languages and targets that LLVM may want to support.  At a minimum, I

think we need to experiment more before committing to one (*if* one

proved enough).

 

I think you misunderstood our intention here. This patch is in support

of such experiments as it allows to apply the optimizations we have and

might come up with to different parallel IR representations. Not all

optimizations will be applicable to all representations but I don't see

a reason why this is a problem. I also don't think this is a universal

parallel IR of sorts but it is a way to extract critical information

necessary for optimizations out of parallel representations.

 

 

Instead of defining a specific parallel IR or interface, as you’re

trying to do, it would be much better to provide IR-independent hooks

in LLVM so that different parallel IRs can sit “above” the LLVM

representation and support these and other optimizations.  Your IR (or

IR implementations, whatever you call them) could be layered in this

way.

 

I do not understand what you mean here. The "implementations" (as I

called them so recklessly) lift _some_ information from the actual IR

representation to the high level abstract parallel IR interface level

for consumption by the optimization passes. Different parallel IRs can

come with their own implementation and provide more or less information

to enable optimizations.

 

The earlier RFC that Hal Finkel and Xinmin Tian circulated was for

such a set of hooks.  In fact, these hooks also work on regions, but

are much more limited and are *not* meant to support parallel passes

themselves: those are left to any IR built above them.

 

As you know, we (Hal, Xinmin, TB Schardl, George Stelle, Hashim

Sharif, I, and a few others) are trying to refine that proposal to

provide examples of multiple IRs the hooks can support, and to make a

clearer argument for the soundness and correctness impact of the hooks

on LLVM passes.  You’ve been invited to our conference calls and to

edit the Google Doc.  It would be valuable if you could add examples

of how your IR information could be connected to LLVM via these hooks.

 

I'm still confused about the hooks.

 

More specific questions below.

> We currently perform poorly when it comes to optimizations for parallel

> codes. In fact, parallelizing your loops might actually prevent various

> optimizations that would have been applied otherwise. One solution to

> this problem is to teach the compiler about the semantics of the used

> parallel representation. While this sounds tedious at first, it turns

> out that we can perform key optimizations with reasonable implementation

> effort (and thereby also reasonable maintenance costs). However, we have

> various parallel representations that are already in use (KMPC,

> GOMP, CILK runtime, ...) or proposed (Tapir, IntelPIR, ...).

>

> Our proposal seeks to introduce parallelism specific optimizations for

> multiple representations while minimizing the implementation overhead.

> This is done through an abstract notion of a parallel region which hides

> the actual representation from the analysis and optimization passes.

In general, for analysis and optimization passes to work with multiple

representations, you’d need to have a pretty rich abstract notion of a

parallel region.  I suspect that your design is pretty

OpenMP-specific.  Even within that context, it would have to capture

quite a rich set of parallel constructs.  Can you describe what

exactly is your abstract notion of a parallel region, and how is it

different from a concrete representation?

 

Good question. The interface implementations will abstract away the
representation details that the optimizations should not have to worry
about, such as:

 

  - runtime library calls

  - variable passing via structures

  - intrinsic calls to encode information

  - ...

 

If you look at our variable privatization pass (which was not included in
this commit), it does not need to know whether variable capturing is done
through a runtime library call (OpenMP), whether captured variables are
handed to a special intrinsic that marks them (IntelPIR), or whether
capturing is just a def-use chain crossing the parallel region boundaries
(Tapir). The pass only needs to know the mapping (variable in sequential
code -> variable in parallel code) as well as the start/end instructions
of the parallel region. It can then test for the conditions that allow it
to privatize the variable, pass it by value instead of by reference, etc.
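
As a rough, hypothetical illustration (simplified C++, not the actual pass;
the outside-to-inside mapping is assumed to come from the
ParallelCommunicationInfo-style analysis), the representation-agnostic part
of such a check could look like:

  // Hypothetical, simplified sketch of a privatization legality check.
  #include "llvm/ADT/DenseMap.h"
  #include "llvm/IR/Instruction.h"
  #include "llvm/IR/User.h"
  #include "llvm/IR/Value.h"
  #include "llvm/Support/Casting.h"

  using namespace llvm;

  static bool mayPassByValue(Value *Outside,
                             const DenseMap<Value *, Value *> &OutsideToInside) {
    Value *Inside = OutsideToInside.lookup(Outside);
    if (!Inside)
      return false;
    // Very conservative: only allow it if every use inside the parallel code
    // is an instruction that does not write memory. A real pass would also
    // rule out escapes and uses separated by (potential) barriers.
    for (const User *U : Inside->users()) {
      const auto *I = dyn_cast<Instruction>(U);
      if (!I || I->mayWriteToMemory())
        return false;
    }
    return true;
  }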

 

Similarly, other passes do not need to know the details of the encoding.
This is probably easy to see for barrier elimination, but parallel region
expansion, too, only looks at the surrounding code to determine whether
expansion (and region merging) might make sense.
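
For barrier elimination, for example, the representation-independent core
is a query over the instructions between two synchronization points. A
heavily simplified, hypothetical sketch (a real pass would use alias
information to decide what can actually be observed by other threads):

  // Hypothetical sketch: a barrier is redundant if no instruction between
  // it and the previous synchronization point can take part in cross-thread
  // communication. mayCommunicate() is a deliberately crude stand-in.
  #include "llvm/ADT/ArrayRef.h"
  #include "llvm/IR/Instruction.h"

  using namespace llvm;

  static bool mayCommunicate(const Instruction &I) {
    // A real implementation would ask AliasAnalysis whether the access can
    // touch memory that is visible to other threads.
    return I.mayReadOrWriteMemory();
  }

  static bool isBarrierRedundant(ArrayRef<const Instruction *> SinceLastSync) {
    for (const Instruction *I : SinceLastSync)
      if (mayCommunicate(*I))
        return false;
    return true;
  }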

 

> In the schemata below, our current five optimizations (described in

> detail here [0]) are shown on the left, the abstract parallel IR

> interface is is in the middle, and the representation specific

> implementations is on the right.

>

>          Optimization          (A)nalysis/(T)ransformation         Impl.

>    ---------------------------------------------------------------------------

>      CodePlacementOpt \  /---> ParallelRegionInfo (A) ---------|-> KMPCImpl (A)

>        RegionExpander -\ |                                     |   GOMPImpl (A)

>    AttributeAnnotator -|-|---> ParallelCommunicationInfo (A) --/   ...

>    BarrierElimination -/ |

> VariablePrivatization /  \---> ParallelIR/Builder (T) -----------> KMPCImpl (T)

I wasn’t able to understand parts of this figure.  What info do

ParallelRegionInfo() and ParallelCommunicationInfo() provide?   What

operations does ParallelIR/Builder provide?

 

So far not much. ParallelRegionInfo just collects and manages the parallel
regions. These know about their extent (start/end), provide a way to visit
the blocks and instructions, and have some helper functions to reason about
(potential) barriers. Nothing here is specific to OpenMP.

ParallelCommunicationInfo provides the mapping of variables outside the
parallel region to the ones inside and tells you how they are used (e.g.,
whether a pointer is actually a read-only container). Finally, the builder
is the only way to modify the parallel regions (as the optimizations do not
know the actual representation). Currently it can only add/remove
attributes, but my local version can create parallel regions, build
barriers, etc.
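
As a purely hypothetical usage sketch (invented names, building on the
interface sketch earlier in this thread, with an assumed builder API and an
assumed helper), an annotator-style pass could then be written against the
abstraction alone, never touching the concrete encoding:

  // Hypothetical sketch of a consumer; ParallelRegionInfo and
  // ParallelCommunicationInfo are the abstract analyses described above,
  // ParallelIRBuilder and onlyReadsMemoryInRegion() are assumed.
  void annotateReadOnlyCommunication(ParallelRegionInfo &PRI,
                                     ParallelCommunicationInfo &PCI,
                                     ParallelIRBuilder &Builder) {
    for (ParallelRegion *PR : PRI.Regions) {
      for (auto &Mapping : PCI.OutsideToInside) {
        Value *Inside = Mapping.second;
        // If the communicated pointer is only read inside the region, let
        // the builder attach 'readonly' in whatever form the concrete
        // encoding uses (call-site attribute, outlined argument, ...).
        if (onlyReadsMemoryInRegion(Inside, *PR)) // assumed helper
          Builder.addReadOnlyAttribute(*PR, Inside);
      }
    }
  }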

 

For the current code check the two links from above.

> In our setting, a parallel region can be an outlined function called

> through a runtime library but also a fork-join/attach-reattach region

> embedded in an otherwise sequential code. The new optimizations will

> provide parallelism specific optimizations to all of them (if

> applicable). There are various reasons why we believe this is a

> worthwhile effort that belongs into the LLVM codebase, including:

Before adding something this broad into the LLVM code base, I think we

need to understand a whole slew of things about it, starting with the

questions above.

 

I'm not sure about broad. Do you mean because of the commit size? Most

of it is boilerplate and it is a fully working horizontal prototype.

Actually, there are only 1091 lines (all types included) added to

llvm/lib/.

 

>   1.  We improve the performance of parallel programs, today.

There are a number of important parallel languages and both

language-specific and target-specific parallel optimizations.  Just

because these five optimizations improve OpenMP program performance

isn’t enough to justify adding a new parallel IR (or interface) to

LLVM, without knowing whether other languages, targets and

optimizations could be correctly and effectively implemented using

this approach.

 

This will not prevent us from testing other languages and optimizations;
in fact, it should enable us to do so. If it somehow turns out that this
abstraction makes it impossible to implement an optimization correctly and
effectively, and that optimization is very important to us, we can still
get rid of the abstraction and keep the rest of the code. The other
optimizations we have will still work even without the abstraction. The
same holds true if we decide on a parallel IR representation and do not
need to support multiple ones anymore.

 

>    2) It serves as a meaningful baseline for future discussions on

>    (optimized) parallel representations.

We can’t just add a new parallel IR design (abstract or concrete) to

mainline just to serve as a baseline for future discussions.

 

Not _just_! Adding something beneficial to mainline that makes sense (at
least to me, but let's see what others think) does not strike me as a
problem. If that happened, we would finally have a baseline for parallel
optimizations instead of comparing unoptimized programs to optimized ones
encoded in the new IR we came up with.

 

>   3) It allows to determine the pros and cons of the different schemes

>      when it comes to actual optimizations and inputs.

Same problem.

 

This is not the same problem. When you design your IR and you have to make
trade-offs, you would probably perform experiments and decide. However, for
that you need a test bed to perform the experiments in, an implementation
of both choices, etc. If we combine optimizations we think are meaningful
with representations we think are reasonable, we can actually learn if and
how they impact each other for the benchmarks we care about.

 

>   4) It helps to identify problems that might arise once we start to

>      transform parallel programs but _before_ we commit to a specific

>      representation.

I suspect you *are* committing to a specific representation, although

we can’t be sure until you provide more details.

 

Please check out the links to the interface codes above.

 

> Our prototypes for the OpenMP KMPC library (used by clang) already shows

> significant speedups for various benchmarks [0]. It also exposed a (to

> me) prior unknown problem between restrict/noalias pointers and

> (potential) barriers (see Section 3 in [0]).

>

> We are currently in the process of cleaning the code, extending the

> support for OpenMP constructs and adding a second implementation for a

> embedded parallel regions. Though, a first horizontal prototype

> implementation is already available for review [1].

>

> Inputs of any kind are welcome and reviewers are needed!

>

> Cheers,

>   Johannes


--

 

Johannes Doerfert

PhD Student / Researcher

 

Compiler Design Lab (Professor Hack) / Argonne National Laboratory

Saarland Informatics Campus, Germany / Lemont, IL 60439, USA

Building E1.3, Room 4.31

 

Tel. +49 (0)681 302-57521 : [hidden email] / [hidden email]

 



Re: [llvm-dev] [RFC] Abstract Parallel IR Optimizations

Roger Ferrer Ibáñez via llvm-dev
In reply to this post by Johannes Doerfert via llvm-dev
Hi Johannes,

apologies in advance if the following questions are silly or don't make
sense. I lack a bit of context here and I'm not sure I fully understand
your proposal.

Currently clang (and flang) lower OpenMP when building LLVM IR (this is
because LLVM IR can't express the parallel/concurrent concepts of OpenMP,
so they have to be lowered first). So, can I assume that your proposal
starts off in a context where that lowering no longer happens in the front
end but instead happens later in an LLVM IR pass? If so, then you'd be
assuming that there is already a way of representing OpenMP constructs in
LLVM IR; is my understanding correct here? I think that the Intel proposal
[1] could be one way (not necessarily the one) to do this (disregarding
the fact that it is tailored for OpenMP); does this still make sense?

If this is the case, and given that you explicitly state that this is
not a Parallel IR of any sort, is your suggestion to improve
optimisation of OpenMP code, based on a "side-car"/ancillary
representation built on top of the existing IR, which as I understand
should already be able to represent OpenMP? But then this looks a bit
redundant to me. So I'm pretty sure one of my assumptions is
incorrect, unless your auxiliary representation is more of an alternative
to the W-regions [1].

Or, maybe I am completely wrong here: you didn't say anything about
the FE lowering, which would still happen, and then your proposal
builds on top of that. I don't think you meant that, given that your
proposal mentions KMP and GOMP (and the current lowering done by clang
targets only KMP).

Thank you very much,
Roger

[1] https://dl.acm.org/citation.cfm?id=3148191




--
Roger Ferrer Ibáñez

Re: [llvm-dev] [RFC] Abstract Parallel IR Optimizations

Johannes Doerfert via llvm-dev
Hi Roger,

On 06/12, Roger Ferrer Ibáñez wrote:
> apologies in advance if the questions following are silly or don't
> make sense. I lack a bit of context here and I'm not sure to fully
> understand your proposal.

No worries, I'm glad if people ask questions!

> Currently clang (and flang) are lowering OpenMP when building LLVM IR
> (this is because LLVM IR can't express the parallel/concurrent
> concepts of OpenMP so they have to be lowered first). So, can I assume
> that your proposal starts off in a context where that lowering is not
> happening anymore in the front end but it'd happen later in a LLVM IR
> pass? If so, then you'd be assuming that there is already a way of
> representing OpenMP constructs in the LLVM IR, is my understanding
> correct here? I think that the Intel proposal [1] could be one way
> (not necessarily the one) to do this (disregarding the fact that it is
> tailored for OpenMP), does this still make sense?
My proposal does _not_ assume we change clang in any way, nor does it
require such changes. However, the initial patch [1] will only work with
the OpenMP lowering used by clang right now.

The idea is as follows:

  We have different representations of parallelism in the IR, for example
  the KMP runtime library calls emitted by clang or the Intel parallel IR
  you mentioned. For each of them we write a piece of code that (1)
  extracts domain-specific information and (2) allows us to modify the
  parallel representation. This is the only piece of code that has to be
  adapted for each parallel representation we want to optimize. On top of
  this are abstract interfaces that expose the information and modification
  options to parallel optimization passes. The patch [1] only contains the
  attribute annotator, but we have more, as explained in the paper [0].
  The analysis/optimization logic is part of these passes and is not aware
  of the underlying representation. We can consequently use the same passes
  to optimize code that was lowered to use different parallel runtime
  libraries (GOMP, KMP, Cilk runtime, TBB, ...) or into a native parallel
  IR (of any shape). This is especially useful as the native parallel IR
  might not always be usable. If that happens, we have to fall back to
  early outlining, i.e., to runtime library calls emitted by the front end.
  Even if we at some point have a native parallel representation that is
  always used, we can simply remove the abstraction introduced by this
  approach but keep the analyses/optimizations around.

[0] http://compilers.cs.uni-saarland.de/people/doerfert/par_opt18.pdf
[1] https://reviews.llvm.org/D47300
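
To make the KMPC case a bit more concrete: the representation-specific
piece is essentially pattern matching of the runtime calls clang emits. A
rough, hypothetical sketch (the __kmpc_fork_call entry point and argument
layout are how clang lowers parallel regions today; the struct and helper
names are invented):

  // Hypothetical sketch of the KMPC-specific side: clang lowers a parallel
  // region to an outlined function passed to __kmpc_fork_call, so regions
  // can be recovered by matching those call sites.
  #include "llvm/ADT/SmallVector.h"
  #include "llvm/IR/Function.h"
  #include "llvm/IR/InstIterator.h"
  #include "llvm/IR/Instructions.h"
  #include "llvm/Support/Casting.h"

  using namespace llvm;

  struct KMPCRegion {
    CallInst *ForkCall;   // call to __kmpc_fork_call in the sequential code
    Function *OutlinedFn; // outlined body of the parallel region
  };

  static SmallVector<KMPCRegion, 4> findKMPCRegions(Function &F) {
    SmallVector<KMPCRegion, 4> Regions;
    for (Instruction &I : instructions(F)) {
      auto *CI = dyn_cast<CallInst>(&I);
      if (!CI || !CI->getCalledFunction() ||
          CI->getCalledFunction()->getName() != "__kmpc_fork_call")
        continue;
      // The outlined parallel body is the third argument, after the source
      // location and the captured-argument count.
      auto *Outlined =
          dyn_cast<Function>(CI->getArgOperand(2)->stripPointerCasts());
      if (Outlined)
        Regions.push_back({CI, Outlined});
    }
    return Regions;
  }

A KMPCImpl-style analysis could then wrap each such pair in whatever object
implements the abstract parallel region interface.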

> If this is the case, and given that you explicitly state that this is
> not a Parallel IR of any sort, is your suggestion to improve
> optimisation of OpenMP code, based on a "side-car"/ancillary
> representation built on top of the existing IR, which as I understand
> should already be able to represent OpenMP? But then this looks a bit
> redundant to me. So I'm pretty sure one of my assumptions is
> incorrect. Unless your auxiliar representation is more an alternative
> to the W-regions [1].
>
> Or, maybe I am completely wrong here: you didn't say anything about
> the FE lowering, which would still happen, and then your proposal
> builds on top of that. I don't think you meant that, given that your
> proposal mentions KMP and GOMP (and the current lowering done by clang
> targets only KMP).
I'm not sure if these paragraphs are still relevant. Does the above
"explanation" already answer your questions? If not, please continue
asking!

Cheers,
  Johannes

--

Johannes Doerfert
PhD Student / Researcher

Compiler Design Lab (Professor Hack) / Argonne National Laboratory
Saarland Informatics Campus, Germany / Lemont, IL 60439, USA
Building E1.3, Room 4.31

Tel. +49 (0)681 302-57521 : [hidden email] / [hidden email]
Fax. +49 (0)681 302-3065  : http://www.cdl.uni-saarland.de/people/doerfert



Re: [llvm-dev] [RFC] Abstract Parallel IR Optimizations

Roger Ferrer Ibáñez via llvm-dev
Hi Johannes,

thanks a lot, all clear now!

Kind regards,
Roger





--
Roger Ferrer Ibáñez