[llvm-dev] Floating point operations with specific rounding and exception properties

Rajesh S R via llvm-dev
Hi all,

During the review of https://reviews.llvm.org/D65997, an issue came up concerning how the compiler should represent constrained floating point operations.

If a floating point operation requires a rounding mode or exception behavior different from the default, it should be represented by a constrained intrinsic (http://llvm.org/docs/LangRef.html#constrained-floating-point-intrinsics). An important point is that, under the current design, if any part of a function contains such an intrinsic, all floating point operations in the function must be represented by constrained intrinsics as well. This rule is meant to prevent undesired movement of fp operations. The discussion is in the thread http://lists.llvm.org/pipermail/cfe-dev/2017-August/055325.html; the relevant example is:

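#define _GNU_SOURCE  /* feenableexcept/fedisableexcept are glibc extensions */
#include <fenv.h>

double f(double a, double b, double c) {
  double d;  /* declared outside the block so it is still visible at the return */
  {
#pragma STDC FENV_ACCESS ON
    feenableexcept(FE_OVERFLOW);
    d = a * b;
    fedisableexcept(FE_OVERFLOW);
  }
  return c * d;
}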

The second fmul (c * d) must not be hoisted above the call to fedisableexcept. Constrained intrinsics are expected to help in this case because optimization passes do not recognize them and so leave them in place.
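
To make the hazard concrete, here is a toy model in plain C. It is not LLVM code, and the names Env, Op and run are made up for illustration. It replays the four steps of f() against a simulated overflow-trap bit and counts how many multiplications would trap, first in source order and then with the second multiplication hoisted above the simulated fedisableexcept. Overflow is detected with isinf, which assumes finite inputs.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical toy model, not LLVM code: one simulated FP environment bit. */
typedef struct { int overflow_trap_enabled; } Env;

typedef enum { ENABLE_TRAP, DISABLE_TRAP, MUL_AB, MUL_CD } Op;

/* Replays `ops` and returns how many multiplications overflowed while the
   simulated overflow trap was enabled. Assumes a, b and c are finite. */
static int run(const Op *ops, size_t n, double a, double b, double c) {
  Env env = {0};
  double d = 0.0, r = 0.0;
  int traps = 0;
  for (size_t i = 0; i < n; ++i) {
    switch (ops[i]) {
    case ENABLE_TRAP:  env.overflow_trap_enabled = 1; break;
    case DISABLE_TRAP: env.overflow_trap_enabled = 0; break;
    case MUL_AB:
      d = a * b;
      if (isinf(d) && env.overflow_trap_enabled) ++traps;
      break;
    case MUL_CD:
      r = c * d;
      if (isinf(r) && env.overflow_trap_enabled) ++traps;
      break;
    }
  }
  return traps;
}

/* Source order: c * d runs after the trap has been disabled. */
static const Op source_order[] = {ENABLE_TRAP, MUL_AB, DISABLE_TRAP, MUL_CD};
/* After an illegal hoist: c * d runs while the trap is still enabled. */
static const Op hoisted_order[] = {ENABLE_TRAP, MUL_AB, MUL_CD, DISABLE_TRAP};

int traps_in_source_order(double a, double b, double c) {
  return run(source_order, 4, a, b, c);
}

int traps_after_hoist(double a, double b, double c) {
  return run(hoisted_order, 4, a, b, c);
}
```

With a = 1e200, b = 1.0 and c = 1e200, the first product is finite and only c * d overflows. The source order then reports no trap, while the hoisted order reports one, even though the data dependence between the two multiplications is respected in both.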

The concern is that using constrained intrinsics in a small region of a function forces their use everywhere in that function, and in every function it is inlined into. Because constrained intrinsics inhibit optimizations, this can degrade performance.

A couple of examples:
1. A performance-critical function does most of its calculations in the default fp mode, but at some points it enables fp exceptions and performs an action that can trigger such an exception. Using constrained intrinsics throughout would cost performance, even though the code that actually needs them is very compact.
2. Cores used for machine learning usually work with short data types (half, bfloat16 or even shorter). Rounding control matters much more here than on big cores; choosing the proper rounding in different parts of an algorithm can improve precision. Constrained intrinsics are the only way to enforce a particular rounding mode, but using them results in poor optimization, which is intolerable. On such cores the rounding mode may be encoded in the instructions themselves, so code motion cannot break the semantics.
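
As an illustration of the second example, the sketch below converts a float to bfloat16 in software, with the rounding mode passed as an argument of the operation instead of being read from global state. The function and enum names are made up for illustration, and NaN inputs are deliberately not handled.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical names for illustration; NaN inputs are not handled. */
typedef enum { BF16_NEAREST_EVEN, BF16_TOWARD_ZERO } Bf16Rounding;

/* Converts a float to bfloat16 (the upper 16 bits of an IEEE-754 binary32)
   using the rounding mode supplied with the operation itself. */
uint16_t float_to_bf16(float x, Bf16Rounding mode) {
  uint32_t bits;
  memcpy(&bits, &x, sizeof bits);  /* well-defined way to read the bit pattern */
  if (mode == BF16_NEAREST_EVEN) {
    /* Add just under half an ulp, plus one more if the kept LSB is odd,
       so that exact ties round to the even neighbour. */
    bits += 0x7FFFu + ((bits >> 16) & 1u);
  }
  /* BF16_TOWARD_ZERO simply drops the low 16 bits. */
  return (uint16_t)(bits >> 16);
}
```

Because the mode travels with each call, reordering two conversions that use different modes cannot change either result. That is the property the paragraph above attributes to cores that encode the rounding mode in the instruction.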

The representation of fp operations could be more flexible, so that a user would not pay for rounding/exception control with degraded performance. For that we need to be able to mix constrained intrinsics and regular fp operations in one function.

The question is: how can we prevent fp operations from being moved across the boundaries of a region in which specific rounding and/or exception behavior applies? Any ideas?

Thanks,
--Serge

_______________________________________________
LLVM Developers mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] Floating point operations with specific rounding and exception properties

Rajesh S R via llvm-dev
On Tue, Aug 20, 2019 at 1:02 PM Serge Pavlov via llvm-dev <[hidden email]> wrote:
The question is: how can we prevent fp operations from being moved across the boundaries of a region in which specific rounding and/or exception behavior applies? Any ideas?

Okay, I'll bite...

Preventing the hoisting of FP arithmetic was one of the driving factors in creating the constrained intrinsics. If we could solve that problem, then the constrained intrinsics would be *less* necessary (I say "less" since there are other problems, but hoisting is one of the significant ones).

That said, our out-of-tree FPEnv mode attempts to do just that -- selectively throttle unsafe optimizations. Barring any YDKWYDK's, I intend to blow the doors off of the constrained intrinsics, performance-wise. :P


Re: [llvm-dev] Floating point operations with specific rounding and exception properties

Rajesh S R via llvm-dev
Which optimization did you find unsafe?

Thanks,
--Serge


On Wed, Aug 21, 2019 at 05:12, Cameron McInally <[hidden email]> wrote:
That said, our out-of-tree FPEnv mode attempts to do just that -- selectively throttle unsafe optimizations. Barring any YDKWYDK's, I intend to blow the doors off of the constrained intrinsics, performance-wise. :P


Re: [llvm-dev] Floating point operations with specific rounding and exception properties

Rajesh S R via llvm-dev
Hi,

The LLVM-VP extension (https://reviews.llvm.org/D57504) generalizes PatternMatch.h to match FP intrinsics as well as regular fp (vector) instructions with the same pattern. We use this to lift the pattern rewrites in InstSimplify and InstCombine to predicated vector instructions. The same logic could be applied to "scalar" constrained FP intrinsics. Hal has requested that the VP intrinsics model fp exception/rounding too.

So the suggestion is to keep using the fp exception/rounding mode arguments, but to teach LLVM to handle them in its optimizations and analyses.

Example
-----------

PatternMatch.h changes: https://reviews.llvm.org/D57504#change-cWgJ3XBlLNvs
AddSub code in InstCombine: https://reviews.llvm.org/D57504#change-24P4gqRF9sNj
Note that "visitPredicatedFSub" will match either the regular FSub instruction or the llvm.vp.fsub intrinsic.
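
The idea of one pattern matching both forms can be pictured with a toy model in C. This is only a sketch: the Node type, match_fsub and simplify_fsub are hypothetical names, not LLVM's PatternMatch API, and the mask operand of the VP intrinsics is left out. A plain fsub is treated as an intrinsic with default rounding and ignored exceptions, so a single rewrite rule covers both spellings and consults the rounding/exception arguments before firing.

```c
#include <math.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy IR for illustration; none of these names are LLVM's. */
typedef enum { ROUND_NEAREST, ROUND_TOWARD_NEGATIVE, ROUND_DYNAMIC } Rounding;
typedef enum { EXCEPT_IGNORE, EXCEPT_STRICT } ExceptBehavior;
typedef enum { K_VALUE, K_CONST, K_FSUB_INST, K_FSUB_INTRINSIC } Kind;

typedef struct Node {
  Kind kind;
  double constant;              /* used when kind == K_CONST */
  const struct Node *lhs, *rhs; /* operands of either fsub form */
  Rounding rounding;            /* only meaningful for the intrinsic form */
  ExceptBehavior except;        /* only meaningful for the intrinsic form */
} Node;

/* One matcher for both spellings of "subtract": a plain instruction is
   reported as default rounding with exceptions ignored. */
static bool match_fsub(const Node *n, Rounding *r, ExceptBehavior *e) {
  if (n->kind == K_FSUB_INST) {
    *r = ROUND_NEAREST;
    *e = EXCEPT_IGNORE;
    return true;
  }
  if (n->kind == K_FSUB_INTRINSIC) {
    *r = n->rounding;
    *e = n->except;
    return true;
  }
  return false;
}

static bool is_pos_zero(const Node *n) {
  return n->kind == K_CONST && n->constant == 0.0 && !signbit(n->constant);
}

/* Rewrites `x - (+0.0)` to `x` when that is known to be safe; otherwise
   returns the node unchanged. */
const Node *simplify_fsub(const Node *n) {
  Rounding r;
  ExceptBehavior e;
  if (!match_fsub(n, &r, &e) || !is_pos_zero(n->rhs))
    return n;
  /* (+0.0) - (+0.0) is -0.0 when rounding toward negative, so the rewrite
     is skipped unless that mode is ruled out. */
  if (r == ROUND_TOWARD_NEGATIVE || r == ROUND_DYNAMIC)
    return n;
  /* Under strict exception semantics a signaling NaN operand would raise
     "invalid", which the rewrite would lose. */
  if (e == EXCEPT_STRICT)
    return n;
  return n->lhs;
}
```

In this model the rewrite fires for the plain instruction and for an intrinsic with default arguments, and is withheld when the rounding mode could be toward negative or when exception behavior is strict.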


- Simon


On 8/20/19 7:00 PM, Serge Pavlov via llvm-dev wrote:
The question is: how can we prevent fp operations from being moved across the boundaries of a region in which specific rounding and/or exception behavior applies? Any ideas?

-- 

Simon Moll
Researcher / PhD Student

Compiler Design Lab (Prof. Hack)
Saarland University, Computer Science
Building E1.3, Room 4.31

Tel. +49 (0)681 302-57521 : [hidden email]
Fax. +49 (0)681 302-3065  : http://compilers.cs.uni-saarland.de/people/moll

Re: [llvm-dev] Floating point operations with specific rounding and exception properties

Rajesh S R via llvm-dev
On Tue, Aug 20, 2019 at 9:15 PM Serge Pavlov <[hidden email]> wrote:
Which optimization did you find unsafe?

Oh, there are quite a lot. I mentioned Hoisting already. Constant Folding is a big one. InstCombine and DAGCombine have some issues, like preserving masks (op+select masks -- this may be less of a problem with true predication). The LoopVectorizer also needed proper masks (not just masked loads/stores) for targets that support them. APFloat has some issues (I'm intending to upstream fixes for signaling NaNs, if I ever have time). And a host of others.

Stepping back a little, the goal of FPEnv-safe compilation is just that... to avoid unsafe FP transformations. The constrained intrinsics implementation seeks to prevent almost all FP optimizations at first, safe and unsafe, and then later add safe optimizations back in. My alternative implementation is to find and *very* selectively throttle unsafe optimizations -- my intuition says that there are far fewer unsafe optimizations than there are safe ones. I believe this is the much shorter path. So, the two competing implementations are really attacking the problem from two different ends. Who gets to the goal first is TBD...

To be completely fair, is my alternative solution the best path for upstream LLVM? Maybe, maybe not. The constrained intrinsics will be far less buggy in the early stages, since essentially all optimizations are quashed. But in the same breath, safe code running at the equivalent of -O0 is fairly useless (at least to our customers).   

Re: [llvm-dev] Floating point operations with specific rounding and exception properties

Rajesh S R via llvm-dev
Thank you for sharing your experience. It seems that LICM may also be unsafe.

 my intuition says that there are far less unsafe optimization than there are safe optimizations.
...
 safe code running at the equivalent of -O0 is fairly useless
 
I believe this is the right viewpoint. There must be a way to do this without sacrificing performance.

Thanks,
--Serge


On Wed, Aug 21, 2019 at 21:57, Cameron McInally <[hidden email]> wrote:
To be completely fair, is my alternative solution the best path for upstream LLVM? Maybe, maybe not. The constrained intrinsics will be far less buggy in the early stages, since essentially all optimizations are quashed. But in the same breath, safe code running at the equivalent of -O0 is fairly useless (at least to our customers).   
