[llvm-dev] The builtins library of compiler-rt is a performance HOG^WKILLER


Alberto Barbaro via llvm-dev
Hi @ll,

LLVM-7.0.0-win32.exe contains and installs
lib\clang\7.0.0\lib\windows\clang_rt.builtins-i386.lib

The implementation of (at least) the multiplication and division
routines __[u]{div,mod,divmod,mul}[sdt]i[34] shipped with this
library SUCKS: they are several times SLOWER than even Microsoft's
NOTORIOUSLY POOR implementation of 64-bit division shipped with
MSVC and Windows!

The reasons: 1. subroutine matryoshka (routines nested in routines),
             2. "C" implementation!

JFTR: the target processor "i386" (introduced October 1985) is
      a 32-bit processor; it has instructions to divide 64-bit
      integers by 32-bit integers, and to multiply two 32-bit
      integers giving a 64-bit product!
      I expect a library written 20+ years later to take
      advantage of these instructions!
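
To make the point about the hardware concrete, here is a minimal sketch in portable C of dividing a 64-bit value by a 32-bit divisor in two chained steps. It is not code from compiler-rt, and the helper names `div_64_by_32` and `udiv_64_by_32` are made up for illustration. Each call of the inner helper has the shape of one i386 DIV (64-bit dividend, 32-bit divisor, 32-bit quotient and remainder), although a C compiler is of course free to emit something else for it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: one "DIV-shaped" step. Divides the 64-bit value
 * hi:lo by d. Like the DIV instruction it requires hi < d, so that the
 * quotient fits in 32 bits. */
static uint32_t div_64_by_32(uint32_t hi, uint32_t lo, uint32_t d,
                             uint32_t *rem)
{
    assert(d != 0 && hi < d);
    uint64_t n = ((uint64_t)hi << 32) | lo;
    *rem = (uint32_t)(n % d);
    return (uint32_t)(n / d);
}

/* 64-bit dividend, 32-bit divisor: schoolbook long division on two
 * base-2^32 digits. The remainder of the first step becomes the high
 * half of the second step, so hi < d holds automatically. */
static uint64_t udiv_64_by_32(uint64_t n, uint32_t d, uint32_t *rem)
{
    uint32_t r_hi;
    uint32_t q_hi = div_64_by_32(0, (uint32_t)(n >> 32), d, &r_hi);
    uint32_t q_lo = div_64_by_32(r_hi, (uint32_t)n, d, rem);
    return ((uint64_t)q_hi << 32) | q_lo;
}
```

A divisor that does not fit in 32 bits needs more work than this; the sketch only covers the case the paragraph above describes.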

__divsi3 (18 instructions) performs a DIV after 2 calls of abs(),
                           plus a final negation, instead of just
                           a single IDIV
__modsi3 (14 instructions) calls __divsi3 (18 instructions)
__divmodsi4 (17 instructions) calls __divsi3 (18 instructions)

__udivsi3 (52 instructions) does NOT use DIV, but performs BITWISE
                            division using shifts and additions!
__umodsi3 (14 instructions) calls __udivsi3 (52 instructions)
__udivmodsi4 (17 instructions) calls __udivsi3 (52 instructions)
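
For readers unfamiliar with the term, "bitwise division using shifts" generally means a shift-and-subtract loop that produces one quotient bit per iteration. The following is a generic textbook sketch of that technique, not the actual compiler-rt source; `udiv32_bitwise` and `udiv32_native` are illustrative names. The second function is what the complaint asks for: a plain `/`, which an x86 compiler can lower to a single DIV.

```c
#include <assert.h>
#include <stdint.h>

/* Restoring (shift-and-subtract) division: 32 iterations, one quotient
 * bit each. The running remainder is kept in 64 bits so that shifting
 * it left can never overflow. */
static uint32_t udiv32_bitwise(uint32_t n, uint32_t d)
{
    assert(d != 0);
    uint32_t q = 0;
    uint64_t r = 0;
    for (int i = 31; i >= 0; --i) {
        r = (r << 1) | ((n >> i) & 1u);   /* bring down the next bit */
        if (r >= d) {                     /* divisor fits: subtract it */
            r -= d;
            q |= (uint32_t)1 << i;
        }
    }
    return q;
}

/* The same operation expressed so the compiler can use the hardware. */
static uint32_t udiv32_native(uint32_t n, uint32_t d)
{
    assert(d != 0);
    return n / d;
}
```

Both functions return the same quotient; the difference is only in how much work is done per call.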

__muldi3 (41 instructions) performs a "long" multiplication on
                           16-bit "digits"

JFTR: I haven't checked whether clang actually calls the
      SUPERFLUOUS routines listed above.
      IT HAD BETTER NOT, EVER!

__divdi3 (37 instructions) calls __udivmoddi4 (254 instructions)
__moddi3 (51 instructions) calls __udivmoddi4 (254 instructions)
__divmoddi4 (36 instructions) calls __divdi3 (37 instructions) which
                              calls __udivmoddi4 (254 instructions)
__udivdi3 (8 instructions) calls __udivmoddi4 (254 instructions)
__umoddi3 (33 instructions) calls __udivmoddi4 (254 instructions)
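
The call chain listed above can be pictured with the following sketch, in which every 64-bit division entry point is a thin wrapper around one combined divide-and-remainder routine. This is an assumption-laden illustration, not the compiler-rt source: the `sketch_` names are invented so they cannot collide with the real builtins, and the stand-in for `__udivmoddi4` simply uses the C operators.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the combined routine: quotient returned, remainder
 * optionally stored through rem. */
static uint64_t sketch_udivmoddi4(uint64_t a, uint64_t b, uint64_t *rem)
{
    assert(b != 0);
    if (rem != NULL)
        *rem = a % b;
    return a / b;
}

/* Unsigned wrappers: each one is just a call plus argument shuffling. */
static uint64_t sketch_udivdi3(uint64_t a, uint64_t b)
{
    return sketch_udivmoddi4(a, b, NULL);
}

static uint64_t sketch_umoddi3(uint64_t a, uint64_t b)
{
    uint64_t r;
    sketch_udivmoddi4(a, b, &r);
    return r;
}

/* Absolute value computed in unsigned arithmetic, so INT64_MIN is safe. */
static uint64_t magnitude(int64_t x)
{
    return x < 0 ? 0 - (uint64_t)x : (uint64_t)x;
}

/* Signed wrappers: strip the signs, divide unsigned, restore the sign.
 * The quotient is negative when the operand signs differ; the remainder
 * takes the sign of the dividend. */
static int64_t sketch_divdi3(int64_t a, int64_t b)
{
    uint64_t q = sketch_udivmoddi4(magnitude(a), magnitude(b), NULL);
    return (int64_t)(((a < 0) != (b < 0)) ? 0 - q : q);
}

static int64_t sketch_moddi3(int64_t a, int64_t b)
{
    uint64_t r;
    sketch_udivmoddi4(magnitude(a), magnitude(b), &r);
    return (int64_t)((a < 0) ? 0 - r : r);
}
```

The layering keeps the C code small, but every caller pays for the extra call and for the full generality of the innermost routine, which is the overhead the message objects to.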

JFTR: the subdirectory compiler-rt/lib/builtins/i386/ contains FAR
      better (although suboptimal) __divdi3, __moddi3, __udivdi3 and
      __umoddi3 routines written in assembler, which SHOULD be
      shipped with clang_rt.builtins-i386.lib instead of the POOR,
      NOT optimised implementations listed above!

NOT AMUSED
Stefan Kanthak

PS: <https://lists.llvm.org/pipermail/llvm-dev/2018-November/128094.html>
    has patches for the assembler routines!

PPS: please remove the blatant lie
     | The builtins library provides optimized implementations of
     | this and other low-level routines, either in target-independent
     | C form, or as a heavily-optimized assembly.
     seen on <https://compiler-rt.llvm.org/>
     These routines are NOT optimized, and for sure NOT heavily-
     optimized!
_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] The builtins library of compiler-rt is a performance HOG^WKILLER

Alberto Barbaro via llvm-dev
None of the "si" division routines will be used by x86. They exist for targets that don't support the operations natively. X86 supports them natively so will never use the library functions.

X86 has its own assembly implementation of __muldi3 that uses 32-bit pieces.
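
As a rough picture of what "uses 32-bit pieces" means, a 64-bit product can be assembled from three 32-bit multiplications. The sketch below is plain C written for illustration, not the i386 assembly from compiler-rt, and `mul64_from_32bit_pieces` is a made-up name. It assumes `unsigned int` is 32 bits wide, so that the cross terms wrap modulo 2^32.

```c
#include <stdint.h>

/* 64 x 64 -> 64-bit multiply built from 32-bit halves.
 * a * b = a_lo*b_lo + ((a_lo*b_hi + a_hi*b_lo) << 32)   (mod 2^64)
 * The a_hi*b_hi term would be shifted by 64 bits and drops out. */
static uint64_t mul64_from_32bit_pieces(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    /* Full 32 x 32 -> 64 product: the shape of one widening MUL. */
    uint64_t low = (uint64_t)a_lo * b_lo;

    /* Only the low 32 bits of the cross terms survive the shift. */
    uint32_t cross = a_lo * b_hi + a_hi * b_lo;

    return low + ((uint64_t)cross << 32);
}
```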

We should be using the assembly versions of the "di" division routines on i386, except when compiler-rt is built with MSVC, because MSVC can't parse the AT&T assembly syntax.

~Craig



Re: [llvm-dev] The builtins library of compiler-rt is a performance HOG^WKILLER

Alberto Barbaro via llvm-dev
"Craig Topper" <[hidden email]> wrote:

> None of the "si" division routines will be used by x86.

That was my expectation too.

> They exist for targets that don't support the operations natively.
> X86 supports them natively so will never use the library functions.

So they SHOULD not be built (or at least not shipped) with the
builtins library for x86.

> X86 has its own assembly implementation of __muldi3 that uses 32-bit
> pieces.

I know; that's why I placed this ABOVE my "JFTR:"
 
> We should be using the assembly versions of the "di" division routines on
> i386. Except when compiler-rt is built with MSVC because MSVC can't parse
> the at&t assembly syntax.

Again: my offer to provide these routines still stands!

I have OPTIMISED __divdi3, __moddi3, __udivdi3 and __umoddi3 in
Intel syntax, wrapped as inline files into an NMakefile, for use
with ML.EXE.
For the optimisations see the patch I sent last week.

Since Howard Hinnant is NO LONGER with LLVM: who is the CURRENT
code owner and reviewer for the builtins library, especially for
x86?

I'm asking this SIMPLE question now for the 3rd time!

I also have __udivmoddi3: adding the pointer to the remainder as
argument and 4 more instructions will turn it into __udivmoddi4.

Compiling them with MSVC is of course easy to achieve: remove the
MASM/ML statements, put the assembler source inside an __asm block,
and add a function definition with __declspec(naked).
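
The pattern described here might look roughly like the sketch below. It is not one of the routines offered in this message: `umul32x32` is a deliberately trivial, hypothetical example chosen only to show the shape of a naked function with an `__asm` block. The inline-assembly branch applies to 32-bit x86 builds with MSVC only; the `#else` branch is a plain C fallback so the file still builds with other compilers.

```c
#include <stdint.h>

#if defined(_MSC_VER) && defined(_M_IX86)
/* No prologue or epilogue is generated for a naked function, so the
 * cdecl arguments sit at [esp + 4] and [esp + 8] on entry and the body
 * must issue its own RET. A 64-bit result is returned in EDX:EAX. */
__declspec(naked) uint64_t __cdecl umul32x32(uint32_t a, uint32_t b)
{
    __asm {
        mov eax, dword ptr [esp + 4]    /* first argument            */
        mul dword ptr [esp + 8]         /* EDX:EAX = EAX * second    */
        ret
    }
}
#else
/* Portable fallback with the same contract: 32 x 32 -> 64 product. */
uint64_t umul32x32(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}
#endif
```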

But then someone will have to find new filenames; I'd prefer to
leave them as *.ASM, so they can be added to YOUR source tree
without clobbering existing files.

The same holds for __alldiv, __alldvrm, __allrem, __aulldiv,
__aulldvrm and __aullrem, plus __allmul, __allshl, __allshr and
__aullshr.

If you name a reviewer I'll send them to llvm-commits!

regards
Stefan


Re: [llvm-dev] The builtins library of compiler-rt is a performance HOG^WKILLER

Alberto Barbaro via llvm-dev
Reviewers: me, Simon Pilgrim, Sanjay Patel (the 3 most active X86 contributors), and probably Steve Canon, since he wrote the original routines.

~Craig



Re: [llvm-dev] The builtins library of compiler-rt is a performance HOG^WKILLER

Alberto Barbaro via llvm-dev


> On Dec 3, 2018, at 10:50 AM, Stefan Kanthak via llvm-dev <[hidden email]> wrote:
>
> "Craig Topper" <[hidden email]> wrote:
>
>> None of the "si" division routines will be used by x86.
>
> That was my expectation too.
>
>> They exist for targets that don't support the operations natively.
>> X86 supports them natively so will never use the library functions.
>
> So they SHOULD not be built (or at least not shipped) with the
> builtins library for x86.

I think you will find that down this path lies madness. Apple has tried for many years to limit which builtins get shipped in compiler-rt to just the smallest correct set to reduce the distribution size of clang. Over the years we've taken several different approaches, and they are all error prone and result in bugs. This problem stems from the fact that generation of builtin calls can be triggered by optimization settings, architecture, ABI, or the lunar cycle.

Initially we (Apple) maintained per-architecture lists of builtins, but those lists wouldn't get updated when new builtins got added, and we'd get bugs often after we shipped. Then I moved to an inverted system where we maintained lists to exclude, allowing that all new builtins always got added, but that has turned out to be a mess because it is really hard to know if it is safe to exclude something, and *oh wait the compiler changed and now it isn't safe anymore*.

IMO, and coming from some painful experience, I think including all builtin functions is the easiest way to make a less buggy release, until someone comes up with a definitive way for us to always know whether a given builtin can be generated by a given compiler.

-Chris


Re: [llvm-dev] The builtins library of compiler-rt is a performance HOG^WKILLER

Alberto Barbaro via llvm-dev
"Chris Bieneman" <[hidden email]> wrote:

>> On Dec 3, 2018, at 10:50 AM, Stefan Kanthak via llvm-dev <[hidden email]> wrote:
>>
>> "Craig Topper" <[hidden email]> wrote:
>>
>>> None of the "si" division routines will be used by x86.
>>
>> That was my expectation too.
>>
>>> They exist for targets that don't support the operations natively.
>>> X86 supports them natively so will never use the library functions.
>>
>> So they SHOULD not be built (or at least not shipped) with the
>> builtins library for x86.
>
> I think you will find that down this path lies madness. Apple has
> tried for many years to limit which builtins get shipped in compiler-rt
> to just the smallest correct set to reduce the distribution size of
> clang. Over the years we've taken several different approaches, and
> they are all error prone and result in bugs. This problem stems from
> the fact that generation of builtin calls can be triggered by
> optimization settings, architecture, ABI, or the lunar cycle.
>
> Initially we (Apple) maintained per-architecture lists of builtins,
> but those lists wouldn't get updated when new builtins got added,
> and we'd get bugs often after we shipped. Then I moved to an inverted
> system where we maintained lists to exclude, allowing that all new
> builtins always got added, but that has turned out to be a mess
> because it is really hard to know if it is safe to exclude something,
> and *oh wait the compiler changed and now it isn't safe anymore*.
>
> IMO, and coming from some painful experience, I think including all
> builtin functions is the easiest way to make less buggy release,
> until someone comes along and comes up with a definitive way for us
> to always know if a given builtin is possible to generate with a given
> compiler.

Thanks for this background information.
Now if such design decisions, and/or their outcome, like (for example)

| compiler-rt ships with all routines available for the supported
| architectures

were published on the compiler-rt web pages, I wouldn't need to ask.

regards
Stefan


Re: [llvm-dev] The builtins library of compiler-rt is a performance HOG^WKILLER

Alberto Barbaro via llvm-dev
On Mon, Dec 3, 2018 at 2:51 PM Stefan Kanthak via llvm-dev <[hidden email]> wrote:
> Hi @ll,
>
> LLVM-7.0.0-win32.exe contains and installs
> lib\clang\7.0.0\lib\windows\clang_rt.builtins-i386.lib
>
> The implementation of (at least) the multiplication and division
> routines __[u]{div,mod,divmod,mul}[sdt]i[34] shipped with this
> libraries SUCKS: they are factors SLOWER than even Microsoft's
> NOTORIOUS POOR implementation of 64-bit division shipped with
> MSVC and Windows!

I'm really happy that you're looking at making some of these routines better (and Craig and others have given excellent suggestions about how to go about this).

But in the future, please be more polite and respectful on the LLVM mailing lists. Insults, all-capital-letters, and inflammatory language are unnecessary and unhelpful in our community. You can tell us that "the performance is really bad" and we'll actually take that more seriously than the phrasing you've used in this email.

Anyway, I'm also looking forward to the improvements in this area.

-Chandler
