[llvm-dev] Portable multiplication 64 x 64 -> 128 for int128 reimplementation

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

[llvm-dev] Portable multiplication 64 x 64 -> 128 for int128 reimplementation

Finkel, Hal J. via llvm-dev
Hi,

For some maybe dumb reasons I try to write a portable version of int128.

What is very valuable for this implementation is access to MUL instruction on x86 which provides full 64 x 64 -> 128 bit multiplication. An equally useful on ARM would be UMULH instruction.

Well, the way you can access this on clang / GCC is to use __int128 type or use inline assembly. MSVC provides an intrinsic for this instruction. This defeats the idea of portable int128 reimplementation and makes constexpr implementation of multiplication at least inconvenient.

Maybe there is a hope for me in LLVM. Is there any pattern matcher that is producing MUL instruction of bigger type?
If not, would it be good idea to teach LLVM about it?

Bests,
Paweł

_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Reply | Threaded
Open this post in threaded view
|

Re: [llvm-dev] [cfe-dev] Portable multiplication 64 x 64 -> 128 for int128 reimplementation

Finkel, Hal J. via llvm-dev
Hi Pawel,

There is the _mulx_u64 intrinsic, but it currently requires the hardware flag "-mbmi2".

On Clang 3.8.1 and earlier, the _addcarry_u64 and _subborrow_u64 intrinsics required the hardware flag `-madx`, even though they didn't use the hardware ADX/ADOX instructions. Modern GCC and Clang permit the use of these intrinsics (to generate ADC) even in the absence of `-madx`.

I think it would be a very good idea for Clang to support _mulx_u64 (to generate MUL) even in the absence of `-mbmi2`.

–Arthur


On Sat, Dec 29, 2018 at 6:03 PM Paweł Bylica via cfe-dev <[hidden email]> wrote:
Hi,

For some maybe dumb reasons I try to write a portable version of int128.

What is very valuable for this implementation is access to MUL instruction on x86 which provides full 64 x 64 -> 128 bit multiplication. An equally useful on ARM would be UMULH instruction.

Well, the way you can access this on clang / GCC is to use __int128 type or use inline assembly. MSVC provides an intrinsic for this instruction. This defeats the idea of portable int128 reimplementation and makes constexpr implementation of multiplication at least inconvenient.

Maybe there is a hope for me in LLVM. Is there any pattern matcher that is producing MUL instruction of bigger type?
If not, would it be good idea to teach LLVM about it?

Bests,
Paweł
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Reply | Threaded
Open this post in threaded view
|

Re: [llvm-dev] [cfe-dev] Portable multiplication 64 x 64 -> 128 for int128 reimplementation

Finkel, Hal J. via llvm-dev
_mulx_u64 only exists when the target is x86_64. That's still not very portable. I'm not opposed to removing the bmi2 check, but gcc also has the same check so it doesn't improve portability much.

~Craig


On Sat, Dec 29, 2018 at 4:44 PM Arthur O'Dwyer via llvm-dev <[hidden email]> wrote:
Hi Pawel,

There is the _mulx_u64 intrinsic, but it currently requires the hardware flag "-mbmi2".

On Clang 3.8.1 and earlier, the _addcarry_u64 and _subborrow_u64 intrinsics required the hardware flag `-madx`, even though they didn't use the hardware ADX/ADOX instructions. Modern GCC and Clang permit the use of these intrinsics (to generate ADC) even in the absence of `-madx`.

I think it would be a very good idea for Clang to support _mulx_u64 (to generate MUL) even in the absence of `-mbmi2`.

–Arthur


On Sat, Dec 29, 2018 at 6:03 PM Paweł Bylica via cfe-dev <[hidden email]> wrote:
Hi,

For some maybe dumb reasons I try to write a portable version of int128.

What is very valuable for this implementation is access to MUL instruction on x86 which provides full 64 x 64 -> 128 bit multiplication. An equally useful on ARM would be UMULH instruction.

Well, the way you can access this on clang / GCC is to use __int128 type or use inline assembly. MSVC provides an intrinsic for this instruction. This defeats the idea of portable int128 reimplementation and makes constexpr implementation of multiplication at least inconvenient.

Maybe there is a hope for me in LLVM. Is there any pattern matcher that is producing MUL instruction of bigger type?
If not, would it be good idea to teach LLVM about it?

Bests,
Paweł
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Reply | Threaded
Open this post in threaded view
|

Re: [llvm-dev] [cfe-dev] Portable multiplication 64 x 64 -> 128 for int128 reimplementation

Finkel, Hal J. via llvm-dev
Hi Arthur, Craig,

Thanks for you comments about GCC/Clang intrinsics. I never considered using them, but they might be better alternative to inline assembly.
Is there a one for regular MUL?

Anyway, I want to go the opposite direction. If I can I relay on compiler's optimizations. If I want to use MULX in Clang I do it like that:

unsigned long mulx(unsigned long x, unsigned long y, unsigned long* hi)
{
auto p = (unsigned __int128){x} * y;
*hi = static_cast<unsigned long>(p >> 64);
return static_cast<unsigned long>(p);
}

https://godbolt.org/z/PbgFb9

If compiled with -mbmi2 -mtune=generic it just uses MULX instruction.

mulx(unsigned long, unsigned long, unsigned long*):
mov rcx, rdx
mov rdx, rsi
mulx rdx, rax, rdi
mov qword ptr [rcx], rdx
ret

What I want to do it move it further - rewrite the above mulx() helper without using __int128 type in a way that a compiler would recognize that it should use MUL/MULX instruction.

A possible implementation looks like


uint64_t mul_full_64_generic(uint64_t x, uint64_t y, uint64_t* hi)
{
uint64_t xl = x & 0xffffffff;
uint64_t xh = x >> 32;
uint64_t yl = y & 0xffffffff;
uint64_t yh = y >> 32;

uint64_t t = xl * yl;
uint64_t l = t & 0xffffffff;
uint64_t h = t >> 32;

t = xh * yl;
t += h;
h = t >> 32;

t = xl * yh + (t & 0xffffffff);
l |= t << 32;
*hi = xh * yh + h + (t >> 32);
return l;
}

As expected, Clang is not able to match this pattern currently. 

If we want to implement this optimization in Clang, there are some questions I have:
1. Can we prove this pattern is equivalent of MUL 64x64 -> 128?
2. What pass this optimization should be added to?
3. Can this pattern be split into smaller ones? E.g. UMULH.

Paweł



On Sun, Dec 30, 2018 at 2:34 AM Craig Topper <[hidden email]> wrote:
_mulx_u64 only exists when the target is x86_64. That's still not very portable. I'm not opposed to removing the bmi2 check, but gcc also has the same check so it doesn't improve portability much.

~Craig


On Sat, Dec 29, 2018 at 4:44 PM Arthur O'Dwyer via llvm-dev <[hidden email]> wrote:
Hi Pawel,

There is the _mulx_u64 intrinsic, but it currently requires the hardware flag "-mbmi2".

On Clang 3.8.1 and earlier, the _addcarry_u64 and _subborrow_u64 intrinsics required the hardware flag `-madx`, even though they didn't use the hardware ADX/ADOX instructions. Modern GCC and Clang permit the use of these intrinsics (to generate ADC) even in the absence of `-madx`.

I think it would be a very good idea for Clang to support _mulx_u64 (to generate MUL) even in the absence of `-mbmi2`.

–Arthur


On Sat, Dec 29, 2018 at 6:03 PM Paweł Bylica via cfe-dev <[hidden email]> wrote:
Hi,

For some maybe dumb reasons I try to write a portable version of int128.

What is very valuable for this implementation is access to MUL instruction on x86 which provides full 64 x 64 -> 128 bit multiplication. An equally useful on ARM would be UMULH instruction.

Well, the way you can access this on clang / GCC is to use __int128 type or use inline assembly. MSVC provides an intrinsic for this instruction. This defeats the idea of portable int128 reimplementation and makes constexpr implementation of multiplication at least inconvenient.

Maybe there is a hope for me in LLVM. Is there any pattern matcher that is producing MUL instruction of bigger type?
If not, would it be good idea to teach LLVM about it?

Bests,
Paweł
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Reply | Threaded
Open this post in threaded view
|

Re: [llvm-dev] [cfe-dev] Portable multiplication 64 x 64 -> 128 for int128 reimplementation

Finkel, Hal J. via llvm-dev
On Mon, Dec 31, 2018 at 12:46 AM Paweł Bylica via cfe-dev
<[hidden email]> wrote:

>
> Hi Arthur, Craig,
>
> Thanks for you comments about GCC/Clang intrinsics. I never considered using them, but they might be better alternative to inline assembly.
> Is there a one for regular MUL?
>
> Anyway, I want to go the opposite direction. If I can I relay on compiler's optimizations. If I want to use MULX in Clang I do it like that:
>
> unsigned long mulx(unsigned long x, unsigned long y, unsigned long* hi)
> {
> auto p = (unsigned __int128){x} * y;
> *hi = static_cast<unsigned long>(p >> 64);
> return static_cast<unsigned long>(p);
> }
>
> https://godbolt.org/z/PbgFb9
>
> If compiled with -mbmi2 -mtune=generic it just uses MULX instruction.
>
> mulx(unsigned long, unsigned long, unsigned long*):
> mov rcx, rdx
> mov rdx, rsi
> mulx rdx, rax, rdi
> mov qword ptr [rcx], rdx
> ret
>
> What I want to do it move it further - rewrite the above mulx() helper without using __int128 type in a way that a compiler would recognize that it should use MUL/MULX instruction.
>
> A possible implementation looks like
>
>
> uint64_t mul_full_64_generic(uint64_t x, uint64_t y, uint64_t* hi)
> {
> uint64_t xl = x & 0xffffffff;
> uint64_t xh = x >> 32;
> uint64_t yl = y & 0xffffffff;
> uint64_t yh = y >> 32;
>
> uint64_t t = xl * yl;
> uint64_t l = t & 0xffffffff;
> uint64_t h = t >> 32;
>
> t = xh * yl;
> t += h;
> h = t >> 32;
>
> t = xl * yh + (t & 0xffffffff);
> l |= t << 32;
> *hi = xh * yh + h + (t >> 32);
> return l;
> }
>
> As expected, Clang is not able to match this pattern currently.
>
> If we want to implement this optimization in Clang, there are some questions I have:
> 3. Can this pattern be split into smaller ones? E.g. UMULH.
The transform should be done on LLVM IR.

> 2. What pass this optimization should be added to?
Compare:
https://godbolt.org/z/Blx1tp
basically, you need to match&transform the latter LLVM IR into the former.

Where?
I'm not sure there is a dedicated place for such transforms.
I would guess AggressiveInstCombine, since that pattern is somewhat large.

> 1. Can we prove this pattern is equivalent of MUL 64x64 -> 128?
You can, by feeding both of these IR's into https://github.com/nunoplopes/alive

> Paweł
Roman.

> On Sun, Dec 30, 2018 at 2:34 AM Craig Topper <[hidden email]> wrote:
>>
>> _mulx_u64 only exists when the target is x86_64. That's still not very portable. I'm not opposed to removing the bmi2 check, but gcc also has the same check so it doesn't improve portability much.
>>
>> ~Craig
>>
>>
>> On Sat, Dec 29, 2018 at 4:44 PM Arthur O'Dwyer via llvm-dev <[hidden email]> wrote:
>>>
>>> Hi Pawel,
>>>
>>> There is the _mulx_u64 intrinsic, but it currently requires the hardware flag "-mbmi2".
>>> https://github.com/Quuxplusone/WideIntProofOfConcept/blob/master/wider.h#L89-L99
>>>
>>> On Clang 3.8.1 and earlier, the _addcarry_u64 and _subborrow_u64 intrinsics required the hardware flag `-madx`, even though they didn't use the hardware ADX/ADOX instructions. Modern GCC and Clang permit the use of these intrinsics (to generate ADC) even in the absence of `-madx`.
>>>
>>> I think it would be a very good idea for Clang to support _mulx_u64 (to generate MUL) even in the absence of `-mbmi2`.
>>>
>>> –Arthur
>>>
>>>
>>> On Sat, Dec 29, 2018 at 6:03 PM Paweł Bylica via cfe-dev <[hidden email]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> For some maybe dumb reasons I try to write a portable version of int128.
>>>>
>>>> What is very valuable for this implementation is access to MUL instruction on x86 which provides full 64 x 64 -> 128 bit multiplication. An equally useful on ARM would be UMULH instruction.
>>>>
>>>> Well, the way you can access this on clang / GCC is to use __int128 type or use inline assembly. MSVC provides an intrinsic for this instruction. This defeats the idea of portable int128 reimplementation and makes constexpr implementation of multiplication at least inconvenient.
>>>>
>>>> Maybe there is a hope for me in LLVM. Is there any pattern matcher that is producing MUL instruction of bigger type?
>>>> If not, would it be good idea to teach LLVM about it?
>>>>
>>>> Bests,
>>>> Paweł
>>>> _______________________________________________
>>>> cfe-dev mailing list
>>>> [hidden email]
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> [hidden email]
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
> _______________________________________________
> cfe-dev mailing list
> [hidden email]
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Reply | Threaded
Open this post in threaded view
|

Re: [llvm-dev] [cfe-dev] Portable multiplication 64 x 64 -> 128 for int128 reimplementation

Finkel, Hal J. via llvm-dev
In reply to this post by Finkel, Hal J. via llvm-dev
On Sun, Dec 30, 2018 at 4:46 PM Paweł Bylica <[hidden email]> wrote:
Hi Arthur, Craig,

Thanks for you comments about GCC/Clang intrinsics. I never considered using them, but they might be better alternative to inline assembly.
Is there a one for regular MUL?

I'm not sure, but I think there currently does not exist any intrinsic to generate the top half of a 64x64=128 multiply, except for `_mulx_64`.
If Clang stopped requiring `-mbmi2`, I would then expect the `_mulx_64` intrinsic to generate a regular MUL instruction; similar to how_addcarry_u64 generates ADCX/ADOX when available/useful and a regular ADC otherwise.
MSVC calls this intrinsic `_umul128`, and on MSVC it does generate a regular MUL instruction rather than forcing MULX.


Anyway, I want to go the opposite direction. [...] mulx() helper without using __int128 type in a way that a compiler would recognize that it should use MUL/MULX instruction.
A possible implementation looks like [SNIPPED]
 
Interesting trivia: There are at least three ways to write the final "return" statement in this function. Clang generates different code for each one of them. If someone does pursue writing an InstCombine optimization for this, it would be good to generate the same efficient code for all three versions.

–Arthur

_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Reply | Threaded
Open this post in threaded view
|

Re: [llvm-dev] [cfe-dev] Portable multiplication 64 x 64 -> 128 for int128 reimplementation

Finkel, Hal J. via llvm-dev
On trunk we never generate MULX. We used to blindly use it anytime bmi2 was enabled, but its a longer encoding and isn't a guaranteed register allocation improvement. So I took it out a few weeks ago. We need a more precise heuristic for when to use it.

LLVM trunk will never generate ADCX/ADOX either. This was removed in September. We used to inconsistently generate them when adx was enabled unless we could use the RMW form or the immediate form of ADC. But that didn't really make any sense. The only reason to use ADCX or ADOX is when you want to carefully manage the flags to have two interleaved dependency chains. But that would require a special analysis to determine when to do that and we don't have that.

~Craig


On Mon, Dec 31, 2018 at 1:21 PM Arthur O'Dwyer <[hidden email]> wrote:
On Sun, Dec 30, 2018 at 4:46 PM Paweł Bylica <[hidden email]> wrote:
Hi Arthur, Craig,

Thanks for you comments about GCC/Clang intrinsics. I never considered using them, but they might be better alternative to inline assembly.
Is there a one for regular MUL?

I'm not sure, but I think there currently does not exist any intrinsic to generate the top half of a 64x64=128 multiply, except for `_mulx_64`.
If Clang stopped requiring `-mbmi2`, I would then expect the `_mulx_64` intrinsic to generate a regular MUL instruction; similar to how_addcarry_u64 generates ADCX/ADOX when available/useful and a regular ADC otherwise.
MSVC calls this intrinsic `_umul128`, and on MSVC it does generate a regular MUL instruction rather than forcing MULX.


Anyway, I want to go the opposite direction. [...] mulx() helper without using __int128 type in a way that a compiler would recognize that it should use MUL/MULX instruction.
A possible implementation looks like [SNIPPED]
 
Interesting trivia: There are at least three ways to write the final "return" statement in this function. Clang generates different code for each one of them. If someone does pursue writing an InstCombine optimization for this, it would be good to generate the same efficient code for all three versions.

–Arthur

_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Reply | Threaded
Open this post in threaded view
|

Re: [llvm-dev] [cfe-dev] Portable multiplication 64 x 64 -> 128 for int128 reimplementation

Finkel, Hal J. via llvm-dev
Thanks again for all the comments.

As suggested, I created my first pattern match in AggresiveInstCombine: https://reviews.llvm.org/D56214.

Suggestions welcome (preferably in the review).

Bests,
Paweł

On Mon, Dec 31, 2018 at 10:41 PM Craig Topper <[hidden email]> wrote:
On trunk we never generate MULX. We used to blindly use it anytime bmi2 was enabled, but its a longer encoding and isn't a guaranteed register allocation improvement. So I took it out a few weeks ago. We need a more precise heuristic for when to use it.

LLVM trunk will never generate ADCX/ADOX either. This was removed in September. We used to inconsistently generate them when adx was enabled unless we could use the RMW form or the immediate form of ADC. But that didn't really make any sense. The only reason to use ADCX or ADOX is when you want to carefully manage the flags to have two interleaved dependency chains. But that would require a special analysis to determine when to do that and we don't have that.

~Craig


On Mon, Dec 31, 2018 at 1:21 PM Arthur O'Dwyer <[hidden email]> wrote:
On Sun, Dec 30, 2018 at 4:46 PM Paweł Bylica <[hidden email]> wrote:
Hi Arthur, Craig,

Thanks for you comments about GCC/Clang intrinsics. I never considered using them, but they might be better alternative to inline assembly.
Is there a one for regular MUL?

I'm not sure, but I think there currently does not exist any intrinsic to generate the top half of a 64x64=128 multiply, except for `_mulx_64`.
If Clang stopped requiring `-mbmi2`, I would then expect the `_mulx_64` intrinsic to generate a regular MUL instruction; similar to how_addcarry_u64 generates ADCX/ADOX when available/useful and a regular ADC otherwise.
MSVC calls this intrinsic `_umul128`, and on MSVC it does generate a regular MUL instruction rather than forcing MULX.


Anyway, I want to go the opposite direction. [...] mulx() helper without using __int128 type in a way that a compiler would recognize that it should use MUL/MULX instruction.
A possible implementation looks like [SNIPPED]
 
Interesting trivia: There are at least three ways to write the final "return" statement in this function. Clang generates different code for each one of them. If someone does pursue writing an InstCombine optimization for this, it would be good to generate the same efficient code for all three versions.

–Arthur

_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev