Macro-op fusion experiment

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

Macro-op fusion experiment

Nicolas Capens-2

Hi all,

 

x86 processors use macro-op fusion to merge together two instructions and execute them as one. So it's beneficial for the compiler to emit them as a pair.

 

Currently only compare and jump instructions get fused though. And I was wondering whether it also makes sense to fuse move and arithmetic instructions together, to form non-destructive instructions (which x86 lacks for regular instructions). For instance:

                8B C3 mov eax, ebx 

                03 C1 add eax, ecx

becomes

                8B C3 03 C1 add eax, ebx, ecx

 

There's no difference in the binary encoding; it's just considered one instruction at a logical level and inside the hardware (I'm assuming x86's RISC internals actually use non-destructive micro-operations).

 

So my question is, how do I define these fused instructions in LLVM? And how would I be able to estimate the potential speedup?

 

Thanks,

Nicolas


_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: Macro-op fusion experiment

Jakob Stoklund Olesen-2

On Apr 8, 2011, at 3:29 AM, Nicolas Capens wrote:

> x86 processors use macro-op fusion to merge together two instructions and execute them as one. So it's beneficial for the compiler to emit them as a pair.
>  
> Currently only compare and jump instructions get fused though. And I was wondering whether it also makes sense to fuse move and arithmetic instructions together, to form non-destructive instructions (which x86 lacks for regular instructions). For instance:
>                 8B C3 mov eax, ebx
>                 03 C1 add eax, ecx
> becomes
>                 8B C3 03 C1 add eax, ebx, ecx
>  
> There's no difference in the binary encoding; it's just considered one instruction at a logical level and inside the hardware (I'm assuming x86's RISC internals actually use non-destructive micro-operations).

Most x86 implementations use register renaming these days, so micro-operations are non-destructive, but they don't refer to architectural registers. They refer to a larger number of real registers.

Register copies are mostly free to execute except they increase code size and consume decoder resources. To my knowledge, they are not fused in the way you describe.

Intel's optimization reference manual describes which instructions can be fused. The Sandy Bridge processors fuse more pairs than previous generations, but the second instruction is always a conditional branch.

There is no need to define pseudo-instructions to support this. If you want to experiment, you could add a late pass that tries to form fusable pairs by pushing instructions down to the conditional branch. This should happen after register allocation where code is often inserted before a branch.

I would be interested to see the performance impact of such a pass.

/jakob

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: Macro-op fusion experiment

NAKAMURA Takumi
>>                 8B C3 mov eax, ebx
>>                 03 C1 add eax, ecx
>> becomes
>>                 8B C3 03 C1 add eax, ebx, ecx

In my understanding, twoaddr pass tends to emit such a sequence.

Though I don't have sandybridge, I have not measured.
Prior processors(intel and amd) might spend 1 ALU to execute "mov",
then mov - add must have dependency.

In contrast, the sequence below might be executed in parallel;
mov %ebx, %eax
add %ecx, %ebx
(I understand it might not be applicable in all cases)
Thoughts?

...Takumi

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: Macro-op fusion experiment

Jakob Stoklund Olesen-2

On Apr 8, 2011, at 9:56 AM, NAKAMURA Takumi wrote:

>>>                 8B C3 mov eax, ebx
>>>                 03 C1 add eax, ecx
>>> becomes
>>>                 8B C3 03 C1 add eax, ebx, ecx
>
> In my understanding, twoaddr pass tends to emit such a sequence.

Yes, it always does, and the coalescer tries very hard to eliminate the copy.

> Though I don't have sandybridge, I have not measured.
> Prior processors(intel and amd) might spend 1 ALU to execute "mov",
> then mov - add must have dependency.

I think you will find it is more complicated than that. A 'mov' usually doesn't need an ALU resource.

You should read about the 'reservation station' style register renaming.

http://en.wikipedia.org/wiki/Register_renaming
http://www.intel.com/Assets/PDF/manual/248966.pdf

/jakob

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: Macro-op fusion experiment

Nicolas Capens-2
Hi Jacob,

As far as I know, an x86 'mov' instruction always

On 08 Apr 2011, at 19:27, Jakob Stoklund Olesen <[hidden email]> wrote:

>
> On Apr 8, 2011, at 9:56 AM, NAKAMURA Takumi wrote:
>
>>>>                8B C3 mov eax, ebx
>>>>                03 C1 add eax, ecx
>>>> becomes
>>>>                8B C3 03 C1 add eax, ebx, ecx
>>
>> In my understanding, twoaddr pass tends to emit such a sequence.
>
> Yes, it always does, and the coalescer tries very hard to eliminate the copy.
>
>> Though I don't have sandybridge, I have not measured.
>> Prior processors(intel and amd) might spend 1 ALU to execute "mov",
>> then mov - add must have dependency.
>
> I think you will find it is more complicated than that. A 'mov' usually doesn't need an ALU resource.
>
> You should read about the 'reservation station' style register renaming.
>
> http://en.wikipedia.org/wiki/Register_renaming
> http://www.intel.com/Assets/PDF/manual/248966.pdf
>
> /jakob
>

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: Macro-op fusion experiment

Nicolas Capens-2
In reply to this post by Jakob Stoklund Olesen-2
Hi Jacob,

As far as I know, an x86 'mov' instruction always uses an ALU resource. According to Agner Fog's documents (
http://www.agner.org/optimize/), it can execute on port 0, 1 or 5 on recent architectures though. So it's not that likely to be resource limited. But it still occupies an instruction slot throughout the entire pipeline, costing power and potentially limiting other actual arithmetic instructions from scheduling optimally. Also, it has a latency of 1 cycle, while non-destructive instructions would shorten the latency of dependent instructions.

My immediate concern is getting a reasonable estimate for how often this macro-op fusion could be performed. This could then be used to evaluate whether it's worth the added decoder complexity.

Cheers,
Nicolas


On Fri, Apr 8, 2011 at 7:27 PM, Jakob Stoklund Olesen <[hidden email]> wrote:

On Apr 8, 2011, at 9:56 AM, NAKAMURA Takumi wrote:

>>>                 8B C3 mov eax, ebx
>>>                 03 C1 add eax, ecx
>>> becomes
>>>                 8B C3 03 C1 add eax, ebx, ecx
>
> In my understanding, twoaddr pass tends to emit such a sequence.

Yes, it always does, and the coalescer tries very hard to eliminate the copy.

> Though I don't have sandybridge, I have not measured.
> Prior processors(intel and amd) might spend 1 ALU to execute "mov",
> then mov - add must have dependency.

I think you will find it is more complicated than that. A 'mov' usually doesn't need an ALU resource.

You should read about the 'reservation station' style register renaming.

http://en.wikipedia.org/wiki/Register_renaming
http://www.intel.com/Assets/PDF/manual/248966.pdf

/jakob



_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: Macro-op fusion experiment

Jakob Stoklund Olesen-2

On Apr 17, 2011, at 9:59 AM, Nicolas Capens wrote:

> My immediate concern is getting a reasonable estimate for how often this macro-op fusion could be performed. This could then be used to evaluate whether it's worth the added decoder complexity.

In that case, just look at the generated code. I don't think any pass is inserting instructions between 'mov' and two-address arithmetic instructions.

/jakob


_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: Macro-op fusion experiment

Nicolas Capens-2
Hi Jacob,

I've considered that, but it would require either decoding the binary code, or detecting these instruction pairs during assembly output. I'm not sure which would work best and how to get started. Any and all ideas are greatly appreciated.

Thanks!
Nicolas


On 17 Apr 2011, at 20:19, Jakob Stoklund Olesen <[hidden email]> wrote:

>
> On Apr 17, 2011, at 9:59 AM, Nicolas Capens wrote:
>
>> My immediate concern is getting a reasonable estimate for how often this macro-op fusion could be performed. This could then be used to evaluate whether it's worth the added decoder complexity.
>
> In that case, just look at the generated code. I don't think any pass is inserting instructions between 'mov' and two-address arithmetic instructions.
>
> /jakob
>

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev