Introducing a branch optimization and prediction pass

Török Edwin
Hi,

I would like to transform unpredictable conditional branches with a
small body into cmov instructions to reduce branch miss penalty.
LLVM generates cmov/setcc instructions when SelectInst is used. The idea
is to transform regular branches into selects, when that is possible and
profitable.
However this needs branch prediction info, and it seems LLVM doesn't
provide that, AFAICT.
Could gcc's branch predictor be used here, or are there plans for an
llvm-specific branch predictor?
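
For concreteness, the branch-to-select shape in C terms (my
illustration, not from the original mail; the ternary is what maps to a
SelectInst, which codegen can lower to cmov/setcc):

static int max_branchy(int a, int b)
{
        int m = b;
        if (a > b)      /* conditional branch; mispredicts on random data */
                m = a;
        return m;       /* join point */
}

static int max_select(int a, int b)
{
        return a > b ? a : b;  /* select form; codegen can emit a cmov */
}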

This has performance benefits if the branch is otherwise unpredictable,
for example:

for(i=0;i<len;i++) {
  if(isalnum(buf[i]))
    *p++ = buf[i];
}

Assume the input is binary data: you can't know in advance whether the
branch is taken or not. Either prediction will lead to a miss penalty X%
of the time.
The conditional write can be transformed using a dummy variable [2].
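
Hand-applying that rewrite to the loop above gives something like the
following (my sketch of the conditional-write idiom from [2]; 'dummy' is
a stack byte that absorbs the stores we want to discard; needs
<ctype.h> and <stddef.h>):

static void copy_alnum_branchfree(const unsigned char *buf, size_t len,
                                  unsigned char *p)
{
        unsigned char dummy;
        size_t i;
        for (i = 0; i < len; i++) {
                unsigned char c = buf[i];
                int keep = isalnum(c) != 0;
                unsigned char *d = keep ? p : &dummy; /* select store target */
                p = keep ? p + 1 : p;                 /* bump only on a keep */
                *d = c;                               /* store always executes */
        }
}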

But if the branch's body is big, or if it is statically/dynamically
predictable, this optimization must not be applied.
This optimization needs the following:
* determine if the transformation can be applied:
    * if the body has call/return/asm it cannot be applied
    * if volatile data is accessed within the loop
    * any other situations?
* determine if the branch is statically/dynamically predictable:
    * statically: if the compiler can predict the branch, don't apply this
optimization
    * dynamically: can the branch be predicted based on execution history
(i.e. can the CPU predict it)?
* how to determine if a branch's body is small?
    Apply the transformation if execution_cost_of(body) +
execution_cost_of(cmov)/branch_miss_probability < branch_miss_penalty,
    or simply: execution_cost_of(body) < branch_miss_penalty.
    If the body is more expensive, then it is faster to take the miss
penalty (and not execute the body).
    But miss penalties and execution costs are target specific, so use
TargetData here (and teach it about penalties), or
    use some heuristic based on the number of instructions in the basic
block? (A sketch of this check follows the list.)
* if we cannot transform all instructions from the basic block into
instr+cmov, then it is not worth applying this optimization (since we'll
have a branch anyway).
    Does Codegen guarantee to always transform Select into CMOV/SETCC (or
equivalent), on targets that support it?
* if the target doesn't support conditional moves, we must not apply this
transformation
* memory ordering/multi-thread safety:
if(pthread_mutex_trylock(...)) { *p++ = 4; }
Transforming this into a cmov should be safe, because we are using a
dummy variable on the stack as the destination [2] if the condition is
false.
What about memory reads? Those might not always be safe to move out of
the branch's body (because the condition could be required to ensure the
access is valid), and hoisting them could actually decrease performance
when the guarding condition is there to prevent L2 cache misses, for
example:
if(fast_filter_table[value] == match && slow_filter_table[value] ==
match) { ... do something ... }

If we unconditionally read from slow_filter_table it could actually
reduce performance (assume slow_filter_table is huge), and it might not
be legal because we would violate short-circuit evaluation.

(BTW, the branch predictor should say this branch is predictable, so we
wouldn't apply the transformation at all, but we cannot rely on the
branch predictor to provide short-circuit evaluation guarantees.)
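
To make the size check concrete, here is a minimal sketch of the
profitability test from the list above (a hypothetical helper; nothing
here is an existing LLVM API, and the costs would have to come from the
target description):

/* All costs in cycles; miss_probability in (0,1]. */
static int small_enough_to_speculate(unsigned body_cost, unsigned cmov_cost,
                                     double miss_probability,
                                     unsigned miss_penalty)
{
        if (miss_probability > 0.0)  /* full form from the list above */
                return body_cost + cmov_cost / miss_probability < miss_penalty;
        return body_cost < miss_penalty;  /* simplified form */
}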

Thoughts?  

P.S.: is somebody working on something similar, or are there plans to
implement something similar?

[1] "Intel® 64 and IA-32 Architectures Optimization Reference Manual",
section 3.4.1.1 Eliminating Branches:

"Assembly/Compiler Coding Rule 2. (M impact, ML generality) Use the SETCC
and CMOV instructions to eliminate unpredictable conditional branches where
possible. Do not do this for predictable branches. Do not use these
instructions to
eliminate all unpredictable conditional branches (because using these
instructions
will incur execution overhead due to the requirement for executing both
paths of a
conditional branch)."

[2] "Software Optimization Guide for AMD64 Processors", section 6.3
"Branches That Depend on Random Data": "Avoid conditional branches that
depend on random data, as these branches are difficult to predict.",
Examples/Conditional Write

Best regards,
--Edwin

Re: Introducing a branch optimization and prediction pass

Evan Cheng-2

On Mar 29, 2008, at 6:02 AM, Török Edwin wrote:

> Hi,
>
> I would like to transform unpredictable conditional branches with a
> small body into cmov instructions to reduce branch miss penalty.
> LLVM generates cmov/setcc instructions when SelectInst is used. The idea
> is to transform regular branches into selects, when that is possible and
> profitable.

LLVM is already aggressively turning branches into selects as far as I  
can see.

>
> However this needs branch prediction info, and it seems LLVM doesn't
> provide that AFAICT.
> Could gcc's branch predictor be used here, or are there plans for an
> llvm-specific branch predictor?

Sure, either the FE or some earlier profiling pass should provide some
branch prediction info.

>
>
> This has performance benefits if the branch is otherwise unpredictable,
> for example:
>
> for(i=0;i<len;i++) {
>  if(isalnum(buf[i]))
>    *p++ = buf[i];
> }
>
> Assume the input data is binary data, you can't know in advance if the
> branch is taken or not. Either prediction will lead to a miss penalty X%
> of the time.
> The conditional write can be transformed using a dummy variable [2].

Ok. You want to speculatively execute the code and only write back
the result if the predicate is true, right?

>
>
> But if the branch's body is big, or if it is statically/dynamically
> predictable, this optimization must not be applied.
> This optimization needs the following:
> * determine if the transformation can be applied:
>    * if the body has call/return/asm it cannot be applied
>    * if volatile data is accessed within the loop
>    * any other situations?

If there are any instructions with side effects then this cannot be  
done.

>
> * determine if branch is statically/dynamically predictable:
>    * statically: if compiler can predict branch, don't apply this
> optimization
>    * dynamically: can branch be predicted based on execution history
> (i.e. can CPU predict it?)

Based on profiling data from previous runs? Or dynamic profiling in a
JIT situation?

>
> * how to determine if a branch's body is small?
>    if execution_cost_of(body) +
> execution_cost_of(cmov)/branch_miss_probability < branch_miss_penalty,
>    or simply: execution_cost_of(body) < branch_miss_penalty
>    If it is bigger, then it is faster if we take the miss penalty (and
> don't execute the body)
>    But miss penalties and execution costs are target specific, so use
> TargetData here (and teach it about penalties), or
>    use some heuristic based on the number of instructions in the basic block?

Sounds reasonable. But it's potentially more complicated than this for
a target like x86 that has very few registers. The speculatively
executed code will clobber registers, so it has significant cost if the
result of the computation is not written back.

>
> * if we cannot transform all instructions from the basic block into
> instr+cmov, then it is not worth applying this optimization (since we'll
> have a branch anyway).

Sure. It might be worthwhile to extend the if-converter to do this.
The first step is to turn it on for x86 to convert blocks which are
entirely moves.

>
> Does Codegen guarantee to always transform Select into CMOV/SETCC (or
> equivalent), on targets that support it?

Right.

>
> * if target doesn't support conditional moves, we must not apply this
> transformation

Correct.

>
> * memory ordering/multi-thread safety:
> if(pthread_mutex_trylock(...)) { *p++ = 4; }
> Transforming this into a cmov should be safe, because we are using a
> dummy variable on the stack as the destination [2] if the condition is
> false.
> What about memory reads? Those might not always be safe to move out of
> the branch's body (because the condition could be required to ensure
> the access is valid), and hoisting them could actually decrease
> performance when the guarding condition is there to prevent L2 cache
> misses, for example:
> if(fast_filter_table[value] == match && slow_filter_table[value] ==
> match) { ... do something ... }

That's something to worry about later. :-)

>
>
> If we unconditionally read from slow_filter_table it could actually
> reduce performance (assume slow_filter_table is huge), and it might not
> be legal because we would violate short-circuit evaluation.

Right, speculative loads would require much more analysis to ensure  
safety. This might be something you don't want to do in step 1.

>
>
> (BTW, the branch predictor should say this branch is predictable, so we
> wouldn't apply the transformation at all, but we cannot rely on the
> branch predictor to provide short-circuit evaluation guarantees.)
>
> Thoughts?

Adding branch prediction info is obviously useful even if speculation
is not performed. There are always CFG optimization passes (and perhaps
isel passes) that can make use of it. I am not sure whether speculation
can actually have a positive impact on x86. It's something that requires
some experimentation if you are interested.

Evan


>
>
> P.S.: is somebody working on something similar, or are there plans to
> implement something similar?
>
> [1] "Intel® 64 and IA-32 Architectures Optimization Reference Manual",
> section 3.4.1.1 Eliminating Branches:
>
> "Assembly/Compiler Coding Rule 2. (M impact, ML generality) Use the  
> SETCC
> and CMOV instructions to eliminate unpredictable conditional  
> branches where
> possible. Do not do this for predictable branches. Do not use these
> instructions to
> eliminate all unpredictable conditional branches (because using these
> instructions
> will incur execution overhead due to the requirement for executing  
> both
> paths of a
> conditional branch)."
>
> [2] "Software Optimization Guide for AMD64 Processors", section 6.3
> "Branches That Depend on Random Data": "Avoid conditional branches  
> that
> depend on random data, as these branches are difficult to predict.",
> Examples/Conditional Write
>
> Best regards,
> --Edwin


Re: Introducing a branch optimization and prediction pass

Török Edwin
Evan Cheng wrote:

> On Mar 29, 2008, at 6:02 AM, Török Edwin wrote:
>
>> Hi,
>>
>> I would like to transform unpredictable conditional branches with a
>> small body into cmov instructions to reduce branch miss penalty.
>> LLVM generates cmov/setcc instructions when SelectInst is used. The idea
>> is to transform regular branches into selects, when that is possible and
>> profitable.
>
> LLVM is already aggressively turning branches into selects as far as I
> can see.

Does this happen in Codegen?
I tried some simple examples a while ago, and IIRC llvm missed some
opportunities to use cmov.

>> However this needs branch prediction info, and it seems LLVM doesn't
>> provide that, AFAICT.
>> Could gcc's branch predictor be used here, or are there plans for an
>> llvm-specific branch predictor?
>
> Sure, either the FE or some earlier profiling pass should provide some
> branch prediction info.

Ok. At least one of the above should be implemented before this
optimization can have a real use.
However, for experimentation we could try to apply this optimization
assuming each branch is unpredictable.
At least we'll get an idea of the worst performance regression that
could happen if the branch prediction info is wrong ;)

> Ok. You want to speculatively execute the code and only write back
> the result if the predicate is true, right?

Yes, I speculatively execute the code.
However, the result is always written, because I cannot use 'cmov' to
write to memory (just registers).
So the solution proposed in the AMD manual is to use cmov to select the
address to write to. If the predicate is false, the value is written to
a dummy address on the stack:

cmp    eax, ebx          ; a < b ?
cmovge edi, esi          ; ptr = (a >= b) ? &dummy : &c[i]
cmovl  ecx, edx          ; i = (a < b) ? i + 1 : i
mov    [edi], eax        ; *ptr = a
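
For reference, the C shape this sequence implements (my reconstruction,
not copied from the AMD manual):

/* if (a < b) c[i++] = a;  rewritten branch-free */
static size_t cond_write(int a, int b, int *c, size_t i)
{
        int dummy;
        int *ptr = (a < b) ? &c[i] : &dummy;  /* the cmovge above */
        i        = (a < b) ? i + 1 : i;       /* the cmovl above */
        *ptr = a;                             /* store always executes */
        return i;
}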


IIRC gcc-4.2 didn't generate cmov for this case.

I'll write some simple examples, build with llvm, gcc-4.2, gcc-4.3, and
icc, then measure speeds. Last time I did that on a simple example, and
the result was 200 Mb/s vs. 1Gb/s.
Then apply this optimization (either by hand or using a small prototype)
and see if there are any significant improvements.

> If there are any instructions with side effects then this cannot be  
> done.
>  

Agreed.

>  
>> * determine if branch is statically/dynamically predictable:
>>    * statically: if compiler can predict branch, don't apply this
>> optimization
>>    * dynamically: can branch be predicted based on execution history
>> (i.e. can CPU predict it?)
>>    
>
> Based on profiling data from previous runs? Or dynamic profiling in  
> JIT situation?
>  

Good idea. However, in the absence of profiling info there should be
some heuristics; I am not sure what those could be ATM.

>  
>> * how to determine if a branch's body is small?
>>    if execution_cost_of(body) +
>> execution_cost_of(cmov)/branch_miss_probability < branch_miss_penalty,
>>    or simply: execution_cost_of(body) < branch_miss_penalty
>>    If it is bigger, then it is faster if we take the miss penalty (and
>> don't execute the body)
>>    But miss-penalties, and execution_costs are target specific, so use
>> TargetData here (and teach it about penalties), or
>>    use some heuristic based on the number of instructions in the basic block?
>>    
>
> Sounds reasonable. But it's potentially more complicated than this for
> a target like x86 that has very few registers. The speculatively
> executed code will clobber registers, so it has significant cost if the
> result of the computation is not written back.
>  

I experimented on x86-64 a while ago, which doesn't have such a shortage
of registers, and the results were promising.
I have some code that processes data in a loop, but due to branch
mispredictions it can "only" work at ~200 Mb/s.
Eliminating the branches could speed it up to 1Gb/s or more, since there
are no other penalties in that code (L1 data cache misses are rare; the
only bottleneck is branch misprediction).

I'll redo my tests on x86 too (see above), and maybe we can come up with
an empirical formula to use for x86.
Something like associating a cost with register-pressure increase?
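
One possible shape for such a heuristic (purely illustrative; none of
these names exist in LLVM, and the constants would need per-target
tuning):

/* Charge speculation for the extra live values it creates: each value
 * beyond the registers we can spare costs roughly one spill/reload. */
static unsigned speculation_cost(unsigned body_cost, unsigned new_live_values,
                                 unsigned spare_regs, unsigned spill_cost)
{
        unsigned overflow = new_live_values > spare_regs
                          ? new_live_values - spare_regs : 0;
        return body_cost + overflow * spill_cost;  /* vs. miss penalty */
}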

>> * if we cannot transform all instructions from the basic block into
>> instr+cmov, then it is not worth applying this optimization (since we'll
>> have a branch anyway).
>
> Sure. It might be worthwhile to extend the if-converter to do this.
> The first step is to turn it on for x86 to convert blocks which are
> entirely moves.

That would mean applying this optimization on machine basic blocks, right?
I was thinking of a generic LLVM IR optimization pass, but maybe a
machine-basic-block pass is better, since we are doing something very
target specific.

>> Does Codegen guarantee to always transform Select into CMOV/SETCC (or
>> equivalent), on targets that support it?
>>    
>
> Right.
>  

Ok.

>> * memory ordering/multi-thread safety:
> That's something to worry about later. :-)
>  

Ok.

>> If we unconditionally read from slow_filter_table it could actually
>> reduce performance (assume slow_filter_table is huge), and it might not
>> be legal because we would violate short-circuit evaluation.
>
> Right, speculative loads would require much more analysis to ensure  
> safety. This might be something you don't want to do in step 1.
>  

Makes sense, first implementation shouldn't break too much ;)

>> (BTW, the branch predictor should say this branch is predictable, so we
>> wouldn't apply the transformation at all, but we cannot rely on the
>> branch predictor to provide short-circuit evaluation guarantees.)
>>
>> Thoughts?
>>    
>
> Adding branch prediction info is obviously useful even if speculation
> is not performed. There are always CFG optimization passes (and perhaps
> isel passes) that can make use of it. I am not sure whether speculation
> can actually have a positive impact on x86. It's something that requires
> some experimentation if you are interested.
>
> Evan
>  

I have a prototype of this optimization (without branch prediction), and
I can use it to experiment.
However, it only handles some very simple cases, and I should rewrite it
to be more generic.

Best regards,
--Edwin


Re: Introducing a branch optimization and prediction pass

Evan Cheng-2

On Mar 31, 2008, at 7:01 AM, Török Edwin wrote:

> Evan Cheng wrote:
>> On Mar 29, 2008, at 6:02 AM, Török Edwin wrote:
>>
>>
>>> Hi,
>>>
>>> I would like to transform unpredictable conditional branches with a
>>> small body into cmov instructions to reduce branch miss penalty.
>>> LLVM generates cmov/setcc instructions when SelectInst is used. The
>>> idea
>>> is to transform regular branches into selects, when that is possible
>>> and
>>> profitable.
>>>
>>
>> LLVM is already aggressively turning branches into selects as far as I
>> can see.
>
> Does this happen in Codegen?
> I tried some simple examples a while ago, and IIRC llvm missed some
> opportunities to use cmov.

In optimization passes. I'd need to see an example of the missed  
transformation unless you are referring to using cmov to implement  
speculative execution, which is something llvm doesn't do right now.


>
>
>>> However this needs branch prediction info, and it seems LLVM doesn't
>>> provide that, AFAICT.
>>> Could gcc's branch predictor be used here, or are there plans for an
>>> llvm-specific branch predictor?
>>
>> Sure, either the FE or some earlier profiling pass should provide some
>> branch prediction info.
>>
>
> Ok. At least one of the above should be implemented before this
> optimization can have a real use.
> However, for experimentation we could try to apply this optimization
> assuming each branch is unpredictable.
> At least we'll get an idea of the worst performance regression that
> could happen if the branch prediction info is wrong ;)

Sounds good.

>
>
>>
>> Ok. You want to speculatively execute the code and only write back
>> the result if the predicate is true, right?
>>
>
> Yes, I speculatively execute the code.
> However, the result is always written, because I cannot use 'cmov' to
> write to memory (just registers).
> So the solution proposed in the AMD manual is to use cmov to select the
> address to write to. If the predicate is false, the value is written to
> a dummy address on the stack:
>
> cmp    eax, ebx          ; a < b ?
> cmovge edi, esi          ; ptr = (a >= b) ? &dummy : &c[i]
> cmovl  ecx, edx          ; i = (a < b) ? i + 1 : i
> mov    [edi], eax        ; *ptr = a

Right.

>
>
>
> IIRC gcc-4.2 didn't generate cmov for this case.
>
> I'll write some simple examples, build with llvm, gcc-4.2, gcc-4.3, and
> icc, then measure speeds. Last time I did that on a simple example, and
> the result was 200 Mb/s vs. 1Gb/s.
> Then apply this optimization (either by hand or using a small prototype)
> and see if there are any significant improvements.

Sounds good.

>
>
>> If there are any instructions with side effects then this cannot be
>> done.
>>
>
> Agreed.
>
>>
>>> * determine if branch is statically/dynamically predictable:
>>>   * statically: if compiler can predict branch, don't apply this
>>> optimization
>>>   * dynamically: can branch be predicted based on execution history
>>> (i.e. can CPU predict it?)
>>>
>>
>> Based on profiling data from previous runs? Or dynamic profiling in
>> JIT situation?
>>
>
> Good idea. However, in the absence of profiling info there should be
> some heuristics; I am not sure what those could be ATM.

This can be done later. I think the first step should be to get the  
transformation piece ready.

>
>
>>
>>> * how to determine if a branch's body is small?
>>>   if execution_cost_of(body) +
>>> execution_cost_of(cmov)/branch_miss_probability < branch_miss_penalty,
>>>   or simply: execution_cost_of(body) < branch_miss_penalty
>>>   If it is bigger, then it is faster if we take the miss penalty (and
>>> don't execute the body)
>>>   But miss penalties and execution costs are target specific, so use
>>> TargetData here (and teach it about penalties), or
>>>   use some heuristic based on the number of instructions in the basic block?
>>
>> Sounds reasonable. But it's potentially more complicated than this for
>> a target like x86 that has very few registers. The speculatively
>> executed code will clobber registers, so it has significant cost if the
>> result of the computation is not written back.
>>
>
> I experimented on x86-64 a while ago, which doesn't have such a shortage
> of registers, and the results were promising.
> I have some code that processes data in a loop, but due to branch
> mispredictions it can "only" work at ~200 Mb/s.
> Eliminating the branches could speed it up to 1Gb/s or more, since there
> are no other penalties in that code (L1 data cache misses are rare; the
> only bottleneck is branch misprediction).
>
> I'll redo my tests on x86 too (see above), and maybe we can come up with
> an empirical formula to use for x86.
> Something like associating a cost with register-pressure increase?

I think, at least for x86 (32 bit), we should start by targeting very  
restricted cases, i.e. tiny basic blocks. Then you can add heuristics  
to compute the register pressure, etc.

>
>
>>> * if we cannot transform all instructions from the basic block into
>>> instr+cmov, then it is not worth applying this optimization (since we'll
>>> have a branch anyway).
>>
>> Sure. It might be worthwhile to extend the if-converter to do this.
>> The first step is to turn it on for x86 to convert blocks which are
>> entirely moves.
>
> That would mean applying this optimization on machine basic blocks, right?
> I was thinking of a generic LLVM IR optimization pass, but maybe a
> machine-basic-block pass is better, since we are doing something very
> target specific.

Right. I think you want to do this in codegen where you have target  
information. Of course, this means propagating branch prediction  
information through optimization passes. That can be a real pain.

Once that complexity is solved, making use of the if-converter should
be fairly easy. Right now, if-conversion predicates BBs which are
completely predicable. You can teach it speculation by merging blocks
and introducing merge blocks which are made of a number of move
instructions. This should be pretty trivial for triangle sub-cfg's.
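
(In C terms, a triangle is just one block branched around, falling
through to the join; my illustration:)

/* Triangle sub-CFG: entry conditionally skips the side block; both
 * paths meet at the join. Speculation turns the side block into
 * straight-line code plus a select at the join. */
static int triangle(int cond, int x)
{
        if (cond)          /* entry BB: conditional branch */
                x = x + 1; /* side BB: candidate for speculation */
        return x;          /* join BB */
}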

However, the current if-converter is a post-register allocation pass.  
That might not be right for your needs (since speculation will have to  
introduce new temporaries). One possibility is to add the speculation  
pass before register allocation, but leave the mov -> cmov job to the  
if-converter.

>
>
>>> Does Codegen guarantee to always transform Select into CMOV/SETCC (or
>>> equivalent), on targets that support it?
>>>
>>
>> Right.
>>
>
> Ok.
>
>>> * memory ordering/multi-thread safety:
>> That's something to worry about later. :-)
>>
>
> Ok.
>
>>> If we unconditionally read from slow_filter_table it could actually
>>> reduce performance (assume slow_filter_table is huge), and it might not
>>> be legal because we would violate short-circuit evaluation.
>>>
>>
>> Right, speculative loads would require much more analysis to ensure
>> safety. This might be something you don't want to do in step 1.
>>
>
> Makes sense, first implementation shouldn't break too much ;)
>
>>> (BTW, the branch predictor should say this branch is predictable, so we
>>> wouldn't apply the transformation at all, but we cannot rely on the
>>> branch predictor to provide short-circuit evaluation guarantees.)
>>>
>>> Thoughts?
>>>
>>
>> Adding branch prediction info is obviously useful even if speculation
>> is not performed. There are always CFG optimization passes (and perhaps
>> isel passes) that can make use of it. I am not sure whether speculation
>> can actually have a positive impact on x86. It's something that requires
>> some experimentation if you are interested.
>>
>> Evan
>>
>
> I have a prototype of this optimization (without branch prediction), and
> I can use it to experiment.
> However, it only handles some very simple cases, and I should rewrite it
> to be more generic.

Sounds good. Thanks.

Evan

>
>
> Best regards,
> --Edwin
>
>


Re: Introducing a branch optimization and prediction pass

Török Edwin
Evan Cheng wrote:
[snip]
>> Good idea. However in absence of profiling info there should be some
>> heuristics, I am not sure what that could be ATM.
>>    
>
> This can be done later. I think the first step should be to get the  
> transformation piece ready.
>  

Ok.

> I think, at least for x86 (32 bit), we should start by targeting very  
> restricted cases, i.e. tiny basic blocks. Then you can add heuristics  
> to compute the register pressure, etc.
>  

Ok.

>> That would mean applying this optimization on machine basic blocks, right?
>> I was thinking of a generic LLVM IR optimization pass, but maybe a
>> machine-basic-block pass is better, since we are doing something very
>> target specific.
>
> Right. I think you want to do this in codegen where you have target
> information. Of course, this means propagating branch prediction
> information through optimization passes. That can be a real pain.
>
> Once that complexity is solved, making use of the if-converter should
> be fairly easy. Right now, if-conversion predicates BBs which are
> completely predicable. You can teach it speculation by merging blocks
> and introducing merge blocks which are made of a number of move
> instructions. This should be pretty trivial for triangle sub-cfg's.
>
> However, the current if-converter is a post-register allocation pass.  
> That might not be right for your needs (since speculation will have to  
> introduce new temporaries). One possibility is to add the speculation  
> pass before register allocation, but leave the mov -> cmov job to the  
> if-converter.
>  
I'll have a look at the if-converter sometime this weekend.

>> I have a prototype of this optimization (without branch prediction), and
>> I can use it to experiment.
>> However, it only handles some very simple cases, and I should rewrite it
>> to be more generic.
>
> Sounds good. Thanks.
>
> Evan
I wrote some small tests; see the attachments for a gcc 4.2, gcc 4.3,
icc, llvm, and llvm+branchopt comparison.
I didn't try tweaking compiler flags much.

On x86-64:

+ ./loops_llvm
CPU user time: 76661, speed: 1304.444Mb/s
CPU user time: 119993, speed: 833.382Mb/s
CPU user time: 173322, speed: 576.961Mb/s
CPU user time: 133324, speed: 750.053Mb/s
CPU user time: 676623, speed: 147.793Mb/s
CPU user time: 776616, speed: 128.764Mb/s
CPU user time: 739952, speed: 135.144Mb/s
CPU user time: 903274, speed: 110.708Mb/s
+ ./loops_llvm_branchopt
CPU user time: 79995, speed: 1250.078Mb/s
CPU user time: 123325, speed: 810.866Mb/s
CPU user time: 169989, speed: 588.273Mb/s
CPU user time: 133325, speed: 750.047Mb/s
CPU user time: 679956, speed: 147.068Mb/s
CPU user time: 329978, speed: 303.051Mb/s
CPU user time: 543298, speed: 184.061Mb/s
CPU user time: 596628, speed: 167.609Mb/s

On x86-32:

+ ./loops_llvm
CPU user time: 96661, speed: 1034.543Mb/s
CPU user time: 129991, speed: 769.284Mb/s
CPU user time: 256650, speed: 389.636Mb/s
CPU user time: 163323, speed: 612.284Mb/s
CPU user time: 739951, speed: 135.144Mb/s
CPU user time: 829946, speed: 120.490Mb/s
CPU user time: 803281, speed: 124.489Mb/s
CPU user time: 1086596, speed: 92.031Mb/s
+ ./loops_llvm_branchopt
CPU user time: 103326, speed: 967.811Mb/s
CPU user time: 129992, speed: 769.278Mb/s
CPU user time: 256650, speed: 389.636Mb/s
CPU user time: 166656, speed: 600.038Mb/s
CPU user time: 739951, speed: 135.144Mb/s
CPU user time: 593295, speed: 168.550Mb/s
CPU user time: 939939, speed: 106.390Mb/s
CPU user time: 1066597, speed: 93.756Mb/s

It benefits x86-64 on the last 3 tests, while it benefits x86-32 only on
1 test, and hurts another one.

I haven't looked at the generated assembly yet; that is the next step.
However, tests 4 and 5 should have the same runtime, yet they differ
hugely. I'll file bugs.

Best regards,
--Edwin

+ ./loops_gcc43
CPU user time: 76661, speed: 1304.444Mb/s
CPU user time: 93328, speed: 1071.490Mb/s
CPU user time: 163322, speed: 612.287Mb/s
CPU user time: 96661, speed: 1034.543Mb/s
CPU user time: 663290, speed: 150.764Mb/s
CPU user time: 279981, speed: 357.167Mb/s
CPU user time: 766617, speed: 130.443Mb/s
CPU user time: 773283, speed: 129.319Mb/s
+ ./loops_gcc43_unroll
CPU user time: 76661, speed: 1304.444Mb/s
CPU user time: 93327, speed: 1071.501Mb/s
CPU user time: 119993, speed: 833.382Mb/s
CPU user time: 96660, speed: 1034.554Mb/s
CPU user time: 676623, speed: 147.793Mb/s
CPU user time: 279981, speed: 357.167Mb/s
CPU user time: 729952, speed: 136.995Mb/s
CPU user time: 726620, speed: 137.624Mb/s
+ ./loops_gcc42
CPU user time: 76662, speed: 1304.427Mb/s
CPU user time: 156656, speed: 638.341Mb/s
CPU user time: 119992, speed: 833.389Mb/s
CPU user time: 163323, speed: 612.284Mb/s
CPU user time: 666623, speed: 150.010Mb/s
CPU user time: 339978, speed: 294.137Mb/s
CPU user time: 783282, speed: 127.668Mb/s
CPU user time: 813280, speed: 122.959Mb/s
+ ./loops_icc
CPU user time: 76662, speed: 1304.427Mb/s
CPU user time: 219985, speed: 454.576Mb/s
CPU user time: 159990, speed: 625.039Mb/s
CPU user time: 93327, speed: 1071.501Mb/s
CPU user time: 273316, speed: 365.877Mb/s
CPU user time: 283314, speed: 352.965Mb/s
CPU user time: 716620, speed: 139.544Mb/s
CPU user time: 819947, speed: 121.959Mb/s
+ ./loops_llvm
CPU user time: 79995, speed: 1250.078Mb/s
CPU user time: 119992, speed: 833.389Mb/s
CPU user time: 169989, speed: 588.273Mb/s
CPU user time: 129991, speed: 769.284Mb/s
CPU user time: 693288, speed: 144.240Mb/s
CPU user time: 766617, speed: 130.443Mb/s
CPU user time: 736618, speed: 135.756Mb/s
CPU user time: 909941, speed: 109.897Mb/s
+ ./loops_llvm_branchopt
CPU user time: 76661, speed: 1304.444Mb/s
CPU user time: 123325, speed: 810.866Mb/s
CPU user time: 173322, speed: 576.961Mb/s
CPU user time: 136658, speed: 731.754Mb/s
CPU user time: 683289, speed: 146.351Mb/s
CPU user time: 339978, speed: 294.137Mb/s
CPU user time: 543298, speed: 184.061Mb/s
CPU user time: 589961, speed: 169.503Mb/s

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <string.h>

typedef int (*testfunc)(const unsigned char *buf, size_t len, unsigned char *dst);

static int copy_memcpy(const unsigned char *buf, size_t len, unsigned char *dst)
{
        memcpy(dst, buf, len);
        return 0;
}

static int copy(const unsigned char *buf, size_t len, unsigned char *dst)
{
        while(len-- > 0) {
                *dst++ = *buf++;
        }
        return 0;
}

static int sum(const unsigned char *buf, size_t len, unsigned char *dst)
{
        size_t i, s=0;
        for(i=0;i<len;i++) {
                if(buf[i] > 0x80) {
                        s++;
                }
        }
        return s;
}

static int reset8(const unsigned char *buf, size_t len, unsigned char *dst)
{
        size_t i;
        for(i=0;i<len;i++) {
                /* not really for cmov, but this can be simply written as:
                 * dst[i] = buf[i] & 0x7f
                 * no need for cmov here, and none should be used*/
                if(buf[i] > 0x80) {
                        dst[i] = buf[i] & 0x7f;
                } else {
                        dst[i] = buf[i];
                }
        }
        return 0;
}

static int reset8_smart(const unsigned char *buf, size_t len, unsigned char *dst)
{
        size_t i;
        for(i=0;i<len;i++) {
                dst[i] = buf[i] & 0x7f;
        }
        return 0;
}

enum normalize_action {
        NORMALIZE_COPY,
        NORMALIZE_SKIP,
        NORMALIZE_AS_WHITESPACE,
        NORMALIZE_ADD_32
};

/* use shorter names in the table */
#define IGN NORMALIZE_SKIP
#define WSP NORMALIZE_AS_WHITESPACE
#define A32 NORMALIZE_ADD_32
#define NOP NORMALIZE_COPY

static const enum normalize_action char_action[256] = {
        IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, WSP, WSP, WSP, WSP, WSP, IGN, IGN,
        IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN,
        WSP, NOP, NOP, NOP, NOP, NOP, NOP, NOP, NOP, NOP, NOP, NOP, NOP, NOP, NOP, NOP,/* 0x20 - 0x2f */
        NOP, NOP, NOP, NOP, NOP, NOP, NOP, NOP, NOP, NOP, NOP, NOP, NOP, NOP, NOP, NOP,
        NOP, A32, A32, A32, A32, A32, A32, A32, A32, A32, A32, A32, A32, A32, A32, A32,
        A32, A32, A32, A32, A32, A32, A32, A32, A32, A32, A32, NOP, NOP, NOP, NOP, NOP,
        NOP, NOP, NOP, NOP, NOP, NOP, NOP, NOP, NOP, NOP, NOP, NOP, NOP, NOP, NOP, NOP,
        NOP, NOP, NOP, NOP, NOP, NOP, NOP, NOP, NOP, NOP, NOP, NOP, NOP, NOP, NOP, NOP,/* 0x70 - 0x7f */
        IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN,
        IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN,
        IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN,
        IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN,
        IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN,
        IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN,
        IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN,
        IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN, IGN
};

static int copy_skip(const unsigned char *buf, size_t len, unsigned char *dst)
{
        size_t i;
        for(i=0;i < len;i++) {
                unsigned char c = buf[i];
                if(char_action[c] != NORMALIZE_SKIP) {
                        *dst++ = c;
                }
        }
        return 0;
}

/* branch-free variant of copy_skip: conditional write through a dummy
 * destination, as in the AMD guide's Conditional Write example [2] */
static int copy_skip2(const unsigned char *buf, size_t len, unsigned char *dst)
{
        size_t i;
        unsigned char dummy;
        for(i=0;i < len;i++) {
                unsigned char c = buf[i];
                bool cond = (char_action[c] != NORMALIZE_SKIP);
                unsigned char *d = cond ? dst : &dummy;
                dst = cond ? dst+1 : dst;
                *d = c;
        }
        return 0;
}

static int copy_skip3(const unsigned char *buf, size_t len, unsigned char *dst)
{
        size_t i;
        for(i=0;i<len;i++) {
                unsigned char c = buf[i];
                *dst = c;
                dst = char_action[c] != NORMALIZE_SKIP ? dst + 1 : dst;
        }
        return 0;
}

#define DIFFT(a,b) (((a).tv_sec - (b).tv_sec)*1000000 + ((a).tv_usec - (b).tv_usec))
#define NLOOPS 10
#define TESTSIZE 10*1024*1024

static int measure(testfunc func, const unsigned char *src, size_t len, unsigned char *dst)
{
        struct rusage r0,r1;
        size_t i, x=0;
        getrusage(RUSAGE_SELF, &r0);
        for(i=0; i < NLOOPS; i++) {
                x += func(src, len, dst);
        }
        getrusage(RUSAGE_SELF, &r1);
        ssize_t d = DIFFT(r1.ru_stime, r0.ru_stime) + DIFFT(r1.ru_utime,  r0.ru_utime);
        printf("CPU user time: %ld, speed: %.3fMb/s\n", d, 1000000.0/(1024*1024)* TESTSIZE*NLOOPS / d);
        return x;
}

static testfunc testlist[] = {
        copy_memcpy,
        copy,
        sum,
        reset8_smart,
        reset8,
        copy_skip3,
        copy_skip,
        copy_skip2
};


static void generate(unsigned char *dst, size_t len)
{
        size_t i;
        srand(10);/* we need a reproducible testcase */
        for(i=0;i < len; i++) {
                dst[i] = (256 * (rand() / (RAND_MAX + 1.0)));
        }
}

int main()
{
        unsigned char *src = malloc(TESTSIZE);
        unsigned char *dst = malloc(TESTSIZE);
        size_t i, x=0;
        if(!src || !dst) {
                fprintf(stderr,"OOM\n");
                return 1;
        }
        generate(src, TESTSIZE);
        for(i=0;i<sizeof(testlist)/sizeof(testlist[0]); i++) {
                x += measure(testlist[i], src, TESTSIZE, dst);
        }
        free(src);
        free(dst);
        return x;
}


+ ./loops_gcc43
CPU user time: 99993, speed: 1000.070Mb/s
CPU user time: 139991, speed: 714.332Mb/s
CPU user time: 249984, speed: 400.026Mb/s
CPU user time: 193321, speed: 517.274Mb/s
CPU user time: 739952, speed: 135.144Mb/s
CPU user time: 593294, speed: 168.550Mb/s
CPU user time: 936606, speed: 106.768Mb/s
CPU user time: 989935, speed: 101.017Mb/s
+ ./loops_gcc43_unroll
CPU user time: 99993, speed: 1000.070Mb/s
CPU user time: 106659, speed: 937.567Mb/s
CPU user time: 209987, speed: 476.220Mb/s
CPU user time: 123325, speed: 810.866Mb/s
CPU user time: 669956, speed: 149.264Mb/s
CPU user time: 589962, speed: 169.502Mb/s
CPU user time: 806615, speed: 123.975Mb/s
CPU user time: 1003267, speed: 99.674Mb/s
+ ./loops_gcc42
CPU user time: 93327, speed: 1071.501Mb/s
CPU user time: 136658, speed: 731.754Mb/s
CPU user time: 223319, speed: 447.790Mb/s
CPU user time: 196653, speed: 508.510Mb/s
CPU user time: 746618, speed: 133.937Mb/s
CPU user time: 643292, speed: 155.450Mb/s
CPU user time: 866610, speed: 115.392Mb/s
CPU user time: 959937, speed: 104.174Mb/s
+ ./loops_icc
CPU user time: 83328, speed: 1200.077Mb/s
CPU user time: 196654, speed: 508.507Mb/s
CPU user time: 119992, speed: 833.389Mb/s
CPU user time: 96660, speed: 1034.554Mb/s
CPU user time: 276649, speed: 361.469Mb/s
CPU user time: 336645, speed: 297.049Mb/s
CPU user time: 779949, speed: 128.214Mb/s
CPU user time: 1066597, speed: 93.756Mb/s
+ ./loops_llvm
CPU user time: 96661, speed: 1034.543Mb/s
CPU user time: 129991, speed: 769.284Mb/s
CPU user time: 256650, speed: 389.636Mb/s
CPU user time: 163323, speed: 612.284Mb/s
CPU user time: 739951, speed: 135.144Mb/s
CPU user time: 829946, speed: 120.490Mb/s
CPU user time: 803281, speed: 124.489Mb/s
CPU user time: 1086596, speed: 92.031Mb/s
+ ./loops_llvm_branchopt
CPU user time: 103326, speed: 967.811Mb/s
CPU user time: 129992, speed: 769.278Mb/s
CPU user time: 256650, speed: 389.636Mb/s
CPU user time: 166656, speed: 600.038Mb/s
CPU user time: 739951, speed: 135.144Mb/s
CPU user time: 593295, speed: 168.550Mb/s
CPU user time: 939939, speed: 106.390Mb/s
CPU user time: 1066597, speed: 93.756Mb/s
