[llvm-dev] [MachineScheduler] Question about IssueWidth / NumMicroOps


Muhui Jiang via llvm-dev
Hi,

I would like to ask what IssueWidth and NumMicroOps refer to in
MachineScheduler, just to be 100% sure what the intent is.
Are we modeling the decoder phase or the execution stage?

Background:

First of all, there seem to be different meanings of "issue" depending
on which platform you're on:

https://stackoverflow.com/questions/23219685/what-is-the-meaning-of-instruction-dispatch:
"... "Dispatch in this sense means either the sending of an instruction
to a queue in preparation to be scheduled in an out-of-order
processor (IBM's use; Intel calls this issue) or sending the instruction
to the functional unit for execution (Intel's use; IBM calls this issue)..."

So "issue" could mean either of
(1) "the sending of an instruction to a queue in preparation to be
scheduled in an out-of-order processor"
(2) "sending the instruction to the functional unit for execution"

I would hope to be right in thinking that under (1) IssueWidth would
relate to the decoding capacity, while under (2) it would reflect the
execution capacity per cycle.

There is this comment in TargetSchedule.td:

// Use BufferSize = 0 for resources that force "dispatch/issue
// groups". (Different processors define dispatch/issue
// differently. Here we refer to the stage between decoding into micro-ops
// and moving them into a reservation station.) Normally NumMicroOps
// is sufficient to limit dispatch/issue groups. However, some
// processors can form groups with only certain combinations of
// instruction types. e.g. POWER7.

This seems to say that in MachineScheduler, (1) is in effect, right?

Furthermore, I see

def SkylakeServerModel : SchedMachineModel {
// All x86 instructions are modeled as a single micro-op, and Skylake can
// decode 6 instructions per cycle.
    let IssueWidth = 6;

def BroadwellModel : SchedMachineModel {
// All x86 instructions are modeled as a single micro-op, and HW can decode 4
// instructions per cycle.
    let IssueWidth = 4;

def SandyBridgeModel : SchedMachineModel {
// All x86 instructions are modeled as a single micro-op, and SB can decode 4
// instructions per cycle.
// FIXME: Identify instructions that aren't a single fused micro-op.
    let IssueWidth = 4;

, which also seem to indicate (1).

What's more, I see that checkHazard() returns true if '(CurrMOps + uops
> SchedModel->getIssueWidth())'.
This means that the SU will be put in Pending instead of Available based
on the number of micro-ops it uses.
To me this seems like an in-order decoding hazard check, since an OOO
machine will rearrange the micro-ops during execution, so there is not
much use in checking the combined execution capacity of the current SU
candidate and the immediately previously scheduled ones here. So again
I would say (1). (Checking for decoder groups pre-RA does, by the way,
not make much sense on SystemZ, but that's another question.)

checkHazard() also returns a hazard if

     (CurrMOps > 0 &&
       ((isTop() && SchedModel->mustBeginGroup(SU->getInstr())) ||
        (!isTop() && SchedModel->mustEndGroup(SU->getInstr()))))

, which along the same lines makes me think that this is intended for
instruction stream management, or (1).

There is also the fact that

IsResourceLimited =
    checkResourceLimit(SchedModel->getLatencyFactor(), getCriticalCount(),
                       getScheduledLatency());

, which is admittedly hard for me to grasp, but it seems that the
scheduled latency (std::max(ExpectedLatency, CurrCycle)) affects the
resource heuristic so that it becomes active if the scheduled latency
is low enough. This then means that CurrCycle actually affects when
resource balancing goes into action, and CurrCycle in turn is advanced
when NumMicroOps reaches the IssueWidth. So somehow it all depends on
modelling the instructions to fill up the IssueWidth with their
micro-ops. This could actually be either

* Decoder cycles: NumDecoderSlots(SU) => SU->NumMicroOps and
DecoderCapacity => IssueWidth  (1)

or

* Execution cycles: NumExecutedUOps(SU) => SU->NumMicroOps and
ApproxMaxExecutedUOpsPerCycle => IssueWidth  (2)

They would, at least in this context, be somewhat equivalent in driving
CurrCycle forward.

Please, let me know about (1) or (2)  :-)

thanks

/Jonas


_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] [MachineScheduler] Question about IssueWidth / NumMicroOps



> On May 9, 2018, at 9:43 AM, Jonas Paulsson <[hidden email]> wrote:
>
> [full original message quoted above; snipped]

I'll first try to frame your question with the background philosophy, then give you my take, but other feedback and discussion is welcome.

The LLVM machine model is an abstract machine. A real micro-architecture can have any number of buffers, queues, and stages. Declaring that a given machine-independent abstract property corresponds to a specific physical property across all subtargets can't be done. That said, target maintainers still need to know how to relate the abstract to the physical. The target maintainer can then extend the abstract model with their own machine specific resources.

The abstract pipeline is built around the notion of an "issue point". This is merely a reference point for counting machine cycles. The primary goal of the scheduler is to simply know when enough "time" has passed between scheduling dependent instructions.

The physical machine will have pipeline stages that delay execution. The scheduler does not model those delays because they are irrelevant as long as they are consistent. Inaccuracies arise when instructions have different execution delays relative to each other, in addition to their intrinsic latency. To model those delays, the abstract model has various tools like ReadAdvance (bypassing) and the ability to extend the model with arbitrary "resources" and associate a cycle count with those resources for each instruction. (One tool currently missing is the ability to add a delay to ResourceCycles, but that would be easy to add).

Now we come to out-of-order execution, or, more generally, instruction buffers. Part of the CPU pipeline is always in-order. The issue point, which is the point of reference for counting cycles, only makes sense as an in-order part of the pipeline. Other parts of the pipeline are sometimes falling behind and sometimes catching up. It's only interesting to model those other, decoupled parts of the pipeline if they may be predictably resource constrained in a way that the scheduler can exploit.

The LLVM machine model distinguishes between in-order constraints and out-of-order constraints so that the target's scheduling strategy can apply appropriate heuristics. For a well-balanced CPU pipeline, out-of-order resources would not typically be treated as a hard scheduling constraint. For example, in the GenericScheduler, a delay caused by limited out-of-order resources is not directly reflected in the number of cycles that the scheduler sees between issuing an instruction and its dependent instructions. In other words, out-of-order resources don't directly increase the latency between pairs of instructions. However, they can still be used to detect potential bottlenecks across a sequence of instructions and bias the scheduling heuristics appropriately.

IssueWidth is meant to be a hard in-order constraint (we sometimes call this kind of constraint a "hazard"). In the GenericScheduler strategy, no more than IssueWidth micro-ops can ever be scheduled in a particular cycle. So, if an instruction sequence has enough ILP to exceed IssueWidth, that will immediately increase the currently scheduling cycle, and effectively bring dependent instructions into the ready queue earlier.

In practice, I think IssueWidth is useful to model the bottleneck between the decoder (after micro-op expansion) and the out-of-order reservation stations. If the total number of reservation stations is also a bottleneck, or if any other pipeline stage has a bandwidth limitation, then that can be naturally modeled by adding an out-of-order processor resource.

> I would hope to be right when I think that IssueWidth (1) would relate to the decoding
> capacity, while (2) would reflect the executional capacity per cycle.

I don't think IssueWidth necessarily has anything to do with instruction decoding or the execution capacity of functional units. I will say that we expect the decoding capacity to "keep up with" the issue width. If the IssueWidth property also serves that purpose for you, I think that's fine. In the case of the x86 machine models above, since each instruction is a micro-op, I don't see any useful distinction between decode bandwidth and in-order issue of micro-ops.

Some target maintainers may want to schedule for an OOO machine as if it were in-order. They are welcome to do that (and hopefully have plenty of architectural registers). The scheduling mode can be broadly selected with the infamous MicroOpBufferSize setting, or individual resources can be marked in-order with BufferSize = 0. And, as always, I suggest writing your own scheduling strategy if you care that deeply about scheduling for the peculiarities of your machine.

(caveat: there may still be GenericScheduler implementation deficiencies because it is trying to support more scheduling features than we have in-tree targets).

Sorry, I don't have time to draw diagrams and tables. Hopefully you can make sense of my long-form rambling.

Thanks for the question.

-Andy

Re: [llvm-dev] [MachineScheduler] Question about IssueWidth / NumMicroOps

Hi Andrew,

Thank you very much for the most helpful explanations! Many things could
go in as comments, if you ask me - for example:

---
> The LLVM machine model is an abstract machine.

> The abstract pipeline is built around the notion of an "issue point". This is merely a reference point for counting machine cycles.
>
>
> IssueWidth is meant to be a hard in-order constraint (we sometimes call this kind of constraint a "hazard"). In the GenericScheduler strategy, no more than IssueWidth micro-ops can ever be scheduled in a particular cycle.
>
> In practice, IssueWidth is useful to model the bottleneck between the decoder (after micro-op expansion) and the out-of-order reservation stations. If the total number of reservation stations is also a bottleneck, or if any other pipeline stage has a bandwidth limitation, then that can be naturally modeled by adding an out-of-order processor resource.

---

> I don't think IssueWidth necessarily has anything to do with instruction decoding or the execution capacity of functional units. I will say that we expect the decoding capacity to "keep up with" the issue width. If the IssueWidth property also serves that purpose for you, I think that's fine. In the case of the x86 machine models above, since each instruction is a micro-op, I don't see any useful distinction between decode bandwidth and in-order issue of micro-ops.
I think this is mostly true for SystemZ also since the majority of
instructions are basically a single micro-op.

>
> (caveat: there may still be GenericScheduler implementation deficiencies because it is trying to support more scheduling features than we have in-tree targets).
Right now it seems that BeginGroup/EndGroup is only used by SystemZ,
right? I see they are used in checkHazard(), which I actually don't see
as helpful during pre-RA scheduling for SystemZ. Could this be made
optional, or perhaps only done post-RA if the target does post-RA
scheduling? SystemZ does post-RA scheduling to manage decoder grouping,
which is where BeginGroup/EndGroup and IssueWidth/NumMicroOps are
useful. However, doing this pre-RA and thereby limiting the freedom of
other heuristics (making fewer instructions available) seems like a bad idea.

> Sorry, I don't have time to draw diagrams and tables. Hopefully you can make sense of my long-form rambling.
Yes, very helpful to me :-)

Thanks again,

Jonas


Re: [llvm-dev] [MachineScheduler] Question about IssueWidth / NumMicroOps



On May 14, 2018, at 11:10 AM, Jonas Paulsson <[hidden email]> wrote:

> Hi Andrew,
>
> Thank you very much for the most helpful explanations! Many things could go in as comments, if you ask me - for example:
>
> ---
>> The LLVM machine model is an abstract machine.
>>
>> The abstract pipeline is built around the notion of an "issue point". This is merely a reference point for counting machine cycles.
>>
>> IssueWidth is meant to be a hard in-order constraint (we sometimes call this kind of constraint a "hazard"). In the GenericScheduler strategy, no more than IssueWidth micro-ops can ever be scheduled in a particular cycle.
>>
>> In practice, IssueWidth is useful to model the bottleneck between the decoder (after micro-op expansion) and the out-of-order reservation stations. If the total number of reservation stations is also a bottleneck, or if any other pipeline stage has a bandwidth limitation, then that can be naturally modeled by adding an out-of-order processor resource.

https://reviews.llvm.org/D46841

>> I don't think IssueWidth necessarily has anything to do with instruction decoding or the execution capacity of functional units. I will say that we expect the decoding capacity to "keep up with" the issue width. If the IssueWidth property also serves that purpose for you, I think that's fine. In the case of the x86 machine models above, since each instruction is a micro-op, I don't see any useful distinction between decode bandwidth and in-order issue of micro-ops.
> I think this is mostly true for SystemZ also since the majority of instructions are basically a single micro-op.
>
>> (caveat: there may still be GenericScheduler implementation deficiencies because it is trying to support more scheduling features than we have in-tree targets).
> Right now it seems that BeginGroup/EndGroup is only used by SystemZ, right? I see they are used in checkHazard(), which I actually don't see as helpful during pre-RA scheduling for SystemZ. Could this be made optional, or perhaps only done post-RA if the target does post-RA scheduling? SystemZ does post-RA scheduling to manage decoder grouping, which is where BeginGroup/EndGroup and IssueWidth/NumMicroOps are useful. However, doing this pre-RA and thereby limiting the freedom of other heuristics (making fewer instructions available) seems like a bad idea.

I've worked on a few CPUs in the past that had issue group restrictions. It seems like a natural way to handle special kinds of instructions. But I'm not aware of any LLVM backend that depends on it for pre-RA scheduling. If they are, hopefully they're reading this and will speak up.

My thinking a few years back was that targets would only run post-RA scheduling in rare cases and only for blocks with spill code, as a spill-fixup pass. That's not what you, and probably others are doing, so if you want to make those Begin/EndGroup post-RA specific, it's fine with me. Or you could be more ambitious and introduce the concept of a post-RA specific processor resource.

-Andy


>> Sorry, I don't have time to draw diagrams and tables. Hopefully you can make sense of my long-form rambling.
> Yes, very helpful to me :-)
>
> Thanks again,
>
> Jonas




Re: [llvm-dev] [MachineScheduler] Question about IssueWidth / NumMicroOps

Hi Andy,

>> Right now it seems that BeginGroup/EndGroup is only used by SystemZ,
>> right? I see they are used in checkHazard(), which I actually don't see
>> as helpful during pre-RA scheduling for SystemZ. Could this be made
>> optional, or perhaps only done post-RA if the target does post-RA
>> scheduling? SystemZ does post-RA scheduling to manage decoder
>> grouping, which is where BeginGroup/EndGroup and
>> IssueWidth/NumMicroOps are useful. However, doing this pre-RA and
>> thereby limiting the freedom of other heuristics (making fewer
>> instructions available) seems like a bad idea.
>
> I've worked on a few cpus in the past that had issue group
> restrictions. It seems like a natural way to handle special kinds of
> instructions. But I'm not aware of any LLVM backend that depends on it
> for preRA scheduling. If they are, hopefully they're reading this and
> will speak up.
>
> My thinking a few years back was that targets would only run post-RA
> scheduling in rare cases and only for blocks with spill code, as a
> spill-fixup pass. That's not what you, and probably others are doing,
> so if you want to make those Begin/EndGroup post-RA specific, it's
> fine with me. Or you could be more ambitious and introduce the concept
> of a post-RA specific processor resource.
>
This patch, https://reviews.llvm.org/D46870, puts those checks under a
post-RA flag, and also covers NumUOps / IssueWidth the same way.

/Jonas

> -Andy
>
>>
>>> Sorry, I don't have time to draw diagrams and tables. Hopefully you
>>> can make sense of my long-form rambling.
>> Yes, very helpful to me :-)
>>
>> Thanks again,
>>
>> Jonas
>>
>


Re: [llvm-dev] [MachineScheduler] Question about IssueWidth / NumMicroOps


Hi Jonas:


The 'Single Issue' behavior, wherein an otherwise dual-issue machine can only single-issue certain instructions, relies on BeginGroup/EndGroup, and this change might affect that.


- Javed


From: Jonas Paulsson <[hidden email]>
Sent: 15 May 2018 12:39:17
To: Andrew Trick
Cc: llvm-dev; Javed Absar; Florian Hahn; Matthias Braun; Hal Finkel; Ulrich Weigand
Subject: Re: [MachineScheduler] Question about IssueWidth / NumMicroOps
 
[quoted text of the previous message snipped]
