[llvm-dev] RFC: System (cache, etc.) model for LLVM

David Jones via llvm-dev
Hi everyone,

During the loop optimization roundtable at this year's Bay Area
conference, Hal noted that even the most basic of cost modeling is not
possible due to the lack of cache models within LLVM.  This topic has
been raised many times in the past.  Cray is now in a position where we
can contribute the system model we've been using successfully in our
compiler for many years.  This RFC describes the current design with a
few changes to update it to existing ToT interfaces.

We anticipate lots of feedback and suggested changes.  This RFC is a
strawman proposal to get the conversation started.  We would not try
to submit this all at once but would rather do so via incremental
patches.  It seems most helpful to everyone to hash out a design ahead
of time rather than submitting patches without a clear understanding
of where things are going.

============================
Target system model for LLVM
============================

Abstract
========
LLVM currently has an ad-hoc method for specifying target system
parameters such as cache characteristics and software prefetching
configurations, with ``TargetTransformInfo`` interfaces to get at
them.  The following proposes to abstract such parameters into a
general target system model described by TableGen classes and easily
specified by target maintainers.

Design goals
============
- Describe target cache architectures sufficiently to allow loop-based
  cache optimization

- Describe target-specific prefetch heuristics

- Abstract target hardware memory buffers to drive streaming
  heuristics (use of non-temporal operations, etc.)

- Make the models relatively easy to implement for multiple targets

API design
==========
Six classes describe the system model, each of which is detailed
below.  The proposal elides implementation details, concentrating
instead on the developer API.  As we refine this and post patches
we'll of course have plenty of opportunity to discuss implementation.

- ``TargetCacheLevelInfo``
- ``TargetSoftwarePrefetcherInfo``
- ``TargetStreamBufferInfo``
- ``TargetExecutionResourceInfo``
- ``TargetExecutionEngineInfo``
- ``TargetMemorySystemInfo``

Existing interfaces in ``TargetTransformInfo`` (``getCacheLineSize``,
etc.) can be directed to the interfaces below.  Because the model is
expressed as TableGen classes (explained below), the existing
hard-coded values in e.g. ``AArch64Subtarget::initializeProperties``
can be "lifted" into a (hopefully) more flexible and
easier-to-maintain framework.

Memory system model
-------------------

``TargetCacheLevelInfo`` provides a view of a single level of cache.  It
has the following interface::

  /// TargetCacheLevelInfo - Provide information about a specific level
  /// in the cache (size, associativity, etc.).
  ///
  class TargetCacheLevelInfo {
    [...]

  public:
    /// getID() - Return the cache level ID number.
    ///
    unsigned getID() const;

    /// getName() - Return the cache level name for debugging.
    ///
    const char *getName() const;

    /// getSize - Return the size of the cache level in bytes.
    ///
    unsigned getSize() const;

    /// getLineSize - Return the size of the cache line in bytes.
    ///
    unsigned getLineSize() const;

    /// getWays - Return the number of ways.
    ///
    unsigned getWays() const;

    /// getLatency - Return the expected latency of a load that hits in
    /// this level of cache.
    ///
    unsigned getLatency() const;
  };
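
For example, the number of sets in a level can be derived from these
quantities.  A minimal sketch, assuming only the interfaces above::

  // Sketch: derive the number of sets in a cache level from its total
  // size, line size and associativity (sets = size / (linesize * ways)).
  unsigned getNumSets(const TargetCacheLevelInfo &Level) {
    return Level.getSize() / (Level.getLineSize() * Level.getWays());
  }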

``TargetSoftwarePrefetcherInfo`` holds target-specific prefetching
heuristics::

  /// TargetSoftwarePrefetcherInfo - Provide information about how to
  /// configure the software prefetcher.
  ///
  class TargetSoftwarePrefetcherInfo {
    [...]
  public:
    /// getID() - Return the prefetcher ID number.
    ///
    unsigned getID() const;

    /// getName() - Return the prefetcher name for debugging.
    ///
    const char *getName() const;

    /// Should we do software prefetching at all?
    ///
    bool isEnabled() const;

    /// Provide a general prefetch distance hint.
    ///
    unsigned getDistance() const;

    /// Prefetch at least this far ahead.
    ///
    unsigned getMinDistance() const;

    /// Prefetch at most this far ahead.
    ///
    unsigned getMaxDistance() const;
  };

``get*Distance`` APIs provide general hints to guide the software
prefetcher.  The software prefetcher may choose to ignore them.
getMinDistance and getMaxDistance act as clamps to ensure the software
prefetcher doesn't do something wholly inappropriate.

Distances are specified in terms of cache lines.  The current
``TargetTransformInfo`` interfaces speak in terms of instructions or
iterations ahead.  Both can be useful and so we may want to add
iteration and/or instruction distances to this interface.
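
For illustration, here is a minimal sketch of how a prefetching pass
might combine these hints (``choosePrefetchDistance`` is a hypothetical
helper, and a max distance of 0 is assumed to mean "no upper limit")::

  #include <algorithm>

  // Hypothetical helper: clamp a pass-computed prefetch distance (in
  // cache lines) to the target's hinted range.
  unsigned choosePrefetchDistance(const TargetSoftwarePrefetcherInfo &P,
                                  unsigned Computed) {
    if (!P.isEnabled())
      return 0;                            // no software prefetching at all
    unsigned D = Computed ? Computed : P.getDistance();
    D = std::max(D, P.getMinDistance());   // prefetch at least this far ahead
    if (P.getMaxDistance())
      D = std::min(D, P.getMaxDistance()); // prefetch at most this far ahead
    return D;
  }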

``TargetStreamBufferInfo`` describes target hardware for memory
streaming::

  class TargetStreamBufferInfo {
    [...]

  public:
    /// getID() - Return the stream buffer ID number.
    ///
    unsigned getID() const;

    /// getName() - Return the stream buffer name for debugging.
    ///
    const char *getName() const;

    /// getNumLoadBuffers - Return the number of load buffers available.
    /// This is the number of simultaneously active independent load
    /// streams the processor can handle before degrading performance.
    ///
    int getNumLoadBuffers() const;

    /// getMaxNumLoadBuffers - Return the maximum number of load
    /// streams that may be active before shutting off streaming
    /// entirely.  -1 => no limit.
    ///
    int getMaxNumLoadBuffers() const;

    /// getNumStoreBuffers - Return the effective number of store
    /// buffers available.  This is the number of simultaneously
    /// active independent store streams the processor can handle
    /// before degrading performance.
    ///
    int getNumStoreBuffers() const;

    /// getMaxNumStoreBuffers - Return the maximum number of store
    /// streams that may be active before shutting off streaming
    /// entirely.  -1 => no limit.
    ///
    int getMaxNumStoreBuffers() const;

    /// getNumLoadStoreBuffers - Return the effective number of
    /// buffers available for streams that both load and store data.
    /// This is the number of simultaneously active independent
    /// load-store streams the processor can handle before degrading
    /// performance.
    ///
    int getNumLoadStoreBuffers() const;

    /// getMaxNumLoadStoreBuffers - Return the maximum number of
    /// load-store streams that may be active before shutting off
    /// streaming entirely.  -1 => no limit.
    ///
    int getMaxNumLoadStoreBuffers() const;
  };

Code uses the ``getMax*Buffers`` APIs to judge whether streaming
should be done at all.  For example, if the number of streams in the
code greatly outweighs the hardware buffers available, it makes little
sense to do streaming.  Performance will be dominated by the streams
that don't make use of the hardware, and the streams that do make use
of the hardware may actually perform worse.
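
As a sketch of that judgement (``shouldUseStreaming`` is a hypothetical
helper, and the estimated stream count is assumed to come from loop
analysis)::

  // Hypothetical helper: only use non-temporal operations if the
  // estimated number of independent store streams does not exceed the
  // hardware limit (-1 => no limit).
  bool shouldUseStreaming(const TargetStreamBufferInfo &SB,
                          int EstimatedStoreStreams) {
    int Max = SB.getMaxNumStoreBuffers();
    return Max < 0 || EstimatedStoreStreams <= Max;
  }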

Class ``TargetMemorySystemInfo`` aggregates all of the information
about the target memory system::

  /// TargetMemorySystemInfo - We assume that the target defines a
  /// static array of TargetCacheLevelInfo objects that represent all of
  /// the cache levels that the target has.  As such, we simply have
  /// to track a pointer to this array.
  ///
  class TargetMemorySystemInfo {
    [...]
  public:
    typedef ... cachelevel_iterator;

    /// getID() - Return the memory system ID number.
    ///
    unsigned getID() const;

    /// getName() - Return the memory system name for debugging.
    ///
    const char *getName() const;

    //===--------------------------------------------------------------------===//
    // Cache Level Information
    //

    /// Get the information for cache level Level (0-based).
    ///
    const TargetCacheLevelInfo &getCacheLevel(unsigned Level) const;

    /// getNumLevels - Return the number of cache levels this target has.
    ///
    unsigned getNumLevels() const;

    /// Cache level iterators
    ///
    cachelevel_iterator cachelevel_begin() const;
    cachelevel_iterator cachelevel_end() const;

    //===--------------------------------------------------------------------===//
    // Stream Buffer Information
    //
    const TargetStreamBufferInfo *getStreamBufferInfo() const;

    //===--------------------------------------------------------------------===//
    // Software Prefetcher Information
    //
    const TargetSoftwarePrefetcherInfo *getSoftwarePrefetcherInfo() const;
  };
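
As an example of client usage, a pass might walk the hierarchy to find
the innermost level large enough to hold a working set (a sketch;
``findFittingLevel`` is a hypothetical helper)::

  // Hypothetical helper: find the innermost cache level whose capacity
  // can hold a working set of the given size, or null if none can.
  const TargetCacheLevelInfo *
  findFittingLevel(const TargetMemorySystemInfo &M, unsigned WorkingSetBytes) {
    for (unsigned L = 0, E = M.getNumLevels(); L != E; ++L)
      if (M.getCacheLevel(L).getSize() >= WorkingSetBytes)
        return &M.getCacheLevel(L);
    return nullptr;
  }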

Execution engine model
----------------------
The memory model exists in conjunction with an execution engine model.
The execution engine currently describes how cache levels map to
execution resources.  It could hold more information outside the
memory system model (e.g. scheduling models) but currently does not.
It could become the place where information for existing
``TargetTransformInfo`` interfaces like ``getMaxInterleaveFactor``
lives.

Class ``TargetExecutionResourceInfo`` provides information about a
*unit of execution*::

  /// TargetExecutionResourceInfo - Provide information about a specific
  /// kind of execution resource (core, thread, etc.).
  ///
  class TargetExecutionResourceInfo {
    [...]
  public:
    typedef ... cachelevel_iterator;

    /// getID() - Return the resource ID number.
    ///
    unsigned getID() const;

    /// getName() - Return the resource name for debugging.
    ///
    const char *getName() const;

    /// getContained - Return information about the contained execution
    /// resource.
    ///
    TargetExecutionResourceInfo *getContained() const;

    /// getNumContained - Return the number of contained execution
    /// resources.
    ///
    unsigned getNumContained() const;

    /// Iterate over the levels of cache private to this resource.
    ///
    cachelevel_iterator cachelevel_begin() const;
    cachelevel_iterator cachelevel_end() const;
  };

Each execution resource may *contain* other execution resources.  For
example, a socket may contain multiple cores and a core may contain
multiple hardware threads (e.g. SMT contexts).  An execution resource
may have cache levels associated with it.  If so, that cache level is
private to the execution resource.  For example, the first-level cache
may be private to a core and shared by the threads within the core,
the second-level cache may be private to a socket and the third-level
cache may be shared by all sockets.
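
Under this containment scheme a client could, for example, count the
hardware thread contexts beneath a resource (a sketch, assuming
``getContained`` returns null for leaf resources)::

  // Sketch: recursively count the leaf (thread) contexts contained in
  // an execution resource; a leaf resource counts as one context.
  unsigned countLeafContexts(const TargetExecutionResourceInfo &R) {
    if (!R.getContained() || R.getNumContained() == 0)
      return 1;
    return R.getNumContained() * countLeafContexts(*R.getContained());
  }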

Class ``TargetExecutionEngineInfo`` aggregates execution resources
into a unified view of the target::

  /// TargetExecutionEngineInfo base class - We assume that the target
  /// defines a static array of TargetExecutionResourceInfo objects that
  /// represent all of the execution resources that the target has.  As
  /// such, we simply have to track a pointer to this array.
  ///
  class TargetExecutionEngineInfo {
  public:
    typedef ... resource_iterator;

    //===--------------------------------------------------------------------===//
    // Resource Information
    //

    /// getResource - Get an execution resource by resource ID.
    ///
    const TargetExecutionResourceInfo &getResource(unsigned Resource) const;

    /// getNumResources - Return the number of resources this target has.
    ///
    unsigned getNumResources() const;

    /// Resource iterators
    ///
    resource_iterator resource_begin() const;
    resource_iterator resource_end() const;
  };

The target execution engine allows optimizers to make intelligent
choices for cache optimization in the presence of parallelism, where
multiple threads may be competing for cache resources.
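
For example, when cache blocking for a level shared by several threads,
a pass might partition the capacity rather than assume exclusive use (a
sketch; the sharing count would come from the resource tree above)::

  #include <algorithm>

  // Sketch: effective capacity available to one thread when a cache
  // level is shared by ThreadsSharing hardware threads.
  unsigned effectiveCapacity(const TargetCacheLevelInfo &Level,
                             unsigned ThreadsSharing) {
    return Level.getSize() / std::max(1u, ThreadsSharing);
  }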

Currently the resource iterators will walk over all resources (cores,
threads, etc.).  Alternatively, we could say that iterators walk over
"top level" resources and contained resources must be accessed via
their containing resources.

Subtarget APIs
--------------
Memory and execution system info are accessed through new
TargetSubtarget methods::

  /// getMemorySystemInfo - Return information about the target memory
  /// system.  Memory information can include cache hierarchy
  /// organization as well as more exotic things.  Targets without an
  /// explicit model get the (empty) default model.
  ///
  const TargetMemorySystemInfo &getMemorySystemInfo() const {
    return *MemoryInfo;
  }

  /// getExecutionEngineInfo - Return information about the target
  /// execution engine.  Targets without an explicit model get the
  /// (empty) default engine.
  ///
  const TargetExecutionEngineInfo &getExecutionEngineInfo() const {
    return *ExecutionEngineInfo;
  }
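
For example, a pass could ask whether software prefetching is
worthwhile on the current subtarget (a sketch; ``shouldPrefetch`` is a
hypothetical helper, and the subtarget class is assumed to be
``TargetSubtargetInfo``)::

  // Hypothetical helper: query the subtarget's model to decide whether
  // software prefetching should be considered at all.
  bool shouldPrefetch(const TargetSubtargetInfo &ST) {
    const TargetMemorySystemInfo &MSI = ST.getMemorySystemInfo();
    const TargetSoftwarePrefetcherInfo *P = MSI.getSoftwarePrefetcherInfo();
    return P && P->isEnabled();
  }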

Target model implementation
===========================
In order to make describing a target memory model (relatively)
straightforward, we modified TableGen's ``SubtargetEmitter`` to allow
high-level memory model and execution resource descriptions.

Each target model inherits from a set of base TableGen classes.

Memory model base classes
-------------------------
A number of base TableGen classes exist which may be composed into
memory system descriptions::

  //===----------------------------------------------------------------------===//
  // Int - Give a name to an integer value for clarity.
  //===----------------------------------------------------------------------===//
  class Int<int value> {
    int Value = value;
  }

  //===----------------------------------------------------------------------===//
  // Cache Level - This represents a specific level within the cache hierarchy.
  //===----------------------------------------------------------------------===//
  class CacheLevel<string name, int size, int linesize, int ways, int latency> {
    string Name = name;
    int Size = size;
    int LineSize = linesize;
    int Ways = ways;
    int Latency = latency;
  }

  //===----------------------------------------------------------------------===//
  // Cache hierarchy - This models a specific cache hierarchy.
  //===----------------------------------------------------------------------===//
  class CacheHierarchy<list<CacheLevel> levels> {
    list<CacheLevel> Levels = levels;
  }

  def NoCaches : CacheHierarchy<[]>;

  //===----------------------------------------------------------------------===//
  // Common cache sizes
  //===----------------------------------------------------------------------===//
  def _1K   : Int<1024>;
  def _8K   : Int<8192>;
  def _16K  : Int<16384>;
  def _32K  : Int<32768>;
  def _64K  : Int<65536>;
  def _128K : Int<131072>;
  def _256K : Int<262144>;
  def _512K : Int<524288>;
  def _1M   : Int<1048576>;
  def _2M   : Int<2097152>;
  def _4M   : Int<4194304>;
  def _6M   : Int<6291456>;
  def _8M   : Int<8388608>;
  def _11M  : Int<11534336>;
  def _12M  : Int<12582912>;
  def _16M  : Int<16777216>;
  def _20M  : Int<20971520>;
  def _25M  : Int<26214400>;
  def _32M  : Int<33554432>;
  def _40M  : Int<41943040>;

  class L1Cache<int size, int linesize, int ways, int latency> :
    CacheLevel<"L1", size, linesize, ways, latency>;
  class L2Cache<int size, int linesize, int ways, int latency> :
    CacheLevel<"L2", size, linesize, ways, latency>;
  class L3Cache<int size, int linesize, int ways, int latency> :
    CacheLevel<"L3", size, linesize, ways, latency>;

  //===----------------------------------------------------------------------===//
  // StreamBuffer - This models streaming (non-temporal) resources
  //
  // Load              - The number of load buffers available.
  // MaxLoad           - The maximum number of active load streams before
  //                     streaming is shut off entirely (-1 => no limit).
  // Store             - The number of store buffers available.
  // MaxStore          - The maximum number of active store streams before
  //                     streaming is shut off entirely (-1 => no limit).
  // LoadStore         - The number of load-store buffers available.
  // MaxLoadStore      - The maximum number of active load-store streams before
  //                     streaming is shut off entirely (-1 => no limit).
  //===----------------------------------------------------------------------===//
  def Infinite : Int<-1>;

  class StreamBuffer<int l, int ml,
                     int s, int ms,
                     int ls, int mls> {
    int Load = l;
    int MaxLoad = ml;
    int Store = s;
    int MaxStore = ms;
    int LoadStore = ls;
    int MaxLoadStore = mls;
  }

  def DefaultStreamBuffers : StreamBuffer<0, Infinite.Value,  // Load
                                          0, Infinite.Value,  // Store
                                          0, Infinite.Value>; // Load-Store

  //===----------------------------------------------------------------------===//
  // SoftwarePrefetcher - This provides parameters to software prefetching.
  //===----------------------------------------------------------------------===//
  class SoftwarePrefetcher<int e, int d, int nd, int xd> {
    int Enabled = e;
    int Distance = d;
    int MinDistance = nd;
    int MaxDistance = xd;
  }

  def DefaultSoftwarePrefetcher : SoftwarePrefetcher<0, 0, 0, 0>;

  //===----------------------------------------------------------------------===//
  // MemorySystem - This models a memory subsystem.
  //===----------------------------------------------------------------------===//
  class MemorySystem<CacheHierarchy c, StreamBuffer s, SoftwarePrefetcher p> {
    CacheHierarchy Caches = c;
    StreamBuffer StreamBuffers = s;
    SoftwarePrefetcher Prefetcher = p;
  }

  def DefaultMemorySystem : MemorySystem<NoCaches, DefaultStreamBuffers,
                                         DefaultSoftwarePrefetcher>;

Execution engine base classes
-----------------------------
Similarly, classes exist for describing and composing execution
resources.

The ``Module`` execution resource below exists for AMD's Interlagos,
which is a bit of an oddball in that pairs of independent integer units
drive a shared floating-point unit.  A ``Module`` in that case captures
the two integer units plus the shared FP unit, while a ``Core`` models
one of the independent integer units.  The community may not care about
this distinction, or it may be a useful abstraction for various
targets.

::

  //===----------------------------------------------------------------------===//
  // Execution resource - This models a particular group of execution resources
  // (threads, cores, etc.).
  //===----------------------------------------------------------------------===//
  class ExecutionResource<string name, string contained, int num> {
    string Name = name;
    string Contained = contained;
    int NumContained = num;
  }

  //===----------------------------------------------------------------------===//
  // Common execution resources
  //===----------------------------------------------------------------------===//
  class Thread : ExecutionResource<"Thread", "", 0>;
  class Core<int numthreads> : ExecutionResource<"Core", "Thread", numthreads>;
  class Module<int numcores> : ExecutionResource<"Module", "Core", numcores>;
  class Socket<int nummodules> : ExecutionResource<"Socket", "Module", nummodules>;

  //===----------------------------------------------------------------------===//
  // Execution module - This couples an ExecutionResource with CacheLevels.
  // The given cache levels are all private to the ExecutionResource and
  // shared by any contained ExecutionResources.
  //===----------------------------------------------------------------------===//
  class ExecutionModule<ExecutionResource resource, list<CacheLevel> cache> {
    ExecutionResource Resource = resource;
    list<CacheLevel> Cache = cache;
  }

  //===----------------------------------------------------------------------===//
  // Execution engine - This models a collection of execution modules.
  //===----------------------------------------------------------------------===//
  class ExecutionEngine<list<ExecutionModule> resources> {
    list<ExecutionModule> Resources = resources;
  }

  def NoExecutionEngine : ExecutionEngine<[]>;

Example target model
--------------------
As an example, here is a definition of a memory and execution resource
model for a fictitious processor, modeled after how the X86 targets
are organized::

  //===-------------------------
  // ShyEnigma execution engine

  // Give an unlimited number of load buffers since the hardware
  // simply sets hint bits in the cache.  There are 4 hardware store
  // streaming buffers.  Set load-store buffer counts to store
  // buffer counts since that's the limiting resource.
  //
  def ShyEnigmaStreamBuffers : StreamBuffer<Infinite.Value, Infinite.Value,  // Load
                                            4,              Infinite.Value,  // Store
                                            4,              Infinite.Value>; // Load-Store

  // Enable data prefetching.  Generally prefetch 8 lines ahead but never
  // more than 32 lines ahead.
  //
  def ShyEnigmaSoftwarePrefetcher : SoftwarePrefetcher<1, 8, 0, 32>;

  // Cache lines are 64 bytes with latencies of 1, 9 and 11 cycles.
  //
  def ShyEnigmaCacheHierarchy : CacheHierarchy<[
    L1Cache<_8K.Value, 64, 4, 1>,
    L2Cache<_32K.Value, 64, 8, 9>,
    L3Cache<_11M.Value, 64, 10, 11>
  ]>;

  def ShyEnigmaMemorySystem : MemorySystem<
    ShyEnigmaCacheHierarchy, ShyEnigmaStreamBuffers, ShyEnigmaSoftwarePrefetcher
  >;

  // Cache levels 1 and 2 are private to a core and shared by threads in the core.
  // Cache level 3 is shared by all cores.
  //
  // Each ShyEnigma socket has 8 cores with 2 threads per core.
  //
  def ShyEnigmaExecutionEngine : ExecutionEngine<[
    ExecutionModule<Core<2>,   [ShyEnigmaCacheHierarchy.Levels[0],
                                ShyEnigmaCacheHierarchy.Levels[1]]>,
    ExecutionModule<Module<1>, []>, // FIXME: Module and Core definitions above
                                    // force this though it's not needed.
    ExecutionModule<Socket<8>, [ShyEnigmaCacheHierarchy.Levels[2]]>
  ]>;

Here we see one of the flaws in the model.  Because of the way
``Socket``, ``Module`` and ``Thread`` are defined above, we're forced
to include a ``Module`` level even though it really doesn't make sense
for our ShyEnigma processor.  A ``Core`` has two ``Thread`` resources,
a ``Module`` has one ``Core`` resource and a ``Socket`` has eight
``Module`` resources.  In reality, a ShyEnigma core has two threads
and a ShyEnigma socket has eight cores.  At least for this SKU (more
on that below).

Modified ``Processor``, ``ProcessorModel`` and ``ProcModel`` classes
tie this into the rest of the target description information (again,
following X86 target conventions)::

  class Processor<string n,
                  ProcessorItineraries pi,
                  list<SubtargetFeature> f,
                  MemorySystem m = DefaultMemorySystem,
                  ExecutionEngine e = NoExecutionEngine> {
    // Name - Chip set name.  Used by command line (-mcpu=) to determine the
    // appropriate target chip.
    //
    string Name = n;

    // SchedModel - The machine model for scheduling and instruction cost.
    //
    SchedMachineModel SchedModel = NoSchedModel;

    // ProcItin - The scheduling information for the target processor.
    //
    ProcessorItineraries ProcItin = pi;

    // Features - List of subtarget features.
    //
    list<SubtargetFeature> Features = f;

    // Memory - Information about the memory system.
    //
    MemorySystem Memory = m;

    // Engine - A list of execution resources describing various execution
    // contexts (threads, cores, etc.) and associated caches.
    //
    ExecutionEngine Engine = e;
  }

  class ProcessorModel<string n,
                       SchedMachineModel m,
                       list<SubtargetFeature> f,
                       MemorySystem mem = DefaultMemorySystem,
                       ExecutionEngine e = NoExecutionEngine>
      : Processor<n, NoItineraries, f, mem, e> {
    let SchedModel = m;
  }

  class ProcModel<string Name, SchedMachineModel Model,
                  list<SubtargetFeature> ProcFeatures,
                  list<SubtargetFeature> OtherFeatures,
                  MemorySystem MemoryModel,
                  ExecutionEngine EngineModel> :
    ProcessorModel<Name,
                   Model,
                   !listconcat(ProcFeatures, OtherFeatures),
                   MemoryModel,
                   EngineModel>;

  class ShyEnigmaProc<string Name> : ProcModel<Name, ShyEnigmaSchedModel,
                                               ShyEnigmaFeatures.Value,
                                               [ProcShyEnigma,
                                                FeatureHasWarpSpeed,
                                                FeatureHasNoStalls],
                                               ShyEnigmaMemorySystem,
                                               ShyEnigmaExecutionEngine>;

  def : ShyEnigmaProc<"shy-enigma">;

An open question is how to handle different SKUs within a subtarget
family.  We modeled the limited number of SKUs used in our products
via multiple subtargets, so this wasn't a heavy burden for us, but a
more robust implementation might allow for multiple ``MemorySystem``
and/or ``ExecutionEngine`` models for a given subtarget.  It's not yet
clear whether that's a good/necessary thing and if it is, how to
specify it with a compiler switch.  ``-mcpu=shy-enigma
-some-switch-to-specify-memory-and-execution-models``?  It may very
well be sufficient to have a general system model that applies
relatively well over multiple SKUs.

TableGen changes
================
A modified TableGen subtarget emitter understands the execution engine
base classes and knows how to emit lookup tables for target resources.
There is an ``Emit`` method for each type of resource table
(``EmitCacheHierarchies``, ``EmitStreamBuffers``, ``EmitPrefetchers``,
``EmitMemorySystems``, ``EmitExecutionEngines``).

We also enhanced ``EmitProcessorLookup`` to handle the mapping of
``Processor`` descriptions to resource models.

Client usage
============
A pass can get at the memory model via the ``TargetSubtarget`` APIs,
via existing ``TargetTransformInfo`` interfaces, via new
``TargetTransformInfo`` interfaces that forward to the
``TargetSubtarget`` APIs, and via new interfaces the system model may
provide in the future.
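
As a sketch of such forwarding (hypothetical; ``getCacheLineSize`` here
is a free-standing illustration, not the actual TTI hook)::

  // Sketch: forward an existing TTI-style query to the proposed
  // subtarget model; returns 0 if no cache levels are modeled.
  unsigned getCacheLineSize(const TargetSubtargetInfo &ST) {
    const TargetMemorySystemInfo &MSI = ST.getMemorySystemInfo();
    return MSI.getNumLevels() ? MSI.getCacheLevel(0).getLineSize() : 0;
  }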

Re: [llvm-dev] RFC: System (cache, etc.) model for LLVM

David Jones via llvm-dev
Hi,

thank you for sharing the system hierarchy model. IMHO it makes a lot
of sense, although I don't know which of today's passes would make use
of it. Here are my remarks.

I am wondering how one could model the following features using this
model, or whether they should be part of a performance model at all:

 * ARM's big.LITTLE

 * NUMA hierarchies (are the NUMA domains 'caches'?)

 * Total available RAM

 * remote memory (e.g. RAM on an accelerator mapped into the address space)

 * scratch pad

 * write-back / write-through write buffers

 * page size

 * TLB capacity

 * constructive/destructive interference
(https://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size)
   Some architectures have instructions to zero entire cache lines,
e.g. dcbz on PowerPC, but they require the assumed cache line size to
be correct. Also see https://www.mono-project.com/news/2016/09/12/arm64-icache/

 * Instruction cache



On Tue, Oct 30, 2018 at 3:27 PM David Greene via llvm-dev
<[hidden email]> wrote:
>   class TargetCacheLevelInfo {
>     /// getWays - Return the number of ways.
>     ///
>     unsigned getWays() const;

That is, associativity?

Bandwidth might be a useful addition, e.g. if a performance analysis
tool uses the roofline model.



>   class TargetSoftwarePrefetcherInfo {
>     /// Should we do software prefetching at all?
>     ///
>     bool isEnabled() const;

isEnabled sounds like something configurable at runtime.


>     /// Provide a general prefetch distance hint.
>     ///
>     unsigned getDistance() const;
>
>     /// Prefetch at least this far ahead.
>     ///
>     unsigned getMinDistance() const;
>
>     /// Prefetch at most this far ahead.
>     ///
>     unsigned getMaxDistance() const;
>   };
>
> ``get*Distance`` APIs provide general hints to guide the software
> prefetcher.  The software prefetcher may choose to ignore them.
> getMinDistance and getMaxDistance act as clamps to ensure the software
> prefetcher doesn't do something wholly inappropriate.
>
> Distances are specified in terms of cache lines.  The current
> ``TargetTransformInfo`` interfaces speak in terms of instructions or
> iterations ahead.  Both can be useful and so we may want to add
> iteration and/or instruction distances to this interface.

Would it make sense to specify a prefetch distance in bytes instead of
cache lines? The cache line size might not be known at compile time
(e.g. ARM big.LITTLE), but it might still make sense to do software
prefetching.




>   class TargetStreamBufferInfo {
>     /// getNumLoadBuffers - Return the number of load buffers available.
>     /// This is the number of simultaneously active independent load
>     /// streams the processor can handle before degrading performance.
>     ///
>     int getNumLoadBuffers() const;
>
>     /// getMaxNumLoadBuffers - Return the maximum number of load
>     /// streams that may be active before shutting off streaming
>     /// entirely.  -1 => no limit.
>     ///
>     int getMaxNumLoadBuffers() const;
>
>     /// getNumStoreBuffers - Return the effective number of store
>     /// buffers available.  This is the number of simultaneously
>     /// active independent store streams the processor can handle
>     /// before degrading performance.
>     ///
>     int getNumStoreBuffers() const;
>
>     /// getMaxNumStoreBuffers - Return the maximum number of store
>     /// streams that may be active before shutting off streaming
>     /// entirely.  -1 => no limit.
>     ///
>     int getMaxNumStoreBuffers() const;
>
>     /// getNumLoadStoreBuffers - Return the effective number of
>     /// buffers available for streams that both load and store data.
>     /// This is the number of simultaneously active independent
>     /// load-store streams the processor can handle before degrading
>     /// performance.
>     ///
>     int getNumLoadStoreBuffers() const;
>
>     /// getMaxNumLoadStoreBuffers - Return the maximum number of
>     /// load-store streams that may be active before shutting off
>     /// streaming entirely.  -1 => no limit.
>     ///
>     int getMaxNumLoadStoreBuffers() const;
>   };
>
> Code uses the ``getMax*Buffers`` APIs to judge whether streaming
> should be done at all.  For example, if the number of streams in the
> code greatly outweighs the hardware buffers available, it makes little
> sense to do streaming.  Performance will be dominated by the streams
> that don't make use of the hardware, and the streams that do make use
> of the hardware may actually perform worse.

What counts as a stream? Some processors may support streams with
strides and/or backward streams.

Is there a way to express at which level the streams are shared? For
instance, a core might be able to track 16 streams, but if 4 threads
are running (SMT), each can only use 4.

PowerPC's dcbt/dcbtst instructions allow explicitly specifying to the
hardware which streams it should establish. Do the buffer counts
include explicitly and automatically established streams? Do
non-stream accesses (e.g. stack accesses) count towards these limits?


>   class TargetMemorySystemInfo {
>     const TargetCacheLevelInfo &getCacheLevel(unsigned Level) const;
>
>     /// getNumLevels - Return the number of cache levels this target has.
>     ///
>     unsigned getNumLevels() const;
>
>     /// Cache level iterators
>     ///
>     cachelevel_iterator cachelevel_begin() const;
>     cachelevel_iterator cachelevel_end() const;

May users of this class assume that a level refers to a specific
cache, e.g. getCacheLevel(0) being the L1 cache? Or do they have to
search for a cache of a specific size?


>     //===--------------------------------------------------------------------===//
>     // Stream Buffer Information
>     //
>     const TargetStreamBufferInfo *getStreamBufferInfo() const;
>
>     //===--------------------------------------------------------------------===//
>     // Software Prefetcher Information
>     //
>     const TargetSoftwarePrefetcherInfo *getSoftwarePrefetcherInfo() const;

Would it make sense to have one PrefetcherInfo/StreamBuffer per cache
level? Some ISAs have multiple prefetchers/prefetch instructions
for different levels.


>   class TargetExecutionResourceInfo {
>     /// getContained - Return information about the contained execution
>     /// resource.
>     ///
>     TargetExecutionResourceInfo *getContained() const;
>
>     /// getNumContained - Return the number of contained execution
>     /// resources.
>     ///
>     unsigned getNumContained() const;

Shouldn't the level itself specify how many of its resources there
are, instead of its parent?
This would make TargetExecutionEngineInfo::getNumResources() redundant.

E.g. assume that "Socket" is the outermost resource level. The number
of sockets in the system could be returned by its
TargetExecutionResourceInfo instead of
TargetExecutionEngineInfo::getNumResources().


>   };
>
> Each execution resource may *contain* other execution resources.  For
> example, a socket may contain multiple cores and a core may contain
> multiple hardware threads (e.g. SMT contexts).  An execution resource
> may have cache levels associated with it.  If so, that cache level is
> private to the execution resource.  For example the first-level cache
> may be private to a core and shared by the threads within the core,
> the second-level cache may be private to a socket and the third-level
> cache may be shared by all sockets.

Should there be an indicator whether a resource is shared or separate?
E.g. SMT threads (and AMD "Modules") share functional units, but
cores/sockets do not.


>   /// TargetExecutionEngineInfo base class - We assume that the target
>   /// defines a static array of TargetExecutionResourceInfo objects that
>   /// represent all of the execution resources that the target has.  As
>   /// such, we simply have to track a pointer to this array.
>   ///
>   class TargetExecutionEngineInfo {
>   public:
>     typedef ... resource_iterator;
>
>     //===--------------------------------------------------------------------===//
>     // Resource Information
>     //
>
>     /// getResource - Get an execution resource by resource ID.
>     ///
>     const TargetExecutionResourceInfo &getResource(unsigned Resource) const;
>
>     /// getNumResources - Return the number of resources this target has.
>     ///
>     unsigned getNumResources() const;
>
>     /// Resource iterators
>     ///
>     resource_iterator resource_begin() const;
>     resource_iterator resource_end() const;
>   };
>
> The target execution engine allows optimizers to make intelligent
> choices for cache optimization in the presence of parallelism, where
> multiple threads may be competing for cache resources.

Do you have examples of what optimizations make use of this
information? It sounds like this info is more relevant to the OS
scheduler than to the compiler.

> Currently the resource iterators will walk over all resources (cores,
> threads, etc.).  Alternatively, we could say that iterators walk over
> "top level" resources and contained resources must be accessed via
> their containing resources.

Most of the time programs are not compiled for specific system
configurations (number of sockets, how many cores the processor has,
or how many threads the OS allows the program to run), meaning this
information will usually be unknown at compile time.
What is the intention? Pass the system configuration as a flag to the
compiler? Is it only available while JITing?



> Here we see one of the flaws in the model.  Because of the way
> ``Socket``, ``Module`` and ``Thread`` are defined above, we're forced
> to include a ``Module`` level even though it really doesn't make sense
> for our ShyEnigma processor.  A ``Core`` has two ``Thread`` resources,
> a ``Module`` has one ``Core`` resource and a ``Socket`` has eight
> ``Module`` resources.  In reality, a ShyEnigma core has two threads
> and a ShyEnigma socket has eight cores.  At least for this SKU (more
> on that below).

Is this a restriction of TableGen? If the "Module" level is not
required, could the SubtargetInfo just return Socket->Thread? Or is
there a global requirement that every architecture has to define the
same number of levels?


> An open question is how to handle different SKUs within a subtarget
> family.  We modeled the limited number of SKUs used in our products
> via multiple subtargets, so this wasn't a heavy burden for us, but a
> more robust implementation might allow for multiple ``MemorySystem``
> and/or ``ExecutionEngine`` models for a given subtarget.  It's not yet
> clear whether that's a good/necessary thing and if it is, how to
> specify it with a compiler switch.  ``-mcpu=shy-enigma
> -some-switch-to-specify-memory-and-execution-models``?  It may very
> well be sufficient to have a general system model that applies
> relatively well over multiple SKUs.

Adding more specific subtargets with more refined execution models
seems fine to me.
But is it reasonable to manage a database of all processors ever
produced in the compiler?



Michael

Re: [llvm-dev] RFC: System (cache, etc.) model for LLVM

David Jones via llvm-dev
Michael, thank you for commenting!  Responses inline.

Let's continue discussing and if this seems like a reasonable way to
proceed, I can start posting patches for review.

                              -David

Michael Kruse <[hidden email]> writes:

> thank you for sharing the system hierarchy model. IMHO it makes a lot
> of sense, although I don't know which of today's passes would make use
> of it. Here are my remarks.

LoopDataPrefetch would use it via the existing TTI interfaces, but I
think that's about it for now.  It's a bit of a chicken-and-egg, in that
passes won't use it if it's not there and there's no push to get it in
because few things use it.  :)

> I am wondering how one could model the following features using this
> model, or whether they should be part of a performance model at all:
>
>  * ARM's big.LITTLE

How is this modeled in the current AArch64 .td files?  The current
design doesn't capture heterogeneity at all, not because we're not
interested but simply because our compiler captures that at a higher
level outside of LLVM.

>  * NUMA hierarchies (are the NUMA domains 'caches'?)
>
>  * Total available RAM
>
>  * remote memory (e.g. RAM on an accelerator mapped into the address space)
>
>  * scratch pad

I expect we would expand TargetMemorySystemInfo to hold different kinds
of memory-related things.  Each of these could be a memory resource.  Or
maybe we would want something that lives "next to"
TargetMemorySystemInfo.

>  * write-back / write-through write buffers

Do you mean for caches, or something else?

>  * page size
>
>  * TLB capacity

>  * constructive/destructive interference
> (https://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size)
>    Some architecture have instructions to zero entire cache lines,
> e.g. dcbz on PowerPC, but it requires the cache line to be correct.
> Also see https://www.mono-project.com/news/2016/09/12/arm64-icache/
>
>  * Instruction cache

These could go into TargetMemorySystemInfo I think.

> On Tue, Oct 30, 2018 at 3:27 PM David Greene via llvm-dev
> <[hidden email]> wrote:
>>   class TargetCacheLevelInfo {
>>     /// getWays - Return the number of ways.
>>     ///
>>     unsigned getWays() const;
>
> That is, associativity?

Yes.  Naming is certainly flexible.

> Bandwidth might be a useful addition, e.g. if a performance analysis
> tool uses the roofline model.

Yes.

>>   class TargetSoftwarePrefetcherInfo {
>>     /// Should we do software prefetching at all?
>>     ///
>>     bool isEnabled() const;
>
> isEnabled sounds like something configurable at runtime.

Currently we use it to allow some subtargets to do software prefetching
and prevent it for others.  I see how the name could be confusing
though.  Maybe ShouldDoPrefetching?

>> ``get*Distance`` APIs provide general hints to guide the software
>> prefetcher.  The software prefetcher may choose to ignore them.
>> getMinDistance and getMaxDistance act as clamps to ensure the software
>> prefetcher doesn't do something wholly inappropriate.
>>
>> Distances are specified in terms of cache lines.  The current
>> ``TargetTransformInfo`` interfaces speak in terms of instructions or
>> iterations ahead.  Both can be useful and so we may want to add
>> iteration and/or instruction distances to this interface.
>
> Would it make sense to specify a prefetch distance in bytes instead of
> cache lines? The cache line size might not be known at compile time
> (e.g. ARM big.LITTLE), but it might still make sense to do software
> prefetching.

Sure, I think that would make sense.

>> Code uses the ``getMax*Buffers`` APIs to judge whether streaming
>> should be done at all.  For example, if the number of streams in the
>> code greatly outweighs the hardware buffers available, it makes little
>> sense to do streaming.  Performance will be dominated by the streams
>> that don't make use of the hardware, and the streams that do make use
>> of the hardware may actually perform worse.
>
> What counts as a stream? Some processors may support streams with
> strides and/or backward streams.

Yes.  We may want some additional information here to describe the
hardware's capability.

> Is there a way to express at which level the streams are shared? For
> instance, a core might be able to track 16 streams, but if 4 threads
> are running (SMT), each can only use 4.

I suppose we could couple the streaming information to an execution
resource, similar to what is done with cache levels to express this kind
of sharing.  We haven't found a need for it but that doesn't mean it
wouldn't be useful for other/new targets.

> PowerPC's dcbt/dcbtst instructions allow explicitly specifying to the
> hardware which streams it should establish. Do the buffer counts
> include explicitly and automatically established streams? Do
> non-stream accesses (e.g. stack accesses) count towards these limits?

It's up to the target maintainer to decide what the numbers mean.
Obviously passes have to have some notion of what things mean.  The
thing that establishes what a "stream" is in the user program lives
outside of the system model.  It may or may not consider random stack
accesses as part of a stream.

This is definitely an area for exploration.  Since we only have machines
with two major targets, we didn't need to contend with more exotic
things.  :)

>>   class TargetMemorySystemInfo {
>>     const TargetCacheLevelInfo &getCacheLevel(unsigned Level) const;
>>
>>     /// getNumLevels - Return the number of cache levels this target has.
>>     ///
>>     unsigned getNumLevels() const;
>>
>>     /// Cache level iterators
>>     ///
>>     cachelevel_iterator cachelevel_begin() const;
>>     cachelevel_iterator cachelevel_end() const;
>
> May users of this class assume that a level refers to a specific
> cache, e.g. getCacheLevel(0) being the L1 cache? Or do they have to
> search for a cache of a specific size?

The intent is that getCacheLevel(0) is the L1 cache, getCacheLevel(1) is
the L2 cache and so on.

>>     //===--------------------------------------------------------------------===//
>>     // Stream Buffer Information
>>     //
>>     const TargetStreamBufferInfo *getStreamBufferInfo() const;
>>
>>     //===--------------------------------------------------------------------===//
>>     // Software Prefetcher Information
>>     //
>>     const TargetSoftwarePrefetcherInfo *getSoftwarePrefetcherInfo() const;
>
> Would it make sense to have one PrefetcherInfo/StreamBuffer per cache
> level? Some ISAs have multiple prefetchers/prefetch instructions
> for different levels.

Probably.  Most X86 implementations direct all data prefetches to the
same cache level so we didn't find a need to model this, but it makes
sense to allow for it.

>>   class TargetExecutionResourceInfo {
>>     /// getContained - Return information about the contained execution
>>     /// resource.
>>     ///
>>     TargetExecutionResourceInfo *getContained() const;
>>
>>     /// getNumContained - Return the number of contained execution
>>     /// resources.
>>     ///
>>     unsigned getNumContained() const;
>
> Shouldn't the level itself specify how many of its resources there
> are, instead of its parent?
> This would make TargetExecutionEngineInfo::getNumResources() redundant.
>
> E.g. assume that "Socket" is the outermost resource level. The number
> of sockets in the system could be returned by its
> TargetExecutionResourceInfo instead of
> TargetExecutionEngineInfo::getNumResources().

That could work I think and would probably be a bit easier to
understand.

>>   };
>>
>> Each execution resource may *contain* other execution resources.  For
>> example, a socket may contain multiple cores and a core may contain
>> multiple hardware threads (e.g. SMT contexts).  An execution resource
>> may have cache levels associated with it.  If so, that cache level is
>> private to the execution resource.  For example the first-level cache
>> may be private to a core and shared by the threads within the core,
>> the second-level cache may be private to a socket and the third-level
>> cache may be shared by all sockets.
>
> Should there be an indicator whether a resource is shared or separate?
> E.g. SMT threads (and AMD "Modules") share functional units, but
> cores/sockets do not.

Interesting idea.  I suppose we could model that with another resource
type similar to the way caches are handled.  Then the resources could be
coupled to execution resources to express the sharing.  We hadn't found
a need for this level of detail in the work we've done but it could be
useful for lots of things.

>>   /// TargetExecutionEngineInfo base class - We assume that the target
>>   /// defines a static array of TargetExecutionResourceInfo objects that
>>   /// represent all of the execution resources that the target has.  As
>>   /// such, we simply have to track a pointer to this array.
>>   ///
>>   class TargetExecutionEngineInfo {
>>   public:
>>     typedef ... resource_iterator;
>>
>>     //===--------------------------------------------------------------------===//
>>     // Resource Information
>>     //
>>
>>     /// getResource - Get an execution resource by resource ID.
>>     ///
>>     const TargetExecutionResourceInfo &getResource(unsigned Resource) const;
>>
>>     /// getNumResources - Return the number of resources this target has.
>>     ///
>>     unsigned getNumResources() const;
>>
>>     /// Resource iterators
>>     ///
>>     resource_iterator resource_begin() const;
>>     resource_iterator resource_end() const;
>>   };
>>
>> The target execution engine allows optimizers to make intelligent
>> choices for cache optimization in the presence of parallelism, where
>> multiple threads may be competing for cache resources.
>
> Do you have examples of what optimizations make use of this
> information? It sounds like this info is more relevant to the OS
> scheduler than to the compiler.

Sure.  Cache blocking is one.  Let's assume an L2 cache shared among
cores.  Let's also assume the program is going to use threads within a
core.  You wouldn't want the compiler to cache block assuming the whole
size of L2, you'd want to cache block for some partition of L2 given the
execution resources the code is going to use.

>> Currently the resource iterators will walk over all resources (cores,
>> threads, etc.).  Alternatively, we could say that iterators walk over
>> "top level" resources and contained resources must be accessed via
>> their containing resources.
>
> Most of the time programs are not compiled for specific system
> configurations (number of sockets, how many cores the processor has,
> or how many threads the OS allows the program to run), meaning this
> information will usually be unknown at compile time.
> What is the intention? Pass the system configuration as a flag to the
> compiler? Is it only available while JITing?

On our machines it is very common for customers to compile for specific
system configurations and we provide pre-canned compiler configurations
to make it convenient to do so.  Every 1% speedup matters in HPC.  :)

This certainly could be used in a JIT but that wasn't the motivation for
the design.

>> Here we see one of the flaws in the model.  Because of the way
>> ``Socket``, ``Module`` and ``Thread`` are defined above, we're forced
>> to include a ``Module`` level even though it really doesn't make sense
>> for our ShyEnigma processor.  A ``Core`` has two ``Thread`` resources,
>> a ``Module`` has one ``Core`` resource and a ``Socket`` has eight
>> ``Module`` resources.  In reality, a ShyEnigma core has two threads
>> and a ShyEnigma socket has eight cores.  At least for this SKU (more
>> on that below).
>
> Is this a restriction of TableGen? If the "Module" level is not
> required, could the SubtargetInfo just return Socket->Thread? Or is
> there a global requirement that every architecture has to define the
> same number of levels?

No, the number of levels isn't fixed.  The issue is the way that Socket
is defined:

  class Module<int numcores> : ExecutionResource<"Module", "Core", numcores>;
  class Socket<int nummodules> : ExecutionResource<"Socket", "Module", nummodules>;

It refers to "Module" by name.  The TableGen backend picks up on this
and connects the resources appropriately.  This is definitely something
that will need work as patches are developed.  It's possible that your
idea of e.g. shared function units above could capture this.

>> An open question is how to handle different SKUs within a subtarget
>> family.  We modeled the limited number of SKUs used in our products
>> via multiple subtargets, so this wasn't a heavy burden for us, but a
>> more robust implementation might allow for multiple ``MemorySystem``
>> and/or ``ExecutionEngine`` models for a given subtarget.  It's not yet
>> clear whether that's a good/necessary thing and if it is, how to
>> specify it with a compiler switch.  ``-mcpu=shy-enigma
>> -some-switch-to-specify-memory-and-execution-models``?  It may very
>> well be sufficient to have a general system model that applies
>> relatively well over multiple SKUs.
>
> Adding more specific subtargets with more refined execution models
> seems fine to me.
> But is it reasonable to manage a database of all processors ever
> produced in the compiler?

No it is not.  :)  That's why this is an open question.  We've found it
perfectly adequate to define a single model for each major processor
generation, but as I said we use a limited number of SKUs.  We will
need input from the community on this.

Re: [llvm-dev] RFC: System (cache, etc.) model for LLVM

David Jones via llvm-dev
On Thu, Nov 1, 2018 at 3:21 PM David Greene <[hidden email]> wrote:
> > thank you for sharing the system hierarchy model. IMHO it makes a lot
> > of sense, although I don't know which of today's passes would make use
> > of it. Here are my remarks.
>
> LoopDataPrefetch would use it via the existing TTI interfaces, but I
> think that's about it for now.  It's a bit of a chicken-and-egg, in that
> passes won't use it if it's not there and there's no push to get it in
> because few things use it.  :)

What kinds of passes are using it in the Cray compiler?


> > I am wondering how one could model the following features using this
> > model, or whether they should be part of a performance model at all:
> >
> >  * ARM's big.LITTLE
>
> How is this modeled in the current AArch64 .td files?  The current
> design doesn't capture heterogeneity at all, not because we're not
> interested but simply because our compiler captures that at a higher
> level outside of LLVM.

AFAIK it is not handled at all. Any architecture that supports
big.LITTLE will return 0 on getCacheLineSize(). See
AArch64Subtarget::initializeProperties().


> >  * write-back / write-through write buffers
>
> Do you mean for caches, or something else?

https://en.wikipedia.org/wiki/Cache_%28computing%29#Writing_policies

Basically, with write-through, every store is a non-temporal store (or
every temporal store is a write-through, depending on how you view it).




> >>   class TargetSoftwarePrefetcherInfo {
> >>     /// Should we do software prefetching at all?
> >>     ///
> >>     bool isEnabled() const;
> >
> > isEnabled sounds like something configurable at runtime.
>
> Currently we use it to allow some subtargets to do software prefetching
> and prevent it for others.  I see how the name could be confusing
> though.  Maybe ShouldDoPrefetching?

isPrefetchingProfitable()?

If it is a hardware property:
isSupported()
(i.e. the prefetch instruction would be a no-op on other implementations)



> > Is there a way on which level the number of streams are shared? For
> > instance, a core might be able to track 16 streams, but if 4 threads
> > are running (SMT), each can only use 4.
>
> I suppose we could couple the streaming information to an execution
> resource, similar to what is done with cache levels to express this kind
> of sharing.  We haven't found a need for it but that doesn't mean it
> wouldn't be useful for other/new targets.

The example above is IBM's Blue Gene/Q processor, so yes, such targets do exist.

> > PowerPC's dcbt/dcbtst instruction allows explicitly specifying to the
> > hardware which streams it should establish. Do the buffer counts
> > include explicitly and automatically established streams? Do
> > non-stream accesses (e.g. stack access) count towards
>
> It's up to the target maintainer to decide what the numbers mean.
> Obviously passes have to have some notion of what things mean.  The
> thing that establishes what a "stream" is in the user program lives
> outside of the system model.  It may or may not consider random stack
> accesses as part of a stream.
>
> This is definitely an area for exploration.  Since we only have machines
> with two major targets, we didn't need to contend with more exotic
> things.  :)

IMHO it would be good if passes and targets agreed on an interpretation
of this number when designing the interface.

Again, from the Blue Gene/Q: What counts as a stream is configurable
at runtime via a hardware register. It supports 3 settings:
* Interpret every memory access as the start of a stream
* Establish a stream when there are 2 consecutive cache misses
* Only establish streams via dcbt instructions.


> >>   class TargetMemorySystemInfo {
> >>     const TargetCacheLevelInfo &getCacheLevel(unsigned Level) const;
> >>
> >>     /// getNumLevels - Return the number of cache levels this target has.
> >>     ///
> >>     unsigned getNumLevels() const;
> >>
> >>     /// Cache level iterators
> >>     ///
> >>     cachelevel_iterator cachelevel_begin() const;
> >>     cachelevel_iterator cachelevel_end() const;
> >
> > May users of this class assume that a level refers to a specific
> > cache. E.g. getCacheLevel(0) being the L1 cache. Or so they have to
> > search for a cache of a specific size?
>
> The intent is that getCacheLevel(0) is the L1 cache, getCacheLevel(1) is
> the L2 cache and so on.

Can passes rely on it?

> >>     //===--------------------------------------------------------------------===//
> >>     // Stream Buffer Information
> >>     //
> >>     const TargetStreamBufferInfo *getStreamBufferInfo() const;
> >>
> >>     //===--------------------------------------------------------------------===//
> >>     // Software Prefetcher Information
> >>     //
> >>     const TargetSoftwarePrefetcherInfo *getSoftwarePrefetcherInfo() const;
> >
> > Would it make sense to have one PrefetcherInfo/StreamBuffer per cache
> > level? Some ISA have multiple prefetchers/prefetch instructructions
> > for different levels.
>
> Probably.  Most X86 implementations direct all data prefetches to the
> same cache level so we didn't find a need to model this, but it makes
> sense to allow for it.

Again the Blue Gene/Q: Streams prefetch into the L1P cache (P for
prefetch), but a dcbt instruction is necessary to establish the cache
line into the L1 cache.

> >> An open question is how to handle different SKUs within a subtarget
> >> family.  We modeled the limited number of SKUs used in our products
> >> via multiple subtargets, so this wasn't a heavy burden for us, but a
> >> more robust implementation might allow for multiple ``MemorySystem``
> >> and/or ``ExecutionEngine`` models for a given subtarget.  It's not yet
> >> clear whether that's a good/necessary thing and if it is, how to
> >> specify it with a compiler switch.  ``-mcpu=shy-enigma
> >> -some-switch-to-specify-memory-and-execution-models``?  It may very
> >> well be sufficient to have a general system model that applies
> >> relatively well over multiple SKUs.
> >
> > Adding more specific subtargets with more refined execution models
> > seems fine to me.
> > But is it reasonable to manage a database of all processors ever
> > produced in the compiler?
>
> No it is not.  :)  That's why this is an open question.  We've found it
> perfectly adequate to define a single model for each major processor
> generation, but as I said we use a limited number of SKUs.  We will
> need input from the community on this.

Independently of whether subtargets for SKUs are added, could we
(also) define these parameters via the command line, like xlc's
-qcache option?

Michael

Re: [llvm-dev] RFC: System (cache, etc.) model for LLVM

David Jones via llvm-dev
Michael Kruse via llvm-dev <[hidden email]> writes:

> On Thu, Nov 1, 2018 at 15:21, David Greene <[hidden email]> wrote:
>> > thank you for sharing the system hierarchy model. IMHO it makes a lot
>> > of sense, although I don't know which of today's passes would make use
>> > of it. Here are my remarks.
>>
>> LoopDataPrefetch would use it via the existing TTI interfaces, but I
>> think that's about it for now.  It's a bit of a chicken-and-egg, in that
>> passes won't use it if it's not there and there's no push to get it in
>> because few things use it.  :)
>
> What kinds of passes use it in the Cray compiler?

Not sure how much I can say about that, unfortunately.

>> > I am wondering how one could model the following features using this
>> > model, or whether they should be part of a performance model at all:
>> >
>> >  * ARM's big.LITTLE
>>
>> How is this modeled in the current AArch64 .td files?  The current
>> design doesn't capture heterogeneity at all, not because we're not
>> interested but simply because our compiler captures that at a higher
>> level outside of LLVM.
>
> AFAIK it is not handled at all. Any architecture that supports
> big.LITTLE will return 0 on getCacheLineSize(). See
> AArch64Subtarget::initializeProperties().

Ok.  I would like to start posting patches for review without
speculating too much on fancy/exotic things that may come later.  We
shouldn't do anything that precludes extensions but I don't want to get
bogged down in a lot of details on things related to a small number of
targets.  Let's get the really common stuff in first.  What do you
think?

>> >  * write-back / write-through write buffers
>>
>> Do you mean for caches, or something else?
>
> https://en.wikipedia.org/wiki/Cache_%28computing%29#Writing_policies
>
> Basically, with write-through, every store is a non-temporal store (or
> temporal stores being write-through, depending on how you view it)

A write-through store isn't the same thing as a non-temporal store, at
least in my understanding of the term from X86 and AArch64.  A
non-temporal store bypasses the cache entirely.

I'm struggling a bit to understand how a compiler would make use of the
cache's write-back policy.

>> >>   class TargetSoftwarePrefetcherInfo {
>> >>     /// Should we do software prefetching at all?
>> >>     ///
>> >>     bool isEnabled() const;
>> >
>> > isEnabled sounds like something configurable at runtime.
>>
>> Currently we use it to allow some subtargets to do software prefetching
>> and prevent it for others.  I see how the name could be confusing
>> though.  Maybe ShouldDoPrefetching?
>
> isPrefetchingProfitable()?

Sounds good.

> If it is a hardware property:
> isSupported()
> (i.e., the prefetch instruction would be a no-op on other implementations)

Oh, I hadn't even thought of that possibility.

>> > Is there a way to express at which level the number of streams is
>> > shared? For instance, a core might be able to track 16 streams, but
>> > if 4 threads are running (SMT), each can only use 4.
>>
>> I suppose we could couple the streaming information to an execution
>> resource, similar to what is done with cache levels to express this kind
>> of sharing.  We haven't found a need for it but that doesn't mean it
>> wouldn't be useful for other/new targets.
>
> The example above is IBM's Blue Gene/Q processor, so yes, such targets do exist.

Ok.

>> > PowerPC's dcbt/dcbtst instructions allow explicitly specifying to the
>> > hardware which streams it should establish. Do the buffer counts
>> > include explicitly and automatically established streams? Do
>> > non-stream accesses (e.g. stack accesses) count towards the limit?
>>
>> It's up to the target maintainer to decide what the numbers mean.
>> Obviously passes have to have some notion of what things mean.  The
>> thing that establishes what a "stream" is in the user program lives
>> outside of the system model.  It may or may not consider random stack
>> accesses as part of a stream.
>>
>> This is definitely an area for exploration.  Since we only have machines
>> with two major targets, we didn't need to contend with more exotic
>> things.  :)
>
> IMHO it would be good if passes and targets agree on an interpretation
> of this number when designing the interface.

Of course.

> Again, from the Blue Gene/Q: What counts as a stream is configurable at
> runtime via a hardware register. It supports 3 settings:
> * Interpret every memory access as the start of a stream
> * Establish a stream when there are 2 consecutive cache misses
> * Only establish streams via dcbt instructions.

I think we're interpreting "streaming" differently.  In this design, a
"stream" is a sequence of memory operations that should bypass the cache
because the data will never be reused (at least not in a timely manner).

On X86 processors the compiler has complete software control over
streaming through the use of movnt instructions.  AArch64 had a
similar, though very restricted, capability prior to SVE.  dcbt is more
like a prefetch than a movnt, right?

It sounds like BG/Q has a hardware prefetcher configurable by software.
I think that would fit better under a completely different resource
type.  The compiler's use of dcbt would be guided by
TargetSoftwarePrefetcherInfo which could be extended to represent BG/Q's
configurable hardware prefetcher.

>> The intent is that getCacheLevel(0) is the L1 cache, getCacheLevel(1) is
>> the L2 cache and so on.
>
> Can passes rely on it?

Yes.
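
To make that concrete, here's a rough sketch of how a pass could lean
on that ordering guarantee.  How the pass reaches the model (the
getMemorySystemInfo() accessor below) is invented for illustration;
the getters are the proposed API::

  void sizeBlockingForL1(const TargetTransformInfo &TTI) {
    // Hypothetical accessor; the RFC doesn't yet say how passes obtain
    // the model.
    if (const TargetMemorySystemInfo *MSI = TTI.getMemorySystemInfo()) {
      if (MSI->getNumLevels() > 0) {
        // getCacheLevel(0) is the L1 cache by contract.
        const TargetCacheLevelInfo &L1 = MSI->getCacheLevel(0);
        unsigned LineBytes = L1.getLineSize();
        unsigned L1Bytes   = L1.getSize();
        // ... choose a tile size that is a multiple of LineBytes and
        // keeps the working set within some fraction of L1Bytes ...
      }
    }
  }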

>> Probably.  Most X86 implementations direct all data prefetches to the
>> same cache level so we didn't find a need to model this, but it makes
>> sense to allow for it.
>
> Again the Blue Gene/Q: Streams prefetch into the L1P cache (P for
> prefetch), but a dcbt instruction is necessary to establish the cache
> line into the L1 cache.

Yep, makes sense.

>> > Adding more specific subtargets with more refined execution models
>> > seems fine to me.  But is it reasonable to manage a database of all
>> > processors ever produced in the compiler?
>>
>> No it is not.  :)  That's why this is an open question.  We've found it
>> perfectly adequate to define a single model for each major processor
>> generation, but as I said we use a limited number of SKUs.  We will
>> need input from the community on this.
>
> Independently of whether subtargets for SKUs are added, could we
> (also) define these parameters via the command line, like xlc's
> -qcache option?

I think that would be very useful.
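
Something like the following, perhaps, using the usual cl::opt
machinery (a sketch; the flag name and semantics are invented)::

  #include "llvm/Support/CommandLine.h"
  using namespace llvm;

  // Let users override the modeled line size, in the spirit of xlc's
  // -qcache; 0 means "trust the subtarget's model."
  static cl::opt<unsigned> CacheLineSizeOverride(
      "cache-line-size", cl::init(0), cl::Hidden,
      cl::desc("Override the modeled cache line size in bytes"));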

                            -David

Re: [llvm-dev] RFC: System (cache, etc.) model for LLVM

David Jones via llvm-dev
Hey,

I've been reading back through the thread and there are a lot of ideas
flying around; I may have missed more than I should, but here's my view
on it.

First, I think this is a good idea.

Mapping caches is certainly interesting for general architectures, but
it's particularly important for massive operations like matrix multiply
and stencils, which can pull a lot of data into cache and sometimes
thrash it if one is not careful.

With scalable and larger vectors, this will be even more important.

Overall, I think this is a good idea, but the current proposal is too
detailed on the implementation and not detailed enough on the use for
me to have a good idea of how and where this will be used.

Can you describe a few situations where these new interfaces would be
used and how?

Some comments inline.


On Thu, 1 Nov 2018 at 21:56, David Greene via llvm-dev
<[hidden email]> wrote:
> Ok.  I would like to start posting patches for review without
> speculating too much on fancy/exotic things that may come later.  We
> shouldn't do anything that precludes extensions but I don't want to get
> bogged down in a lot of details on things related to a small number of
> targets.  Let's get the really common stuff in first.  What do you
> think?

In theory, both big and little cores should have the same cache
structure, so we don't necessarily need extra descriptions for both.

In practice, sub-architectures can have multiple combinations of
big.LITTLE cores and it's simply not practical to add that to
table-gen.

--
cheers,
--renato

Re: [llvm-dev] RFC: System (cache, etc.) model for LLVM

David Jones via llvm-dev
On Thu, Nov 1, 2018 at 16:56, David Greene <[hidden email]> wrote:
> Ok.  I would like to start posting patches for review without
> speculating too much on fancy/exotic things that may come later.  We
> shouldn't do anything that precludes extensions but I don't want to get
> bogged down in a lot of details on things related to a small number of
> targets.  Let's get the really common stuff in first.  What do you
> think?

I agree.

> > Again, from the Blue Gene/Q: What counts as a stream is configurable at
> > runtime via a hardware register. It supports 3 settings:
> > * Interpret every memory access as the start of a stream
> > * Establish a stream when there are 2 consecutive cache misses
> > * Only establish streams via dcbt instructions.
>
> I think we're interpreting "streaming" differently.  In this design, a
> "stream" is a sequence of memory operations that should bypass the cache
> because the data will never be reused (at least not in a timely manner).

I understood "stream" as "prefetch stream", something that prefetches
the data for an access A[i] in a for-loop.

I'd call "bypassing the cache because the data will never be reused" a
non-temporal memory access.

In the latter interpretation, what does "number of streams" mean?
AFAIU the processor will just queue memory operations (e.g. for
writing to RAM). Is it the maximum number of operations in the queue?


> >> The intent is that getCacheLevel(0) is the L1 cache, getCacheLevel(1) is
> >> the L2 cache and so on.
> >
> > Can passes rely on it?
>
> Yes.

Naively, I'd put Blue Gene/Q's L1P cache between the L1 and the L2,
i.e. the L1P would be getCacheLevel(1)  and getCacheLevel(2) would be
L2. How would you model it instead?

Michael

Re: [llvm-dev] RFC: System (cache, etc.) model for LLVM

David Jones via llvm-dev
Renato Golin <[hidden email]> writes:

> Mapping caches is certainly interesting for general architectures, but
> it's particularly important for massive operations like matrix multiply
> and stencils, which can pull a lot of data into cache and sometimes
> thrash it if one is not careful.

Exactly right.

> With scalable and larger vectors, this will be even more important.

True.

> Overall, I think this is a good idea, but the current proposal is too
> detailed on the implementation and not enough on the use for me to
> have a good idea how and where this will be used.
>
> Can you describe a few situations where these new interfaces would be
> used and how?

Sure.  The prefetching interfaces are already used, though in a
different form, by the LoopDataPrefetch pass.

The cache interfaces are flexible enough to allow passes to answer
questions like, "how much effective cache is available for this core
(thread, etc.)?"  That's a critical question to reason about the
thrashing behavior you mentioned above.

Knowing the cache line size is important for prefetching and various
other memory operations such as streaming.

Knowing the number of ways can allow one to guesstimate which memory
accesses are likely to collide in the cache.

It also happens that all of these parameters are useful for simulation
purposes, which may help projects like llvm-mca.

> On Thu, 1 Nov 2018 at 21:56, David Greene via llvm-dev
> <[hidden email]> wrote:
>> Ok.  I would like to start posting patches for review without
>> speculating too much on fancy/exotic things that may come later.  We
>> shouldn't do anything that precludes extensions but I don't want to get
>> bogged down in a lot of details on things related to a small number of
>> targets.  Let's get the really common stuff in first.  What do you
>> think?
>
> In theory, both big and little cores should have the same cache
> structure, so we don't necessarily need extra descriptions for both.
>
> In practice, sub-architectures can have multiple combinations of
> big.LITTLE cores and it's simply not practical to add that to
> table-gen.

I'm not quite grasping this.  Are you saying that a particular subtarget
may have multiple "clusters" of big.LITTLE cores and that each cluster
may look different from the others?

                                 -David

Re: [llvm-dev] RFC: System (cache, etc.) model for LLVM

David Jones via llvm-dev
Michael Kruse <[hidden email]> writes:

>> > Again, from the Blue Gene/Q: What counts as a stream is configurable at
>> > runtime via a hardware register. It supports 3 settings:
>> > * Interpret every memory access as the start of a stream
>> > * Establish a stream when there are 2 consecutive cache misses
>> > * Only establish streams via dcbt instructions.
>>
>> I think we're interpreting "streaming" differently.  In this design, a
>> "stream" is a sequence of memory operations that should bypass the cache
>> because the data will never be reused (at least not in a timely manner).
>
> I understood "stream" as "prefetch stream", something that prefetches
> the data for an access A[i] in a for-loop.
>
> I'd call "bypassing the cache because the data will never be reused" a
> non-temporal memory access.

Yes, I agree the terminology is confusing.  I used the term "stream" in
the sense of stream processing (https://en.wikipedia.org/wiki/Stream_processing).
The programming model is very different, of course, but the idea of a
stream of data that is acted upon and then essentially discarded is
similar.

> In the latter interpretation, what does "number of streams" mean?
> AFAIU the processor will just queue memory operations (e.g. for
> writing to RAM). Is it the maximum number of operations in the queue?

On X86 NT writes are to so-called "write-combined memory."
Hardware-wise, that translates to some number of buffers where stores
are collected and merged, resulting in a (hopefully) single cache
line-sized write to memory that aggregates some number of individual
stores.  The number of buffers limits the number of independent sets of
stores that can be active.  For example, if the hardware has four such
buffers, I can do this and be fine:

for (...)
  A[i] = ...
  B[i] = ...
  C[i] = ...
  D[i] = ...

The sequence of (non-temporal) stores to each array will map to separate
hardware buffers and be write-combined in the way one would expect.  But
this causes a problem:

for (...)
  A[i] = ...
  B[i] = ...
  C[i] = ...
  D[i] = ...
  E[i] = ...

If all of the stores are non-temporal, then at least two of the array
store sequences will interfere with each other in the write-combining
buffers and will force early flushes of the buffer, effectively turning
them into single-store writes to memory.  That's bad.

Maybe the proper name for this concept is simply "WriteCombiningBuffer."
I'm not sure if some other architecture might have a concept of store
buffers that does something other than write-combining, so I was trying
to use a fairly generic name to mean, "some compiler-controlled hardware
buffer."

There's a similar concept on the load side though I don't know if any
existing processors actually implement things that way.  I know of
(academic) architectures where prefetches fill independent prefetch
buffers and one wouldn't want to prefetch too many different things
because they would start filling each others' buffers.  That kind of
behavior could be captured by this model.

The key factor is contention.  There's a limited hardware memory buffer
resource and compilers have to be careful not to oversubscribe it.  I
don't know what the right name for it is.  Probably there will be more
than one such type of resource for some architectures, so for now maybe
we just model write-combining buffers and leave it at that.  If other
such resources pop up we can model them with different names.

I think that's what we should do.
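
As a sketch of how a pass might consume that, with the surrounding
logic made up but the getters taken from the proposal::

  // Only mark a loop's stores non-temporal when every independent
  // store stream can own a write-combining buffer; oversubscription
  // forces early flushes (the five-array case above).
  bool canUseNonTemporalStores(const TargetMemorySystemInfo &MSI,
                               unsigned NumStoreStreams) {
    const TargetStreamBufferInfo *SBI = MSI.getStreamBufferInfo();
    if (!SBI)
      return false;  // no model -> assume nothing about the hardware
    return NumStoreStreams <= SBI->getNumStoreBuffers();
  }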

>> >> The intent is that getCacheLevel(0) is the L1 cache, getCacheLevel(1) is
>> >> the L2 cache and so on.
>> >
>> > Can passes rely on it?
>>
>> Yes.
>
> Naively, I'd put Blue Gene/Q's L1P cache between the L1 and the L2,
> i.e. the L1P would be getCacheLevel(1)  and getCacheLevel(2) would be
> L2. How would you model it instead?

That seems ok to me.  As I understand it, L1P is a little awkward in
that L2 data doesn't get moved to L1P, it gets moved to L1.  L1P is
really a prefetch buffer, right?  One wouldn't do, say, cache blocking
for L1P.  In that sense maybe modeling it as a cache level isn't the
right thing.

How does software make use of L1P?  I understand compilers can insert
data prefetches and the data resides in L1P, presumably until it is
accessed and then it moves to L1.  I suppose the size of L1P could
determine how aggressively compilers prefetch.  Is that the idea or are
you thinking of something else?

                            -David

Re: [llvm-dev] RFC: System (cache, etc.) model for LLVM

David Jones via llvm-dev
On Mon, 5 Nov 2018 at 15:56, David Greene <[hidden email]> wrote:

> The cache interfaces are flexible enough to allow passes to answer
> questions like, "how much effective cache is available for this core
> (thread, etc.)?"  That's a critical question to reason about the
> thrashing behavior you mentioned above.
>
> Knowing the cache line size is important for prefetching and various
> other memory operations such as streaming.
>
> Knowing the number of ways can allow one to guesstimate which memory
> accesses are likely to collide in the cache.
>
> It also happens that all of these parameters are useful for simulation
> purposes, which may help projects like llvm-mca.

I see.

So, if I got it right, initially this would consolidate the prefetching
infrastructure, which is a worthy goal in itself and would require a
minimalist implementation for now.

But later, vectorisers could use that info, for example, to understand
how much it would be beneficial to unroll vectorised loops (where total
access size should be a multiple of the cache line), etc.

Ultimately, simulations would be an interesting use of it, but
shouldn't be a driving force for additional features bundled into the
initial design.


> I'm not quite grasping this.  Are you saying that a particular subtarget
> may have multiple "clusters" of big.LITTLE cores and that each cluster
> may look different from the others?

Yeah, "big.LITTLE" [1] is a marketing name and can mean a bunch of
different scenarios.

For example:
 - List of big+little cores seen by the kernel as a single core but
actually being two separate cores, and scheduled by the kernel via
frequency scaling.
 - Two entirely separate clusters flipped between all big or all little
 - Heterogeneous mix, which could have different numbers of big and
little cores with no need for cache coherence between them. Junos have
two little and four big, Tegras have one little and four big. There
are also other designs with dozens of huge cores plus a tiny core for
management purposes.

But it's worse, because different releases of the same family can have
different core counts, change model (clustered/bundled/heterogeneous)
and there's no way to currently represent that in table-gen.

Given that the kernel has such a high influence on how those cores get
scheduled and preempted, I don't think there's any hope that the
compiler can do a good job at predicting usage or having any real
impact amidst higher-level latency, such as context switches and
system calls.

--
cheers,
--renato

[1] https://en.wikipedia.org/wiki/ARM_big.LITTLE

Re: [llvm-dev] RFC: System (cache, etc.) model for LLVM

David Jones via llvm-dev
Renato Golin via llvm-dev <[hidden email]> writes:

> So, if I got it right, initially this would consolidate the prefetching
> infrastructure, which is a worthy goal in itself and would require a
> minimalist implementation for now.

Yes.

> But later, vectorisers could use that info, for example, to understand
> how much it would be beneficial to unroll vectorised loops (where total
> access size should be a multiple of the cache line), etc.

Exactly!
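
A sketch of that heuristic, where everything but getLineSize() is
invented for illustration::

  // Pick an unroll factor so each unrolled iteration of a vectorized
  // loop touches a whole number of cache lines.  VectorBytes is the
  // vector width in bytes (VF times the element size).
  unsigned pickUnrollFactor(const TargetCacheLevelInfo &L1,
                            unsigned VectorBytes) {
    unsigned LineBytes = L1.getLineSize();
    if (VectorBytes == 0 || VectorBytes >= LineBytes)
      return 1;                      // one vector already covers a line
    return LineBytes / VectorBytes;  // e.g. 64-byte line / 16-byte
                                     // vector -> unroll by 4
  }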

> Ultimately, simulations would be an interesting use of it, but
> shouldn't be a driving force for additional features bundled into the
> initial design.

I agree simulation isn't the primary motivation, but it's a nice
side-effect.

We use all of these parameters today, so they are useful.

> But it's worse, because different releases of the same family can have
> different core counts, change model (clustered/bundled/heterogeneous)
> and there's no way to currently represent that in table-gen.

Yes, this is exactly the SKU problem I mentioned.  I don't have a good
solution for that other than to say that we've found that a generic
model per major subtarget can work well enough across different SKUs.
As currently constructed, the model is intended to be a resource for
heuristics, so getting things wrong is "just" a performance hit.

I guess it would be up to the people interested in a particular target
to figure out a reasonable, maintainable way to manage models for
possibly many subtargets.  This proposal is about providing
infrastructure to allow models to be created with not too much effort.
It doesn't say anything about what models for a particular
target/subtarget should look like.  :)

> Given that the kernel has such a high influence on how those cores get
> scheduled and preempted, I don't think there's any hope that the
> compiler can do a good job at predicting usage or having any real
> impact amidst higher-level latency, such as context switches and
> system calls.

Sure.  In those cases a model isn't that useful.  Not every subtarget
needs to have a model.  Alternatively, a simple "dumb" model could be
used for such targets, setting prefetch parameters, etc. to something
not totally outrageous.  The prefetcher, for example, would have to
check if a model exists.  If not, it wouldn't prefetch.
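
In code, the guard might look something like this (the subtarget
accessor is an assumption; the getters are from the proposal)::

  // "No model, no prefetch": bail out unless a model exists and says
  // software prefetching pays off on this subtarget.
  const TargetSoftwarePrefetcherInfo *PFI = nullptr;
  if (const TargetMemorySystemInfo *MSI = ST.getMemorySystemInfo())
    PFI = MSI->getSoftwarePrefetcherInfo();
  if (!PFI || !PFI->isEnabled())
    return;
  // ... otherwise insert prefetches guided by PFI's parameters ...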

                               -David

Re: [llvm-dev] RFC: System (cache, etc.) model for LLVM

David Jones via llvm-dev
On Mon, 5 Nov 2018 at 19:04, David Greene <[hidden email]> wrote:
> I guess it would be up to the people interested in a particular target
> to figure out a reasonable, maintainable way to manage models for
> possibly many subtargets.  This proposal is about providing
> infrastructure to allow models to be created with not too much effort.
> It doesn't say anything about what models for a particular
> target/subtarget should look like.  :)

Exactly! :)

--
cheers,
--renato

Re: [llvm-dev] RFC: System (cache, etc.) model for LLVM

David Jones via llvm-dev
Hi David,

I like your idea of providing a framework to describe the cache hierarchy/prefetchers etc.
I can definitely think of a couple of ways to reuse that bit of information in llvm-mca.

I have a few questions about the general design.
It looks like in your model, you don't describe whether a cache is inclusive or exclusive. I am not sure if that is useful for the algorithms that you have in mind. However, in some processors, the LLC (last level cache) is exclusive, and doesn't include lines from the lower levels. You could potentially have a mix of inclusive/exclusive shared/private caches; you may want to model those aspects.

When you mention "write combining", you do that in the context of streaming buffers used by nontemporal operations.
On x86, a write combining buffer allows temporally adjacent stores to be combined together. Those stores would bypass the cache and be committed together as a single store transaction. Write combining may fail for a number of reasons; for example: there may be alignment/size requirements; stores are not allowed to overlap; etc. Describing all those constraints could be problematic (and maybe outside of the scope of what you want to do). I guess it is unclear at this stage (at least to me) how much information is required in practice.

Ideally, it would be nice to have the concept of "memory type", and map memory types to resources/memory consistency models. Not sure if there is already a way to do that mapping, nor if it would improve your existing framework. In theory, you could add the ability to describe constraints/resources for memory types, and then have users define how memory operations map to different types. That information would then drive the algorithm/cost model that computes resource allocation/schedule. However, I may be thinking too much about possible use cases ;-).

That being said, I like having extra information to feed to the compiler (and llvm-mca :-)). Let's see what other people think about it.

Thanks!
Andrea

On Mon, Nov 5, 2018 at 10:24 PM Renato Golin via llvm-dev <[hidden email]> wrote:
> On Mon, 5 Nov 2018 at 19:04, David Greene <[hidden email]> wrote:
> > I guess it would be up to the people interested in a particular target
> > to figure out a reasonable, maintainable way to manage models for
> > possibly many subtargets.  This proposal is about providing
> > infrastructure to allow models to be created with not too much effort.
> > It doesn't say anything about what models for a particular
> > target/subtarget should look like.  :)
>
> Exactly! :)
>
> --
> cheers,
> --renato

Re: [llvm-dev] RFC: System (cache, etc.) model for LLVM

David Jones via llvm-dev
Andrea Di Biagio <[hidden email]> writes:

> Hi David,

Hi Andrea!

> I like your idea of providing a framework to describe the cache
> hierarchy/prefetchers etc.
> I can definitely think of a couple of ways to reuse that bit of
> information in llvm-mca.

Great!  I was hoping it would be useful for that.

> I have a few questions about the general design.

Thank you so much for your feedback.

> It looks like in your model, you don't describe whether a cache is
> inclusive or exclusive. I am not sure if that is useful for the
> algorithms that you have in mind. However, in some processors, the LLC
> (last level cache) is exclusive, and doesn't include lines from the
> lower levels. You could potentially have a mix of inclusive/exclusive
> shared/private caches; you may want to model those aspects.

Yes, I can see for simulation purposes that would be useful.  It's not
so useful in the compiler cache optimizations I have in mind because
everything is a heuristic anyway and details like inclusivity are
basically noise.

It would certainly be possible to add an inclusive/exclusive property on
a cache level.  I think that having the property implicitly reference
the next level up would make sense (e.g. L3 is inclusive of L2 or L2 is
exclusive of L1).  Then if, say, L3 is inclusive of L2 and L2 is
inclusive of L1, one could by transitivity reason that L3 is inclusive
of L1.  What do you think?
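
A sketch of the transitive query that would fall out of that;
isInclusivePrevLevel() is a hypothetical addition, not part of the
RFC as posted::

  // Does cache level Outer include everything in level Inner?  Each
  // level records only its relationship to the next level down; the
  // transitive answer comes from walking the chain.
  bool isInclusiveOf(const TargetMemorySystemInfo &MSI,
                     unsigned Outer, unsigned Inner) {
    assert(Outer > Inner && "Outer must be the farther-from-core level");
    for (unsigned L = Outer; L > Inner; --L)
      if (!MSI.getCacheLevel(L).isInclusivePrevLevel())
        return false;  // one non-inclusive link breaks the chain
    return true;       // e.g. L3 incl. L2, L2 incl. L1 => L3 incl. L1
  }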

> When you mention "write combining", you do that in the context of
> streaming buffers used by nontemporal operations.
> On x86, a write combining buffer allows temporally adjacent stores to be
> combined together. Those stores would bypass the cache and be committed
> together as a single store transaction. Write combining may fail for a
> number of reasons; for example: there may be alignment/size
> requirements; stores are not allowed to overlap; etc. Describing all
> those constraints could be problematic (and maybe outside of the scope
> of what you want to do). I guess, it is unclear at this stage (at
> least to me) how much information is required in practice.

So far what I've outlined is what we've found to be sufficient for the
compiler optimizations we have, but again, for simulator modeling much
more would be desired.  On the X86, only memory mapped as WC
(write-combining) in the MTRRs actually uses the buffers, in addition to
NT stores to WB (write-back) memory.  I believe most user memory is
mapped WB so general stores won't use the write-combining buffers.
There's an ordering issue with using write-combining and it wouldn't be
safe to use it in general code without being very careful and inserting
lots of fences.

Modeling the use of MTRRs to represent different types of memory is out
of scope for this proposal but could be added later, I think.

The key thing I'm trying to model with the stream/write-combining
buffers is the limited buffer resources to handle streams of NT
operations.  These same resources would be used for WC-typed memory so
they could apply to more than NT operations.  That could be important for
compilers that deal with code that wants to heavily use write-combining
(like manipulating video output, for example).

> Ideally, it would be nice to have the concept of "memory type", and
> map memory types to resources/memory consistency models. Not sure if
> there is already a way to do that mapping, nor if it would improve
> your existing framework. In theory, you could add the ability to
> describe constraints/resources for memory types, and then have users
> define how memory operations map to different types. That information
> would then drive the algorithm/cost model that computes resource
> allocation/schedule. However, I may be thinking too much about
> possible use cases ;-).

Yeah, that would be nice for a number of use cases.  That level of
detail is beyond the scope of the current work but it's an interesting
idea and I'll certainly keep it in mind as I work through this.

> That being said, I like having extra information to feed to the
> compiler (and llvm-mca :-)). Let's see what other people think about it.

So far it looks like positive responses with questions around API naming
and such.  I haven't heard anything show-stopping yet.  Hopefully I'll
get a chance to start posting patches soon.

                           -David

Re: [llvm-dev] RFC: System (cache, etc.) model for LLVM

David Jones via llvm-dev
On Mon, Nov 5, 2018 at 10:26, David Greene <[hidden email]> wrote:

> Yes, I agree the terminology is confusing.  I used the term "stream" in
> the sense of stream processing (https://en.wikipedia.org/wiki/Stream_processing).
> The programming model is very different, of course, but the idea of a
> stream of data that is acted upon and then essentially discarded is
> similar.
>
> > In the latter interpretation, what does "number of streams" mean?
> > AFAIU the processor will just queue memory operations (e.g. for
> > writing to RAM). Is it the maximum number of operations in the queue?
>
> On X86 NT writes are to so-called "write-combined memory."
> Hardware-wise, that translates to some number of buffers where stores
> are collected and merged, resulting in a (hopefully) single cache
> line-sized write to memory that aggregates some number of individual
> stores.  The number of buffers limits the number of independent sets of
> stores that can be active.  For example, if the hardware has four such
> buffers, I can do this and be fine:
>
> for (...)
>   A[i] = ...
>   B[i] = ...
>   C[i] = ...
>   D[i] = ...
>
> The sequence of (non-temporal) stores to each array will map to separate
> hardware buffers and be write-combined in the way one would expect.  But
> this causes a problem:
>
> for (...)
>   A[i] = ...
>   B[i] = ...
>   C[i] = ...
>   D[i] = ...
>   E[i] = ...
>
> If all of the stores are non-temporal, then at least two of the array
> store sequences will interfere with each other in the write-combining
> buffers and will force early flushes of the buffer, effectively turning
> them into single-store writes to memory.  That's bad.
>
> Maybe the proper name for this concept is simply "WriteCombiningBuffer."
> I'm not sure if some other architecture might have a concept of store
> buffers that does something other than write-combining, so I was trying
> to use a fairly generic name to mean, "some compiler-controlled hardware
> buffer."
>
> There's a similar concept on the load side though I don't know if any
> existing processors actually implement things that way.  I know of
> (academic) architectures where prefetches fill independent prefetch
> buffers and one wouldn't want to prefetch too many different things
> because they would start filling each others' buffers.  That kind of
> behavior could be captured by this model.
>
> The key factor is contention.  There's a limited hardware memory buffer
> resource and compilers have to be careful not to oversubscribe it.  I
> don't know what the right name for it is.  Probably there will be more
> than one such type of resource for some architectures, so for now maybe
> we just model write-combining buffers and leave it at that.  If other
> such resources pop up we can model them with different names.
>
> I think that's what we should do.

Thank you for the detailed explanation. We could use a notion of
"sustainable stream", i.e. the maximum number of (consecutive?)
read/write streams that a processor can support before a
disproportionate loss in performance happens. This is oblivious to the
reason why that performance loss happens, be it write combining
buffers or prefetch streams. If there are multiple such bottlenecks, it
would be the minimum of such streams. At the moment I cannot think of
an optimization where the difference matters (which doesn't mean there
isn't a case where it does).


> That seems ok to me.  As I understand it, L1P is a little awkward in
> that L2 data doesn't get moved to L1P, it gets moved to L1.  L1P is
> really a prefetch buffer, right?  One wouldn't do, say, cache blocking
> for L1P.  In that sense maybe modeling it as a cache level isn't the
> right thing.

The L1P (4 KiB) is smaller than the L1 cache (16 KiB), so blocking
indeed makes no sense.

But when optimizing for it, I could not just ignore it. However, maybe
we should leave it out of our API considerations. The Blue Gene/Q is
being phased out and I know of no other architecture with such a cache
hierarchy.


> How does software make use of L1P?  I understand compilers can insert
> data prefetches and the data resides in L1P, presumably until it is
> accessed and then it moves to L1.  I suppose the size of L1P could
> determine how aggressively compilers prefetch.  Is that the idea or are
> you thinking of something else?

I declared streams for the CPU to prefetch (which 'run' at different
speeds over the memory), which, at some point in time I can assume to
be in the L1P cache. Using the dcbt instruction, the cache line can be
lifted from the L1P to the L1 cache, a fixed number of cycles in
advance. If the cache line had to be prefetched from L2, the
prefetch/access latency would be longer (24 cycles vs 82 cycles).
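
For reference, that is essentially the usual prefetch-distance
computation; a simplified sketch, with the function and its inputs
invented::

  // Iterations of lookahead needed so the line arrives in time:
  //   distance = ceil(PrefetchLatencyCycles / CyclesPerIteration)
  unsigned prefetchDistance(unsigned LatencyCycles, unsigned CyclesPerIter) {
    assert(CyclesPerIter && "loop body must take at least one cycle");
    return (LatencyCycles + CyclesPerIter - 1) / CyclesPerIter;
  }

so a 24-cycle L1P-to-L1 lift needs far less lookahead than an 82-cycle
fetch from L2.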

Michael

Re: [llvm-dev] RFC: System (cache, etc.) model for LLVM

David Jones via llvm-dev


On Wed, Nov 7, 2018 at 10:26 PM David Greene <[hidden email]> wrote:
> It would certainly be possible to add an inclusive/exclusive property
> on a cache level.  I think that having the property implicitly
> reference the next level up would make sense (e.g. L3 is inclusive of
> L2 or L2 is exclusive of L1).  Then if, say, L3 is inclusive of L2 and
> L2 is inclusive of L1, one could by transitivity reason that L3 is
> inclusive of L1.  What do you think?

That would be nice to have. Thanks.

> So far what I've outlined is what we've found to be sufficient for the
> compiler optimizations we have, but again, for simulator modeling much
> more would be desired.  [...]  There's an ordering issue with using
> write-combining and it wouldn't be safe to use it in general code
> without being very careful and inserting lots of fences.

Yeah. Mine was just a curiosity; the term "write combining" is kind of
overloaded. Depending on the context it assumes a different meaning.
I don't think it is a problem if write combining memory is outside of
the scope of this work.

> So far it looks like positive responses with questions around API
> naming and such.  I haven't heard anything show-stopping yet.
> Hopefully I'll get a chance to start posting patches soon.

I look forward to seeing your patches soon.

Cheers,
Andrea

Re: [llvm-dev] RFC: System (cache, etc.) model for LLVM

David Jones via llvm-dev
Michael Kruse <[hidden email]> writes:

> Thank you for the detailed explanation. We could use a notion of
> "sustainable stream", i.e. the maximum number of (consecutive?)
> read/write streams that a processor can support before a
> disproportionate loss in performance happens. This is oblivious to the
> reason why that performance loss happens, be it write combining
> buffers or prefetch streams. If there are multiple such bottlenecks, it
> would be the minimum of such streams. At the moment I cannot think of
> an optimization where the difference matters (which doesn't mean there
> isn't a case where it does).

What about load prefetching vs. non-temporal stores on X86?  There's a
limited number of write-combining buffers but prefetches "just" use the
regular load paths.  Yes, there's a limited number of load buffers but I
would expect the number of independent prefetch streams one would
want could differ substantially from the number of independent
non-temporal store streams one would want, and you wouldn't want the
minimum for one to apply to the other.

I like the idea of abstracting the hardware resource for the compiler's
needs, though I think we will in general want multiple such things.
Maybe one for load and one for store to start?  For more hardware-y
things like llvm-mca more detail may be desired.
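
Concretely, a pass could keep the two budgets separate; a sketch using
the getters the RFC already defines::

  // Bound load-side streams (prefetches) and store-side streams
  // (non-temporal stores) independently, so one limit never constrains
  // the other.
  unsigned maxStreams(const TargetStreamBufferInfo &SBI, bool IsStore) {
    return IsStore ? SBI.getNumStoreBuffers()  // e.g. WC buffers on X86
                   : SBI.getNumLoadBuffers();  // e.g. prefetch streams
  }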

>> That seems ok to me.  As I understand it, L1P is a little awkward in
>> that L2 data doesn't get moved to L1P, it gets moved to L1.  L1P is
>> really a prefetch buffer, right?  One wouldn't do, say, cache blocking
>> for L1P.  In that sense maybe modeling it as a cache level isn't the
>> right thing.
>
> The L1P (4 KiB) is smaller than the L1 cache (16 KiB), so blocking
> indeed makes no sense.
>
> But when optimizing for it, I could not just ignore it. However, maybe
> we should leave it out of our API considerations. The Blue Gene/Q is
> being phased out and I know of no other architecture with such a cache
> hierarchy.

Ok.  See more below.

>> How does software make use of L1P?  I understand compilers can insert
>> data prefetches and the data resides in L1P, presumably until it is
>> accessed and then it moves to L1.  I suppose the size of L1P could
>> determine how aggressively compilers prefetch.  Is that the idea or are
>> you thinking of something else?
>
> I declared streams for the CPU to prefetch (which 'run' at different
> speeds over the memory), which, at some point in time I can assume to
> be in the L1P cache. Using the dcbt instruction, the cache line can be
> lifted from the L1P to the L1 cache, a fixed number of cycles in
> advance. If the cache line had to be prefetched from L2, the
> prefetch/access latency would be longer (24 cycles vs 82 cycles).

Ok, I understand better now, thanks.  L1P really is a prefetch buffer
but there's software control to move it to faster cache if desired.
Should we model it as part of the prefetching API?

                              -David

Re: [llvm-dev] RFC: System (cache, etc.) model for LLVM

David Jones via llvm-dev

On 11/08/2018 10:35 AM, David Greene via llvm-dev wrote:

> Michael Kruse <[hidden email]> writes:
>
>> Thank you for the detailed explanation. We could use a notion of
>> "sustainable stream", i.e. the maximum number of (consecutive?)
>> read/write streams that a processor can support before a
>> disproportionate loss in performance happens. This is oblivious to the
>> reason why that performance loss happens, be it write combining
>> buffers or prefetch streams. If there are multiple such bottlenecks, it
>> would be the minimum of such streams. At the moment I cannot think of
>> an optimization where the difference matters (which doesn't mean there
>> isn't a case where it does).
> What about load prefetching vs. non-temporal stores on X86?  There's a
> limited number of write-combining buffers but prefetches "just" use the
> regular load paths.  Yes, there's a limited number of load buffers but I
> would expect the number of independent prefetch streams one would
> want could differ substantially from the number of independent
> non-temporal store streams one would want, and you wouldn't want the
> minimum for one to apply to the other.
>
> I like the idea of abstracting the hardware resource for the compiler's
> needs, though I think we will in general want multiple such things.
> Maybe one for load and one for store to start?  For more hardware-y
> things like llvm-mca more detail may be desired.
>
>>> That seems ok to me.  As I understand it, L1P is a little awkward in
>>> that L2 data doesn't get moved to L1P, it gets moved to L1.  L1P is
>>> really a prefetch buffer, right?  One wouldn't do, say, cache blocking
>>> for L1P.  In that sense maybe modeling it as a cache level isn't the
>>> right thing.
>> The L1P (4 KiB) is smaller than the L1 cache (16 KiB), so blocking
>> indeed makes no sense.
>>
>> But when optimizing for it, I could not just ignore it. However, maybe
>> we should leave it out of our API considerations. The Blue Gene/Q is
>> being phased out and I know of no other architecture with such a cache
>> hierarchy.
> Ok.  See more below.
>
>>> How does software make use of L1P?  I understand compilers can insert
>>> data prefetches and the data resides in L1P, presumably until it is
>>> accessed and then it moves to L1.  I suppose the size of L1P could
>>> determine how aggressively compilers prefetch.  Is that the idea or are
>>> you thinking of something else?
>> I declared streams for the CPU to prefetch (which 'run' at different
>> speeds over the memory), which, at some point in time I can assume to
>> be in the L1P cache. Using the dcbt instruction, the cache line can be
>> lifted from the L1P to the L1 cache, a fixed number of cycles in
>> advance. If the cache line had to be prefetched from L2, the
>> prefetch/access latency would be longer (24 cycles vs 82 cycles).
> Ok, I understand better now, thanks.  L1P really is a prefetch buffer
> but there's software control to move it to faster cache if desired.
> Should we model it as part of the prefetching API?

At this point, I'd not base any API-structuring decisions on the BG/Q
specifically. The generic feature that might be worth modeling is: Into
what level of cache does automated prefetching take place? I know of
several architectures that don't do automated prefetching into the L1,
but only into the L2 (or similar).

 -Hal

>
>                               -David

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


Re: [llvm-dev] RFC: System (cache, etc.) model for LLVM

David Jones via llvm-dev
"Finkel, Hal J." <[hidden email]> writes:

>> Ok, I understand better now, thanks.  L1P really is a prefetch buffer
>> but there's software control to move it to faster cache if desired.
>> Should we model it as part of the prefetching API?
>
> At this point, I'd not base any API-structuring decisions on the BG/Q
> specifically. The generic feature that might be worth modeling is: Into
> what level of cache does automated prefetching take place? I know of
> several architectures that don't do automated prefetching into the L1,
> but only into the L2 (or similar).

Ok, that shouldn't be difficult to add.
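
One possible shape for that, following the interface style of the RFC
(the method is hypothetical and only sketches the extension under
discussion)::

  class TargetSoftwarePrefetcherInfo {
    [...]

  public:
    /// getHardwarePrefetchLevel - Return the ID of the cache level that
    /// automated (hardware) prefetches fill, e.g. the L2 on targets
    /// that do not prefetch into the L1.
    unsigned getHardwarePrefetchLevel() const;
  };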

                             -David

Re: [llvm-dev] RFC: System (cache, etc.) model for LLVM

David Jones via llvm-dev
On Thu, Nov 8, 2018 at 10:36, David Greene <[hidden email]> wrote:

> What about load prefetching vs. non-temporal stores on X86?  There's a
> limited number of write-combining buffers but prefetches "just" use the
> regular load paths.  Yes, there's a limited number of load buffers but I
> would expect the number of independent prefetch streams one would
> want could differ substantially from the number of independent
> non-temporal store streams one would want, and you wouldn't want the
> minimum for one to apply to the other.
>
> I like the idea of abstracting the hardware resource for the compiler's
> needs, though I think we will in general want multiple such things.
> Maybe one for load and one for store to start?  For more hardware-y
> things like llvm-mca more detail may be desired.

Your RFC already has getNumStoreBuffers, getNumLoadBuffers and
getNumLoadStoreBuffers, no? As far as I understand, write-combining only
applies to getNumStoreBuffers(). Prefetch streams would limit
getNumLoadBuffers.

Michael