codeplaysoftware / standards-proposals
Repository for publicly sharing proposals in various standards groups
License: Apache License 2.0
Following on from our discussion regarding the lifetime of execution resources, I have recently been thinking about how we should define equality of execution resources.
It would be very useful for users to be able to compare one execution resource against another, to check whether they point to the same underlying resource. However, this raises another question: should execution resources be required to be consistent identifiers?
For example, if you were to discover the system topology multiple times, should the same hardware resources always be represented by comparable execution resources? Or if you were to construct a particular type of execution context that does not require construction from an execution resource, but can return an execution resource, should it be possible to compare this resource against the equivalent from a system topology discovery?
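One way the "consistent identifier" option could work is sketched below. This is purely a hypothetical illustration, not part of any proposal: it assumes each resource wraps a stable platform identifier (for example an hwloc object index), so that resources obtained from two separate topology discoveries compare equal when they denote the same hardware.

```cpp
#include <cstdint>

// Hypothetical sketch: an execution resource as a lightweight handle wrapping
// a stable platform identifier. If the identifier is stable across topology
// discoveries, equality can compare underlying hardware identity rather than
// C++ object identity.
class execution_resource {
public:
  explicit execution_resource(std::uint64_t native_id) : id_(native_id) {}

  friend bool operator==(const execution_resource& a,
                         const execution_resource& b) {
    return a.id_ == b.id_;  // identity of the underlying hardware
  }
  friend bool operator!=(const execution_resource& a,
                         const execution_resource& b) {
    return !(a == b);
  }

private:
  std::uint64_t id_;  // assumed stable across repeated topology discoveries
};
```

Under this model, two discoveries of the same core yield handles that compare equal, which is the behaviour the question above asks whether we should require.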
The current interface for affinity queries supports querying latency, bandwidth, capacity and power consumption. However, there is currently no way to query the relationship between an execution resource and a memory resource for memory region properties, such as whether they support atomic operations for concurrent access to the memory.
We may be able to support this with the existing affinity query interface, though this will need to be investigated.
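As a starting point for that investigation, here is a minimal sketch of what such a relationship query might look like. All names here (`memory_access_property`, `affinity_query`, the struct fields) are assumptions for illustration, not proposed interface:

```cpp
// Hypothetical sketch: extending the affinity query interface with
// relationship properties between an execution resource and a memory
// resource, such as whether concurrent atomic access is supported.
enum class memory_access_property {
  atomics_supported,  // concurrent atomics across this pairing
  cache_coherent      // coherent caching between the two resources
};

struct execution_resource { int id; };
struct memory_resource   { int id; bool atomics; bool coherent; };

// A query in the style of the existing latency/bandwidth queries, but
// returning whether the given property holds for this resource pairing.
inline bool affinity_query(const execution_resource&,
                           const memory_resource& mem,
                           memory_access_property p) {
  switch (p) {
    case memory_access_property::atomics_supported: return mem.atomics;
    case memory_access_property::cache_coherent:    return mem.coherent;
  }
  return false;
}
```

The open question is whether boolean relationship properties like these fit the existing magnitude-based affinity query interface, or need a separate query.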
The background section is still missing discussions about NUMA architectures, Chapel and other PGAS models from various discussions on these topics. We should add further discussion of these for the next revision of the paper.
The Affinity execution_context has asymmetry in that it is constructed from an execution resource but supplies a memory resource. Should the relationship be between execution resources and memory resources without an intervening execution context?
Differences in the intersection of papers Affinity - D0796r2 and Context - P0737r0.
std::thread specific resource: Affinity defines execution_resource, which is implied to be an execution resource that executes std::thread. Context defines thread_execution_resource_t, which is explicit about executing std::thread. Question: Should we lay the foundation for non-std::thread execution resources (e.g., GPU) by embedding the name thread in the initially proposed execution resource?
Affinity execution_resource is moveable and copyable; Context thread_execution_resource_t is neither moveable nor copyable. Question: Should an execution resource be a PIMPL (pointer to implementation) value type or an immutable reference? Creating and managing vectors of execution resources requires PIMPL.
Affinity has std::vector<resource> execution_resource::resources() const noexcept; and Context has const thread_execution_resource_t& thread_execution_resource_t::partition(size_t i) const noexcept;. Related to the PIMPL question.
Affinity has std::vector<execution_resource> this_system::resources() noexcept; and Context has thread_execution_resource_t program_thread_execution_resource;. Question: Should there be a root execution resource that is a handle to the union of all individual execution resources? A root execution resource, by definition, has a vector of nested resources which is equivalent to the Affinity proposal's root vector of resources.
Affinity execution_context has name(). Is this for the type of resource, to denote the topological identity of the resource, or both? Context avoided this design point for the initial proposal.
Affinity can_place_* methods open the design point of placement without addressing how to query if placement has occurred or how placement can be performed. Context avoided this design point in the initial proposal.
Affinity execution_context is constructed from an execution_resource without additional properties, implying a 1-to-1 correspondence. Context execution context (concept) construction is undefined.
In the Rapperswil feedback, it was suggested that the bulk_execution_affinity interface should be extended to also incorporate the unit of displacement; for example, whether to scatter by core or by socket.
This could be introduced via a new property or a parameter to the existing property.
Though we should also consider that introducing this may require us to enumerate types of execution resource, such as threads, cores and sockets.
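To make the trade-off concrete, here is a sketch of the "parameter to the existing property" option. Everything here is hypothetical (the enumerator names, the factory-style call syntax), and it deliberately demonstrates the concern above: the parameter forces us to enumerate resource kinds.

```cpp
// Hypothetical sketch: parameterising a bulk_execution_affinity-style
// scatter property with a unit of displacement. Note the enumeration of
// resource kinds below is exactly the concern raised above: the parameter
// requires naming types of execution resource.
enum class affinity_unit { thread, core, socket };

struct bulk_execution_affinity_scatter {
  affinity_unit unit = affinity_unit::core;  // assumed default granularity
  // Property "factory": scatter(affinity_unit::socket) yields a scatter
  // property whose displacement unit is the socket.
  bulk_execution_affinity_scatter operator()(affinity_unit u) const {
    return bulk_execution_affinity_scatter{u};
  }
};

inline constexpr bulk_execution_affinity_scatter scatter_default{};
```

Usage would then read `require(ex, scatter_default(affinity_unit::socket))` in the style of the properties mechanism, though whether that mechanism admits parameterised properties like this is exactly what needs investigating.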
It was suggested by SG1 at the Prague meeting that we should change the paper name to be more inclusive of other domains as well as heterogeneous and distributed computing. Perhaps we could change it simply to "System topology discovery for C++"?
There was feedback from Jacksonville that not all users would want to dive into the fine-grained work of querying a system's topology and manually binding resources, and that many users would instead want a higher-level descriptive interface which allows the implementation to decide how to allocate resources based on the user's requirements. There were concerns that we don't yet clearly present what a high-level interface for affinity would look like.
For Rapperswil we should provide clarity on the different levels of interface that we are proposing and their target users, and also on how we envision a high-level interface for affinity in C++ looking, preferably with examples.
One aspect of the feedback from SG1 on P1795r1 at Belfast was that we need to demonstrate how the abstract topology discovery interface proposed would work in practice, and provide some examples of how properties of the system topology could be used generically within applications.
I think one of the first steps in this is to identify the potential abstract properties or queries that could be expressed generically, i.e. not pertaining to any particular kind of processor or system component.
So far I have the following list:
We don't need to propose all of these properties now, but we can prepare a provisional list of properties or queries for expository purposes, to demonstrate how algorithms could take advantage of this interface.
There was feedback from Jacksonville suggesting that we add a way to retrieve the execution resource or execution context of the current thread of execution.
It was raised that the current design limits the creation of an execution_context to a single execution_resource, therefore enforcing that the execution_context represent all member execution_resources. This excludes the case where you may want to, for example, create an execution_context from half the threads of a core. In this case, you would want to list the execution_resources you want the execution_context to represent.
For this we need to add an additional constructor to the execution_context, or alter the existing constructor to allow it to take a set of execution_resources. Perhaps we could do this by providing partitioning algorithms which can partition an execution_resource in a particular way and return a new iterable which could then be passed to the execution_context constructor.
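The partitioning-algorithm idea could be sketched as follows. The names (`take_first_half`, the constructor shape) are assumptions for illustration only:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: a partitioning algorithm takes an execution
// resource's members and returns the subset a context should represent,
// e.g. the first half of a core's threads; the result is then passed to an
// execution_context constructor taking a set of resources.
struct execution_resource { int native_id; };

inline std::vector<execution_resource>
take_first_half(const std::vector<execution_resource>& members) {
  return {members.begin(),
          members.begin() + static_cast<std::ptrdiff_t>(members.size() / 2)};
}

// Context constructible from any set of resources, not just a single one.
class execution_context {
  std::vector<execution_resource> resources_;
public:
  explicit execution_context(std::vector<execution_resource> rs)
      : resources_(std::move(rs)) {}
  std::size_t size() const { return resources_.size(); }
};
```

The appeal of this shape is that the partitioning step stays a free algorithm, so new partitioning strategies can be added without touching the context's constructor set.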
There was feedback from Jacksonville that we should start looking into how to support dynamic device discovery, where an execution resource within a system may become available or unavailable during execution.
In order to reliably support this feature, there needs to be a guarantee that, if a device can become available or unavailable during execution, this can be handled gracefully. For this reason we should aim to make this feature optional, so that implementations which cannot handle dynamic device discovery gracefully can opt not to support it.
Supporting dynamic device discovery would also mean that the system topology may change between one query and another. So there needs to be a way for users to be notified of a change to the system topology through some form of callback mechanism, and there needs to be a way for users to update the topology information when this happens.
Another option could be to not be specific about whether an execution resource is dynamic or static, but simply to allow some execution resources to be updated by dynamic device discovery.
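A minimal sketch of the callback mechanism mentioned above might look like this; the names (`system_topology`, `on_change`, `topology_event`) are invented for illustration:

```cpp
#include <functional>
#include <vector>

// Hypothetical sketch: a topology object that lets users register a callback
// invoked whenever a resource becomes available or unavailable, so cached
// topology information can be refreshed.
enum class topology_event { resource_added, resource_removed };

class system_topology {
  std::vector<std::function<void(topology_event)>> observers_;
public:
  void on_change(std::function<void(topology_event)> cb) {
    observers_.push_back(std::move(cb));
  }
  // Called by the implementation when dynamic discovery detects a change.
  void notify(topology_event ev) {
    for (auto& cb : observers_) cb(ev);
  }
};
```

An implementation that cannot support dynamic discovery could simply never invoke the callbacks, which fits the "make this feature optional" direction above.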
@AerialMantis (cc: @Ruyk)
We are stalled on D0796r2 waiting for you to weigh in with reviews on pull request #52 and others.
There was feedback from Jacksonville that, while the internal structure of a system's topology when being queried is inherently hierarchical, it's generally desired that the user interface not be hierarchical.
For Rapperswil, we should add further discussion on the requirements of the system topology structure, and highlight that the more hierarchical structure is only for the low-level interface and that the high-level interface would be more descriptive.
In our last discussion, we decided that we should create a motivational example of how a developer could use the topology discovery design proposed in P1795 to optimise an algorithm such as matrix multiply based on different system architectures.
Feedback from the Rapperswil meeting suggested that we should remove the this_thread::bind and this_thread::unbind interface, as it is too open to misuse and also conflicts with other more desirable approaches.
We decided that, in order to avoid having to return a container of member resources from each resource, we should instead make the execution_resource (and the subsequent memory resource type) iterable. This means replacing the resources member function with begin and end member functions. Perhaps we should also define iterator traits for the resource types.
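The iterable shape could be sketched like this (a hypothetical illustration; the real proposal wording would need to pin down iterator category and traits):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: replacing the resources() member function with
// begin()/end(), so an execution_resource is directly iterable over its
// member resources and works with range-based for and std algorithms.
class execution_resource {
  std::vector<execution_resource> members_;
public:
  execution_resource() = default;
  explicit execution_resource(std::vector<execution_resource> members)
      : members_(std::move(members)) {}

  // Deduced return types keep the sketch simple; the proposal would define
  // iterator traits for these.
  auto begin() const { return members_.begin(); }
  auto end() const { return members_.end(); }
  std::size_t size() const { return members_.size(); }
};
```

With begin/end in place, `for (const auto& core : socket)` works directly, which is the usability gain over returning a container by value.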
In the current revision of the paper the execution_context is lightly defined, and simply provides a way to construct a context around a particular part of the system topology in order to get an executor or allocator with affinity to specific resources.
After discussing this design with Chris K on the executors telecon, he raised the very good question of what the execution_context is trying to be. Is it (a) the execution context: a polymorphic type which can serve as a wrapper for other concrete execution context types such as a thread pool, in the same way the polymorphic executor does? Or is it (b) simply another concrete execution context type, like the static thread pool type, that is specifically designed for managing the resources of a discoverable topology?
Both of these are reasonable goals, though they have very different scopes. If we were to aim for (a) then the scope is much larger: the execution_context must be fully compatible with all concrete execution context types, which would likely mean introducing explicit properties which can be mapped to the various properties of the concrete execution context types. If we were to aim for (b) then the scope is smaller: the execution_context can be limited to functionality which is required for managing resources discoverable in the topology. Additionally, if we were to aim for (b) we should probably rename the execution context type to something like resource_execution_context.
Personally, I think we should aim for (b) as it has a more limited scope, and trying to define a more generic execution context type means making the design compatible with many other concrete execution context types, of which there are currently not many. I feel it may be too early to try to define what a standard execution context should look like.
Feedback from SG14 on the proposed wording of P2155r0:
HMM (Heterogeneous Memory Management) is a proposed interface for supporting non-conventional memory architectures into the regular kernel path. We should look into this as background research for the paper.
https://github.com/torvalds/linux/blob/master/Documentation/vm/hmm.rst
Some feedback was received in the Belfast meeting that it would be useful to identify whether resources within a system topology are contested and be able to discover only the parts of the system topology which are non-contested, perhaps via some kind of flag.
The current design doesn't make any guarantee as to whether the resources reflected in the system topology are uniquely available and uncontested by another part of the application or another process. It would be beneficial to define when users can expect to have uncontested access to resources, when it's possible for the implementation to do so, and to provide a way to discover only resources that are available. Though this might have to be done at a fine-grained level, as some resources may not be able to reflect this information and some resources may only be partially contested; for example, a bounded thread pool may take a specific number of threads.
I can think of three different situations where this information could be available when discovering the system topology:
Perhaps this is something which needs to be queryable on a per-resource basis.
Another point to consider here is that whether resources are contested could change dynamically, so it would have to factor in consideration about how the topology is updated and how users are notified of changes.
Some feedback from the Belfast meeting was that it would be useful to have configuration providers which can inject information about resources into the system topology relevant to a particular environment.
Some initial thoughts: I wonder if such an interface could be used to add entirely new resources to the topology, or simply to add additional information to existing resources. I think the latter should be relatively straightforward, provided the resources available in the system match the expectations of the configuration provider. The former may be more complicated: adding new resources could be fairly trivial if we provide a way to create resources and populate their information; the difficulty would come in when defining connections, or possible contentions, with existing resources in the system topology.
It was pointed out that this is also a nice solution to the problem we have of how to define non-standard, domain-specific identification of the abstract C++ resources. If configuration providers can see the topology when injecting information, then they could be used to provide concrete labels for specific resources, even by just checking their names. These configuration providers could then be provided open-source, supplementary to the standard.
In the new update methods of the handler (to/from device), it seems that the case where buffer contents are updated with other buffer contents is missing.
Following on from a discussion here and here prior to P1795r1 about the return type of traverse_topology.
The two alternatives considered were to either have traverse_topology return a vector<system_resource>, which requires system_resource to be copy constructible, or to have traverse_topology return a ranges::view<system_resource> so that the collection can be further processed lazily after it is returned. We also discussed the possibility of combining the best of both, by having system_resource be semiregular and then returning a ranges::view<system_resource> that is temporarily tied to the lifetime of the system_resource but capable of being assigned to a container such as a vector after any lazy transformation is done.
This also raised the question of whether the topology information contained within a system_topology object is static; I believe we are leaning towards this being the case, to avoid the topology being modified asynchronously while it's being inspected.
We should continue the discussion of this and clarify this in P1795r2.
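To illustrate the combined option without depending on the ranges library, here is a hand-rolled sketch: `system_resource` is semiregular, the view is tied to the topology data it was built from, and it can be materialised into a `std::vector` once lazy filtering is done. All names are illustrative assumptions:

```cpp
#include <functional>
#include <vector>

// Hypothetical sketch of the middle-ground option discussed above.
struct system_resource { int id = 0; };  // semiregular by design

class resource_view {
  const std::vector<system_resource>* src_;  // tied to the topology's lifetime
  std::function<bool(const system_resource&)> pred_;
public:
  resource_view(const std::vector<system_resource>& src,
                std::function<bool(const system_resource&)> pred)
      : src_(&src), pred_(std::move(pred)) {}

  // Materialise: copies the selected semiregular elements out of the view,
  // detaching the result from the topology's lifetime.
  std::vector<system_resource> to_vector() const {
    std::vector<system_resource> out;
    for (const auto& r : *src_)
      if (pred_(r)) out.push_back(r);
    return out;
  }
};
```

The static-topology question matters here because the view dereferences the topology's storage lazily; if the topology could change asynchronously, the view would need some synchronisation or snapshot semantics.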
Currently, the requirements for the execution_resource are quite vague:
[Note: The intention is that the actual implementation details of a resource topology are
described in an execution context when required. This allows the execution resource objects
to be lightweight objects that serve as identifiers that are only referenced. --end note]
In #40 we decided that the execution_resource should be copyable and moveable, but must act as an opaque type with reference counting semantics.
Taken from #40:
- Answering the second point, the execution_resource should remain copyable and moveable so that it can be used within std algorithms, but it should be an opaque type with reference counting semantics.
Perhaps we want to introduce normative wording which requires certain behaviour of the execution_resource when being copied or moved, in order to guarantee the correct behaviour.
Feedback from some users after trying out the placeholder accessors seems to indicate that they should be default constructible, and that the buffer should be assigned later during the requirement setting stage.
In the last call we discussed the direction to go in for P1437: System topology discovery for heterogeneous & distributed computing, now that it's been split off from P0796. We looked at some of the use-cases for having a low-level affinity interface in C++ and what we would like such an interface to look like. We decided that based on the feedback from Kona we should refocus the motivation and goals of the proposal for a low-level affinity interface in the first revision of P1437.
Some of the benefits of a low-level affinity interface in C++ that we discussed were:
We discussed that having a standardized interface in C++ for querying the topology of a system for its execution resources and the affinity relationships between those resources would be highly beneficial for writing generic code that can target heterogeneous platforms. However, we also recognised that expecting C++ to keep up with the rapidly changing and developing architectures within heterogeneous computing domains, and to support their various unique features and capabilities, is unrealistic. To this end, we would like to aim instead for C++ to provide a unified layer between future hardware standardization efforts like HMM and executor-based programming models such as thread pools, SYCL or Kokkos. This would provide a middle layer for users and library implementors to target in order to write more generic and potentially "performance portable" applications and programming models, whilst also providing hardware vendors with a way to extend the interface to provide support for the more unique features and capabilities of their architectures.
We discussed concerns that the current C++ abstract machine and the language around it are just not sufficient for describing heterogeneous systems. So while expecting the C++ abstract machine to be completely revamped to cover a range of different hardware features and capabilities is unrealistic, there will have to be some new language introduced to allow C++ to describe the system topologies that are being queried. We noted that this is something that is even becoming evident in P0443, the unified executors proposal, where it's proving difficult to express certain properties in the language the C++ abstract machine currently provides.
Closely related to this we discussed the move towards a unified address space in heterogeneous systems via SVM and HMM. We made the point that this move actually makes the case for affinity in C++ stronger, because while you have different address spaces, the distinction between different hardware memory regions and their capabilities are clear, but once you have a single unified address space, potentially with cache coherency, distinguishing different memory regions becomes much more subtle. Therefore it becomes much more important to understand the various memory regions and their affinity relationships in order to achieve good performance on various hardware.
We also discussed one of the more controversial aspects of P0796: the current representation of the system topology, still largely hierarchical, is closely based on Hwloc. While Hwloc is widely used in many domains, it does not always accurately represent existing machines, because its structure is strictly hierarchical, while many machines no longer have a simply hierarchical topology. To solve this we discussed a potential graph representation for a system topology, where you have node relationships that represent the containment relationships of machines, sockets, CPUs, etc., but also node relationships that represent network and memory region connections. So the graph becomes more of an opaque system representation that can be viewed from a number of different perspectives, depending on which relationships you are interested in.
Going forward here I think we should have some further discussion of the motivation and goals and perhaps decide on some clear use cases and then at some point I would like to put together a merge request for updating the front matter of P1437, and perhaps take out the proposed interface for now.
As discussed in #40, we need to provide a way for users to identify the type of an execution_resource.
Taken from #40:
Answering the first point, the execution_resource should be a generic execution resource type that isn't associated with any particular type of resource; however, we should introduce some way of identifying what kind of resource a particular execution_resource is. A runtime approach would be favourable over a compile-time approach: firstly because many low-level APIs which provide access to a system's topology, such as Hwloc, HSA and OpenCL, are runtime discoverable, so a compile-time interface would not be suitable for expressing this; and secondly because having a compile-time interface would mean introducing a large number of types, which would reduce or complicate the ability to store resources generically.
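A runtime identification query could be as simple as the following sketch. The enumerator set and member names are assumptions; a real interface would likely need vendor-specific extension beyond a fixed enum:

```cpp
#include <string>

// Hypothetical sketch: a runtime query identifying what kind of resource a
// generic execution_resource represents, avoiding a compile-time type per
// resource kind.
enum class resource_kind { thread, core, socket, gpu, numa_node, other };

class execution_resource {
  resource_kind kind_;
  std::string name_;
public:
  execution_resource(resource_kind k, std::string name)
      : kind_(k), name_(std::move(name)) {}
  resource_kind kind() const { return kind_; }  // runtime, not a type
  const std::string& name() const { return name_; }
};
```

Because the kind is a runtime value rather than a type, resources of different kinds can be stored in one container, which is the "store resources generically" point above.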
With the new proposal for buffer properties the optional mutex parameter of the buffer constructor can instead be provided as a property. This will reduce the number of buffer constructors and allow the mutex to be used in combination with other constructors not possible before.
One comment that was made in Belfast was that the naming of the properties reflects an older revision of OpenMP, so one of the first things I propose for this paper is to update the naming to that of OpenMP 5.0.
This would mean:
bulk_execution_affinity.none (remains the same)
bulk_execution_affinity.scatter -> bulk_execution_affinity.spread
bulk_execution_affinity.compact -> bulk_execution_affinity.close
bulk_execution_affinity.balanced (remains the same)
I also wanted to clarify the meaning of the concurrency property in P1436r2, particularly as I think this could be relevant to the wording of the bulk_execution_affinity properties. The intention is that it represents the maximum number of concurrent execution agents available to an executor when used in a single invocation of execution::bulk_execute. This does not guarantee that these execution agents will always be created with concurrent forward progress, and it also assumes that the execution resources are uncontested by other executors or third-party libraries. One concern with this definition that we may want to address is that it does not allow any control over the domain or level of the hierarchy it is applied to, so you cannot use this property for nested calls to execution::bulk_execute with different affinity bindings as you would in, say, OpenMP; this is perhaps something we want to address.
For the wording of the bulk_execution_affinity properties, I have drafted initial wording based on the discussions in Belfast (I hope I accurately captured the direction we were going in). The basis of this wording is the assumption that an invocation of execution::bulk_execute(e, f, s) creates a consecutive sequence of work-items from 0 to s-1, mapped to the available concurrency, that is some number of execution resources, which are subdivided in some implementation-defined way.
Property | Wording |
---|---|
bulk_execution_affinity.none | A call to execution::bulk_execute(e, f, s) is not required to bind the created execution agents for the work-items of the iteration space specified by s to execution resources. |
bulk_execution_affinity.close | A call to execution::bulk_execute(e, f, s) should aim to bind the created execution agents for the work-items of the iteration space specified by s to execution resources such that the average locality distance between adjacent work-items is minimized, only binding subsequent execution agents to a resource if no other resource would otherwise result in fewer execution agents being bound to it. |
bulk_execution_affinity.spread | A call to execution::bulk_execute(e, f, s) should aim to bind the created execution agents for the work-items of the iteration space specified by s to execution resources such that the average locality distance of adjacent work-items in the same subdivision of the available concurrency is maximized and the average locality distance of adjacent work-items in different subdivisions of the available concurrency is maximized, only binding subsequent execution agents to a resource if no other resource would otherwise result in fewer execution agents being bound to it. |
bulk_execution_affinity.balanced | A call to execution::bulk_execute(e, f, s) should aim to bind the created execution agents for the work-items of the iteration space specified by s to execution resources such that the average locality distance of adjacent work-items in the same subdivision of the available concurrency is minimized and the average locality distance of adjacent work-items in different subdivisions of the available concurrency is maximized, only binding subsequent execution agents to a resource if no other resource would otherwise result in fewer execution agents being bound to it. |
Note: the subdivision of the available concurrency is implementation-defined.
Note: when the number of work-items is greater than the available concurrency, the binding should wrap following the same subdivision.
We may want to reconsider the terms "concurrency" and "locality distance" in the above wording; another suggestion during the SG1 session was to incorporate the idea of "interference", as used in the existing hardware_[constructive|destructive]_interference queries.
Additionally, the current behaviour is that when the number of work-items is greater than the available concurrency the binding wraps; however, we may wish to define further properties for alternative chunking patterns.
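To sanity-check the wording, the close and spread patterns (with wrap-around) can be modelled as simple index mappings. This is my own sketch under the assumption that the available concurrency is subdivided into equal contiguous chunks; it is not proposal text:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model of the binding patterns: map each work-item index to a
// resource index. For "close", adjacent work-items land on adjacent
// resources; for "spread", adjacent work-items land in different
// subdivisions. Items beyond the available concurrency wrap over the same
// pattern.
std::vector<std::size_t> bind_close(std::size_t items, std::size_t resources) {
  std::vector<std::size_t> map(items);
  for (std::size_t i = 0; i < items; ++i)
    map[i] = i % resources;  // adjacent items on adjacent resources, wrapping
  return map;
}

std::vector<std::size_t> bind_spread(std::size_t items, std::size_t resources,
                                     std::size_t subdivisions) {
  std::vector<std::size_t> map(items);
  const std::size_t per_sub = resources / subdivisions;  // assumed equal chunks
  for (std::size_t i = 0; i < items; ++i) {
    const std::size_t j = i % resources;  // wrap over the same subdivision
    // Adjacent items alternate between subdivisions, then advance within one.
    map[i] = (j % subdivisions) * per_sub + (j / subdivisions);
  }
  return map;
}
```

For 4 resources in 2 subdivisions ({0,1} and {2,3}), spread maps work-items 0..3 to resources 0, 2, 1, 3, so adjacent items fall in different subdivisions, while close maps them to 0, 1, 2, 3.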
This proposed wording was also sent to the SG1 mailing list to start a discussion there.
I feel we make that switch rather fast, without an overall high-level design description. Or maybe it is scattered through the Proposed Wording.
I didn't find that part in the paper, are we not posting that part for review?
It was suggested at the Rapperswil meeting that we should consider alternatives to having the affinity_query comparison operators return a size_t which describes the magnitude of the relative affinity, such as having the comparison operators return a boolean.
Update the contributor list for D0796r2.
We decided that the polymorphic allocator does fit the requirements we have for affinity-based allocation, so we should drop the pmr_memory_resource_type from the paper and just leave the allocator_type.
Whilst adding the bulk_execution_affinity properties to the paper, there was some discussion about who should be responsible for specifying the properties: the execution context, the executor, or both.
(taken from #48):
Ruyk:
Let me see if I understand this properly. Assuming this is a simple fixed size thread pool of size 4 underneath:
The thread pool itself is created and "maintained" by the execution context. That means the threads are created on construction of the execution context. At this point, threads are bound to whatever execution resource(s) have been indicated on construction of the executionContext, if any.
Then we perform a require for a bulk executor following a scatter policy. What do you expect to happen?
A. The threads of the thread pool are re-bound following the scatter placement of threads per resource
B. The new affExec will enforce placement of execution agents on the thread pool threads following the scatter policy.
C. Neither of the above
My understanding is B from the proposal.
However, if what I want to do is place the actual execution threads following scatter policy on the given cores, I would need to pass the policy on construction of the execution context, rather than the executor (since the thread pool has been created already). I could potentially re-bind threads after they have been created, but that has a cost that could be avoided if the initial placement of resources of the thread pool is done on construction. Can we have an alternative constructor where these policies are passed to the execution context? If we are having a fine-grain selection of member_of resources, this will allow for the high-level interface to be used in that case.
Now we call bulk_execute with the callable. How is each execution agent placed onto an execution resource? If the executor property is thread_execution_mapping_t or other_execution_mapping_t, that means the executor will query the placement (presumably an id) and place agents on existing threads. Which one is a valid placement?
A. Agent 0 in Thread 0, Agent 1 in Thread 2, Agent 2 in Thread 1, Agent 3 in Thread 3, then loop over again for the remaining 4 agents
B. Agent 0 in Thread 0, Agent 1 in Thread 3, Agent 2 in Thread 1, Agent 3 in Thread 3, then loop over again for the remaining 4 agents
C. Either of the above
My understanding is A will be correct, or at least commonly expected.
However, when oversubscribing agents on threads, it is not clear what to expect when using these policies. Will the executor hold execution of an agent until the "far away" resource is available, or will it execute consecutively if possible?
Also note that the same execution context (in this case, the same thread pool) may be in use by multiple executors. What is the expectation in terms of placement of agents on threads?
AerialMantis:
The way I see this working is that you might construct an execution context, say the static_thread_pool, and initialise it with a set number of threads of execution, such as 4. By default, the bulk_execution_affinity property would be none, so the thread pool wouldn't be required to make any kind of guarantee as to how its threads of execution are bound to the underlying resources. It could automatically bind each of its threads of execution to a certain resource if it makes sense for it to do so, but it wouldn't have to. It would only be when a different bulk_execution_affinity property was requested by an executor that it would be required to perform binding in a particular pattern. So when this happens, if the thread pool had already performed binding, then it may have to rebind to achieve the binding pattern requested by the executor.
The reason for having these properties on the executor rather than the execution context is to make them more usable as a high-level feature. There are some executor implementations which will not make use of an execution context, such as inline executors, which are just created as concrete executor types and used without referring to an explicit execution context; an execution context still exists, though it's implicit.
I think we should have a more fine-grained interface for configuring execution contexts, where you can specify a specific affinity binding on construction. In this case, the executor which the execution context provides would only support the bulk_execution_affinity property which the execution context was constructed with, so it could not be requested to use another, essentially becoming query-only. So essentially this would mean that you could specify affinity binding at the execution context level or the executor level, but specifying it at the execution context level would take precedence over the executor level, as the executor would inherit the property.
In terms of how the affinity binding is implemented within bulk_execute, I would also agree that A would be the most commonly expected behaviour, though I think it would be okay to allow an implementation to decide, provided it matched the requirements of the pattern and was consistent across invocations of bulk_execute. For oversubscribed agents I would expect multiple agents to be mapped to the same thread of execution, so only one agent could make progress at a time. How those agents are ordered would be at the discretion of the implementation, though implementations would be expected to order agents in the way that is most efficient for the requested binding pattern.
You raise an interesting point about how an execution context will deal with multiple executors submitting work to it. In terms of affinity binding patterns, I would expect that if one executor requested one bulk_execution_affinity property and another executor requested a different one, then one task would have to be scheduled before the other, with the affinity binding pattern being altered between tasks. We would recommend users not do this frequently, as repeatedly rebinding the threads of execution could be inefficient. If both tasks used the same affinity binding pattern, then they could conceivably be overlaid, though whether this could be done efficiently would be at the discretion of the implementation.
Ruyk:
So essentially this would mean that you could specify affinity binding at the execution context level or the executor level, but specifying it at the execution context level would take precedence over the executor level as the executor would inherit the property.
I guess the opposite would make more sense: the one closer to the user should make the final decision.
My concern is that associating the binding with the executor may incur extra costs, since executor dispatch may be called in a performance-critical part of the code, whereas the execution context can be defined anywhere.
Apart from that, what you say makes sense to me. Since this is an exploratory paper, I am happy for this to be merged with the minor change above, plus a straw poll for deciding where the affinity should be specified (execution context, executor or both).
AerialMantis:
Yeah, that's a good point; perhaps what we want, then, is for the executor to be able to override the behaviour of the execution context. The property of the execution context could be inherited by the executor as the default behaviour, but could still be altered by the executor.
I think that's a fair concern: putting control over affinity binding at the executor level could incur costs at the wrong time. However, if we also add the ability to configure the execution context with the same property, this should alleviate that, as the cost would be paid at configuration time.
Okay great, I will add some notes to the paper covering some of the points we discussed here and I will add a straw poll for where the affinity binding capability should go.
It was raised at the Rapperswil meeting that the proposal for mdspan (see http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0009r7.html) also provides mechanisms for specifying how memory is laid out and that we should ensure that the affinity proposal aligns with this.
There was feedback from Jacksonville that it would be useful to be able to query the load factor of an execution context, so as to make a decision based on the current load of different contexts.
I wonder if we need a memory_context, similar to the execution_context, that has the allocation capabilities, e.g. you retrieve an allocator that is bound to the memory resource. This would allow an implementation to add the machinery required to allocate, or to bind allocation, on the memory_context itself.
In the current revision of the paper (r3), the discover_topology function is permitted to throw an exception in the case of a failure in discovering the system's topology. However, this could be problematic, as the failure could prevent a library dependent on this discovery from functioning, even if the failure had nothing to do with the resources the library was looking to utilise.
A solution could be to support partial failure in topology discovery: calling discover_topology would be permitted to fail, yet still return a valid topology structure representing whatever was discovered successfully. The way in which these errors are reported (i.e. exceptions or error values) would have to be decided; exceptions could be problematic, as they could unwind the stack before the important topology information has been captured.
Affinity has an affinity_query between two execution resources. We recommend this be between an execution resource and a memory resource.
In P0796r2, affinity queries are performed by using the comparison operators > and < between two affinity_query objects, returning an expected<size_t, error_type> representing the magnitude of the difference between the two properties. However, some feedback has pointed out that this approach is problematic, as it means the <= and >= operators cannot be supported consistently.
We should consider alternative approaches which allow for a more typical use of the equality operators but still meet the existing requirements.
Adding a memset operation to SYCL by building on clEnqueueFillBuffer: https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/clEnqueueFillBuffer.html