agency-library / agency Goto Github PK

View Code? Open in Web Editor NEW

155.0 13.0 27.0 3.84 MB

Execution primitives for C++

Home Page: http://agency-library.github.io

License: BSD 3-Clause "New" or "Revised" License

C++ 89.39% Cuda 9.99% Python 0.62%

cpp execution threading thread-pool executor parallelism cuda concurrency

agency's Introduction

What is Agency?

Agency is an experimental C++ template library for parallel programming. Unlike higher-level parallel algorithms libraries like Thrust, Agency provides lower-level primitives for creating execution. Agency interoperates with standard components like execution policies and executors to enable the creation of portable parallel algorithms.

Examples

Agency is best-explained through examples. The following program implements a parallel sum.

#include <agency/agency.hpp>
#include <agency/experimental.hpp>
#include <vector>
#include <numeric>
#include <iostream>
#include <cassert>

int parallel_sum(int* data, int n)
{
  // create a view of the input
  agency::experimental::span<int> input(data, n);

  // divide the input into 8 tiles
  int num_agents = 8;
  auto tiles = agency::experimental::tile_evenly(input, num_agents);

  // create 8 agents to sum each tile in parallel
  auto partial_sums = agency::bulk_invoke(agency::par(num_agents), [=](agency::parallel_agent& self)
  {
    // get this parallel agent's tile
    auto this_tile = tiles[self.index()];

    // return the sum of this tile
    return std::accumulate(this_tile.begin(), this_tile.end(), 0);
  });

  // return the sum of partial sums
  return std::accumulate(partial_sums.begin(), partial_sums.end(), 0);
}

int main()
{
  // create a large vector filled with 1s
  std::vector<int> vec(32 << 20, 1);

  int sum = parallel_sum(vec.data(), vec.size());

  std::cout << "sum is " << sum << std::endl;

  assert(sum == vec.size());

  return 0;
}

This code example implements a vector sum operation and executes it sequentially, in parallel, in parallel on a single GPU, and finally multiple GPUs:

#include <agency/agency.hpp>
#include <agency/cuda.hpp>
#include <vector>
#include <cassert>
#include <iostream>
#include <algorithm>

int main()
{
  using namespace agency;

  // allocate data in GPU memory
  using vector = std::vector<float, cuda::managed_allocator<float>>;

  size_t n = 1 << 20;
  float a = 13;
  vector x(n, 1);
  vector y(n, 2);
  vector z(n, 0);

  vector reference(n, 13 * 1 + 2);

  float* x_ptr = x.data();
  float* y_ptr = y.data();
  float* z_ptr = z.data();


  // execute sequentially in the current thread
  bulk_invoke(seq(n), [=](sequenced_agent& self)
  {
    int i = self.index();
    z_ptr[i] = a * x_ptr[i] + y_ptr[i];
  });

  assert(z == reference);
  std::fill(z.begin(), z.end(), 0);


  // execute in parallel on the CPU
  bulk_invoke(par(n), [=](parallel_agent& self)
  {
    int i = self.index();
    z_ptr[i] = a * x_ptr[i] + y_ptr[i];
  });

  assert(z == reference);
  std::fill(z.begin(), z.end(), 0);


  // execute in parallel on a GPU
  cuda::grid_executor gpu;
  bulk_invoke(par(n).on(gpu), [=] __device__ (parallel_agent& self)
  {
    int i = self.index();
    z_ptr[i] = a * x_ptr[i] + y_ptr[i];
  });

  assert(z == reference);
  std::fill(z.begin(), z.end(), 0);
  

  // execute in parallel on all GPUs in the system
  cuda::multidevice_executor all_gpus;
  bulk_invoke(par(n).on(all_gpus), [=] __device__ (parallel_agent& self)
  {
    int i = self.index();
    z_ptr[i] = a * x_ptr[i] + y_ptr[i];
  });

  assert(z == reference);
  std::fill(z.begin(), z.end(), 0);


  std::cout << "OK" << std::endl;
  return 0;
}

Discover the Library

Refer to Agency's Quick Start Guide for further information and examples.
See Agency in action in the collection of example programs.
Browse Agency's online API documentation.

Agency is an NVIDIA Research project.

agency's People

Contributors

Stargazers

Watchers

agency's Issues

Consider introducing then_execute

It seems important for executors to be able to create asynchronous continuations dependent on a predecessor task in the style of future::then(). It is unclear if a hypothetical executor_traits::then_execute() function, whose execution would depend on the completion of a preceding task, would be more basic than executor_traits::async_execute(). At first glance, it seems difficult to implement then_execute() generically in terms of async_execute() because the executor needs to be directly involved in the details of scheduling a dependent. However, async_execute() could be implemented in terms of then_execute() if executor_traits provided a way to create an immediately ready future. The drawback of requiring then_execute() to be a basic operation is that it would burden all executor authors with implementing dependent scheduling.

Consider renaming executor functions

Should distinguish between the extra parameter marshaling that bulk_invoke & bulk_async do (i.e., they interoperate with share()) and the primitive functionality of the executor functions.

Also, since the noun is executor, the verb should be something incorporating "execute".

Some ideas:

execute & async_execute
bulk_execute & bulk_async_execute

Think I'd prefer ditching the bulk prefix for the executor functions because we can distinguish bulk execution from simple execution via overloads -- the simple overloads don't require a shape parameter.

Consider adding executor_traits::when_all_execute

The most general form would look something like this:

template<size_t... Indices, class TupleOfFutures, class Function, class... Types>
future<tuple<...>> when_all_execute(executor_type& ex, TupleOfFutures&& futures, Function f, shape_type shape, Types&&... shared_inits);

The indices allow the caller to select which futures to forward to the result. The other futures get consumed by the operation.

With when_all_execute (and the operation which waits on a future), we should be able to implement any other executor_traits operation:

// when_all()
template<size_t... Indices, class TupleOfFutures>
future<tuple<...>> when_all(executor_type& ex, TupleOfFutures&& futures)
{
  using traits = agency::executor_traits<executor_type>;
  return traits::when_all<Indices...>(ex, std::forward<TupleOfFutures>(futures), [](auto&...){});
}

// when_all() which returns everything
template<class TupleOfFutures>
future<tuple<...>> when_all(executor_type& ex, TupleOfFutures&& futures)
{
  using traits = agency::executor_traits<executor_type>;
  return traits::when_all<0,1,2,...>(ex, std::forward<TupleOfFutures>(futures));
}

// default singleton when_all_execute() for native bulk executors
template<size_t... Indices, class TupleOfFutures, class Function, class... Types>
future<tuple<...>> when_all_execute(executor_type& ex, TupleOfFutures&& futures, Function f)
{
  using traits = agency::executor_traits<executor_type>;
  return when_all<Indices...>(ex, std::forward<TupleOfFutures>(futures), [=](auto&... past_values, index_type idx)
  {
    f(past_values...);
  },
  detail::shape_cast<shape_type>(1));
}

// default bulk when_all_execute() for native singleton executors
template<size_t... Indices, class TupleOfFutures, class Function>
future<tuple<...>> when_all_execute(executor_type& ex, TupleOfFutures&& futures, Function f, shape_type shape)
{
  using traits = agency::executor_traits<executor_type>;
  return when_all<Indices...>(ex, std::forward<TupleOfFutures>(futures), [](auto&... past_values)
  {
    for(index_type idx = 0; idx < shape_type; ++idx)
    {
      f(past_values..., idx);
    }
  });
}

// default singleton then_execute()
template<class Executor, class T, class Function>
future<result_of_t<Function(T)>>
  then_execute(Executor& ex, future<T>& past, Function f)
{
  using traits = agency::executor_traits<Executor>;
  using result_type = result_of_t<Function(T)>;

  // XXX there needs to be a make_ready_future which accepts an allocator so we can leave the
  //     result uninitialized
  auto result = traits::make_ready_future<result_type>(ex);

  return traits::when_all<0>(ex, result, past, [=](result_type& result, T& past_value)
  {
    // XXX this assignment would actually be a placement new
    result = f(past_value);
  }
}

// bulk then_execute<void>:
template<class Executor, class Future, class Function>
static agency::executor_traits<Executor>::template future<void>
  then_execute(Executor& ex, Future& past, Function f, shape_type shape)
{
  using traits = agency::executor_traits<Executor>;

  return traits::when_all_execute_and_select<>(ex, past, make_tuple(move(f)), shape);
  );
}

// bulk then_execute<Container>:
template<class Container, class Executor, class Future, class Function>
static agency::executor_traits<Executor>::template future<Container>
  then_execute(Executor& ex, Future& past, Function f, shape_type shape)
{
  using traits = agency::executor_traits<Executor>;
  auto results = traits::template make_ready_future<Container>(ex, shape);

  auto results_and_past = make_tuple(move(results), move(past));
  return traits::template when_all_execute_and_selectl<0>(ex, move(results_and_past), [=](Container& results, auto& past, index_type idx)
  {
    results[idx] = f(past, idx);
  });
}

// bulk async_execute<Container>
template<class Container, class Executor, class Function>
static agency::executor_traits<Executor>::template future<Container>
  async_execute(Executor& ex, Function f, shape_type shape)
{
  using traits = agency::executor_traits<Executor>;
  auto immediately_ready = traits::make_ready_future<void>(ex);
  return traits::then_execute<Container>(ex, immediately_ready, f, shape);
}

// bulk async_execute<void>
template<class Executor, class Function>
static agency::executor_traits<Executor>::template future<void>
  async_execute(Executor& ex, Function f, shape_type shape)
{
  using traits = agency::executor_traits<Executor>;
  auto immediately_ready = traits::make_ready_future<void>(ex);
  return traits::then_execute(ex, immediately_ready, f, shape);
}

// bulk execute<Container>
template<class Container, class Executor, class Function>
static agency::executor_traits<Executor>::template future<Container>
  execute(Executor& ex, Function f, shape_type shape)
{
  using traits = agency::executor_traits<Executor>;
  return traits::async_execute<Container>(ex, f, shape).get();
}

// bulk execute<void>
template<class Executor, class Function>
static agency::executor_traits<Executor>::template future<void>
  execute(Executor& ex, Function f, shape_type shape)
{
  using traits = agency::executor_traits<Executor>;
  return traits::async_execute(ex, f, shape).get();
}

make_ready_future needs a form which takes an allocator

We need to be able to customize the allocation of the ready value via allocator.

For example, sometimes the storage for the object is ready, but the value itself isn't -- we need to placement new the object later. We can handle this with an uninitializing allocator.

It makes sense, because std::promise's constructor has a form which takes an allocator, so this "immediate promise" should as well for symmetry.

is_executor doesn't work for nested executors

Needs to check for the right number of shared arguments.

std::barrier implementation is broken

eliminate executor_traits::shared_param_type

Polymorphic lambdas make this typedef unnecessary and its presence makes it difficult to change how executors marshal shared parameters to the lambda.

Consider making executor::bulk_invoke variadic

Requiring the shared parameters to come packed in a tuple is fairly complicated to support because reasoning about tuples of references is difficult. It might be simpler if it took a shared parameter per level of hierarchy. It would also eliminate the need for executor_traits::shared_param_type.

Implement cuda::detail::make_ready_future<T>

Implementation should probably reuse cuda::detail::make_unique.

Replace discard_value() with general move constructor

Instead of a special member function to turn future<T> into future<void>, we should just provide a move constructor template. Implementations can provide specializations for conversions which can be accelerated, such as T -> void or potentially Derived -> Base.

future_cast would first look for a converting move constructor. If none is found, it can create a continuation.

Add is_executor

Would check for execution_category, and the asynchronous execution member function.

Implement nested_executor::bulk_invoke

Executor adaptors like nested_executor should be sure to implement bulk_invoke to get the most efficient implementation. Right now we get bulk_async().wait() which is not efficient when adapting sequential_executor.

Consider adding executor_traits::when_all

then_execute can only take a single dependency, but we often need to write tasks that have many dependencies. There needs to be a simple way for executor users to introduce a join point that is also visible to executors.

One could in principle implement when_all by doing a bulk then_execute and having agents individually wait on futures. However, we should avoid solutions which require agents to work at the level of futures.

Consider making decay_construct_result take reference types

decay_construct_result requires the user to first decay the template parameter in perfect forwarding cases. Since these cases are frequent, it might be better to do the decay within the specializations of decay_construct_result.

Figure out how to do deferred construction of shared parameters

The scope of the basic problem is described here.

For now, we can have bulk_invoke check for specially annotated shared parameters and construct them using the constructor parameters inside, and then copy the value of the parameter to the lower-level executor. This way the executor doesn't have to worry about any of this mess and it's all localized in the implementation of agency::bulk_invoke.

Here's a sketch of the placeholder implementation, which documents the disadvantages:

#include <agency/sequential_executor.hpp>
#include <agency/detail/tuple.hpp>
#include <iostream>

template<class T, class... Args>
struct share_result_t1
{
  __AGENCY_ANNOTATION
  T make() const
  {
    return __tu::make_from_tuple<T>(args_);
  }

  agency::detail::tuple<Args...> args_;
};

template<class T>
share_result_t1<T,T> share1(const T& val)
{
  return share_result_t1<T,T>{std::make_tuple(val)};
}

template<class Function, class ShareResult>
void bulk_invoke1(Function f, ShareResult share_me)
{
  auto exec = agency::sequential_executor{};

  // explicitly construct the shared parameter
  // it gets copy constructed by the executor
  // XXX problems with this approach
  //     1. the type of the shared parameter is constructed twice
  //     2. requires the type of the shared parameter to be copy constructable
  //     3. won't be able to support concurrent construction
  auto constructed_shared_arg = share_me.make();

  exec.bulk_invoke([=](size_t idx, int& shared_arg)
  {
    f(7, shared_arg);
  },
  1,
  constructed_shared_arg);
}

int main()
{
  auto lambda = [](int mine, int& shared)
  {
    std::cout << "mine: " << mine << std::endl;
    std::cout << "shared: " << shared << std::endl;
  };

  bulk_invoke1(lambda, share1(13));

  return 0;
}

Optimize tuple_find_non_null

This function requires several seconds of compile time.

We should be able to speed it up by replacing it with something like this:

auto tuple_get_non_null(const tuple<Types...>& t)
{
  using type = // figure out which element of the tuple is not null and its type
  return get<type>(t);
}

The current implementation uses tuple_filter_invoke and I think we pay for the unused generality.

Simplify execution agent shared parameter marshaling

Having to sweat whether or not execution_agent_traits::make_shared_initializer returns a tuple vs a scalar is irritating.

This is how it should work.

If an execution agent has a shared parameter, it defines the nested type shared_param_type which is constructible from its param_type. We already do this. The execution agent does not define make_shared_initializer or anything else.

Rename make_shared_initializer to make_shared_param_tuple or similar.

execution_agent_traits::make_shared_param_tuple checks for the existence of shared_param_type. If it exists, it sticks an element of this type in the appropriate slot in the tuple which it returns. If it does not exist, it sticks ignore.

This way, execution_agent_traits always returns a tuple we can manipulate uniformly. This should really simplify execution_group.

Consider introducing future_traits::discard_value()

Sometimes we are not interested in the value produced from a task, and are only interested in the task's completion. In these cases, what we want is a future<void>. The problem is that there is no way to produce a future<void> without introducing an empty continuation. In these cases, we really just want to discard the original future's value. Many types of futures should be able to implement this cheaply without interacting with an executor.

To intercept this use case, we can introduce the function

future_traits<Future>::template rebind<void> future_traits<Future>::discard_value(future_type& fut)

If the expression fut.discard_value() is well-formed, then we get a void future corresponding to fut's completion, and fut is invalid after the call. Otherwise, this function does not exist.

In the cases where this function does not exist, the client needs to interact with an executor using the general purpose future_cast function.

Consider adding a primitive executor_traits operation corresponding to future::get

The other two primitive operations, make_ready_future and then_execute are members of executor_traits. We may as well introduce the final primitive operation corresponding to std::future::get:

template<class Future>
typename future_traits<Future>::value_type get(executor_type& ex, Future& fut);

This would synchronize and return the value associated with a given future. The operation would also invalidate fut. The default implementation would first try future_traits<Future>::get(fut).

We'll want to choose a different name.

We might also want to optionally support a non-invalidating "wait, but don't get" operation similar to std::future::wait. Dunno.

Pass then_execute's predecessor value as shared parameter to continuation

then_execute needs to pass the value (if any) in its predecessor future as a "highest-level" shared parameter to the invocation of the continuation.

Consider introducing a future_cast function

If executors are able to define their own future types, there needs to be some way for them to interoperate. One way would be to allow casting between different types of futures.

Sometimes (like when writing glue code in #50), we need to cast the value type of a future from future<A> -> future<B>:

A foo();

B bar()
{
  static_cast<B>(foo());
}

future<A> async_foo();

future<B> async_bar()
{
  return future_cast<B>(async_foo());
}

Other times an API may require a your_future<T> when what we have is a my_future<T>. It seems like it ought to be possible to cast generically between different types of futures by nesting a .wait() inside an asynchronous call.

The general implementation might look like this:

template<T, class Executor, class Future>
executor_traits<Executor>::template future<T>
  future_cast(Executor& ex, Future& from)
{
  return executor_traits<Executor>::async_execute(ex, [from{move(from)}]
  {
    // .get() blocks this task
    return static_cast<T>(from.get());
  });
}

When Future is an instance of Executor::future, it can be optimized this way:

template<T1,class Executor,T2>
executor_traits<Executor>::template future<T1>
  future_cast(Executor& ex, executor_traits<Executor>::template future<T2>& from)
{
  return executor_traits<Executor>::then_execute(ex, [](auto& from_value)
  {
    // this task doesn't block
    return static_cast<T1>(move(from_value));
  });
}

Finally, when Future is an instance of Executor::future and T1 is void, we can call future_traits<Future>::discard_value(fut), if the call is well-formed. This completely avoids any interaction with the executor.

Eliminate superfluous nested_executor::async_execute

executor_traits should implement this on our behalf.

Figure out parameter order of then_execute

then_execute has three forms: a single-agent form, a multi-agent form, and a multi-agent form with shared parameters. All three of them take a single future as a dependency.

The single-agent form calls a function given the future's value as a parameter.
The multi-agent form calls a function given an agent index and the future's value as parameters.
The multi-agent with shared parameters form calls a function given an agent index, the future's value, and the shared parameters as parameters.

How should we order then_execute's parameters?

Should probably be:

then_execute(executor_type& ex, Future& fut, Function f);
then_execute(executor_type& ex, Future& fut, Function f, shape_type shape);
then_execute(executor_type& ex, Future& fut, Function f, shape_type shape, Types&&... shared_parameters);

The corresponding signatures of f should be:

result_type f(typename future_traits<Future>::value_type& past_arg);
result_type f(typename future_traits<Future>::value_type& past_arg, index_type idx);
result_type f(typename future_traits<Future>::value_type& past_arg, index_type idx, Types&... shared_parameters);

One alternative would be to swap the positions of fut & f.

rename nested_* to scoped_*

The nested execution/executor stuff is identical in intention to std::scoped_allocator_adaptor. We should use consistent nomenclature.

Consider adding agency::executor_arg protocol

Executors could pass themselves as a parameter to the user functions they invoke by detecting when the second parameter is preceded by a agency::executor_arg_t parameter. This would mirror the protocol used by allocators.

We could introduce executor_traits::invoke (mirroring allocator_traits::construct) to facilitate this process.

The difference between executor_traits::invoke and allocator_traits::construct is that it would be incumbent on executors to use executor_traits::invoke. By contrast, allocator clients are the users of allocator_traits::construct, which gets called after allocator_traits::allocate. executor_traits::invoke would be called by the executor from within a call to e.g. allocator_traits::execute.

Implement is_executor

Start simple - should check for execution_category and async_execute.

Consider making execution_agent_traits::execute variadic

Function would come last instead of first, and then variadic argument list.

Would really simplify the implementation of bulk_invoke.

cuda::grid_executor does not compute the correct global function pointer used in all cases

basic_grid_executor::global_function_pointer is parameterized on the type of Function used to instantiate the kernel inside of basic_grid_executor::bulk_async(Function, shape_type). Instead, it needs to be parameterized on the Function and an optional Tuple template parameter.

Likewise, grid_executor::max_shape should take the type of Args which will be passed to the Function.

With this interface it's going to be fairly difficult to ensure that the thing returned by global_function_pointer is the actual kernel that will be used to implement a bulk_async with the given types.

Figure out if concepts enable ideal when_all_execute_and_select syntax

when_all_execute_and_select logically requires two parameter packs:

template<size_t... SelectedIndices, class Function, class TupleOfFutures, class... Types>
future<...> when_all_execute_and_select(Function f, TupleOfFutures&& futures, Types&&... shared_parameters);

However, C++14 only permits a single parameter pack.

With Concepts (and once we take Allocators instead of general T for the shared parameters), we can potentially distinguish between the two parameter packs because the first pack of types must be Futures, and the second pack of types must be Allocators:

template<size_t... SelectedIndices, class Function, Future... Futures, Allocator... Allocators>
future<...> when_all_execute_and_select(Function f, Futures&... futures, Allocators&... allocators);

We should figure out if this will be possible.

Consider giving make_ready_future an overload taking constructor args

It would be handy if make_ready_future could take a list of constructor parameters and emplace the object inside the implementation.

nested_executor::bulk_async mistakenly captures inner shared argument by reference

When nested_executor::bulk_async splits the shared argument tuple into its head and tail, it creates a tuple of references for the tail.

When this tuple is captured by the lambda, these references are invalid.

When we split the tuple, we should create values instead of references. The most efficient way to do this is to move the head and move the tail to elide copies.

If bulk_async were variadic in shared arguments, this error would not have occurred, since the tuple would not have had to have been split.

recursive tuple implementation is inefficient

Introduce agency::decay_construct

This is like decay_copy, except that it can construct a type different from its argument. Executors use this to turn their shared_init parameter into a parameter the lambda will consume. (cf. the way std::async creates its parameters via decay_copy)

To make all this work, executors should receive their shared initializers via forwarding reference (just like how std::async & std::thread receive their function parameters).

Put this function in the parameter.hpp header. Also think up a better name for parameter.hpp. utility.hpp might be a better name. Would be nice to have all this live in bulk_invoke.hpp but I think the dependencies become circular with executors. Maybe put it in functional.hpp because that's where they propose to put std::invoke.

Also need a general way to name the type of result decay_construct (and its overloads) produces.

Relationship of agency with Thrust / Parallelism TS

I've read the description, looked at examples, but honestly don't understand: what does Agency do that cannot be done with Thrust / Parallelism-TS? Can you please elaborate?

My current, obviously incorrect, understanding is that it does basically the same thing, using slightly different syntax, described by slightly different jargon.

What can be done with bulk_invoke that cannot be done with for_each_n?
What are non-trivial differences between an execution policy and an executor(+additional state related to the parallel decomposition)?

Introduce executor adaptor which adapts executor shape/indices

It would be a lot simpler for execution policies to demand that the executor index match the agent index than to do the index & shape casting operation within bulk_invoke each time.

Rename regular_grid to lattice

bulk_invoke without shared argument on vector_executor is inefficient

We should probably reconsider how executor_traits lowers the various calls onto the executor when there are missing member functions.

Right now, bulk_invoke(f,n) on vector_executor goes through bulk_async, which creates a future and immediately destroys it.

Introduce executor_array

Implement future_traits::then()

The implementation of single_agent_then_execute should use it.

Consider making nested_execution_tag variadic

Would look like this:

template<class ExecutionCategory1, class ExecutionCategory2, class... ExecutionCategories>
struct nested_execution_tag
{
  using outer_execution_category = ExecutionCategory1;
  using inner_execution_category = nested_execution_tag<ExecutionCategory2, ExecutionCategories...>;
};

template<class ExecutionCategory1, class ExecutionCategory2>
struct nested_execution_tag<ExecutionCategory1,ExecutionCategory2>
{
  using outer_execution_category = ExecutionCategory1;
  using inner_execution_category = ExecutionCategory2;
};

I believe this is how std::scoped_allocator_adaptor works.

It we made this change, we ought to make a similar change to nested_executor.

implement task_region as an example

Avoid using named functors in nested_executor & flattened_executor

The code is so confusing. We should be able to deduce the type of the lambda parameters without resorting to functors.

Consider having executor::bulk_invoke receive shared_arg by forwarding reference

The implementation of bulk_invoke/bulk_async would need to make local copies of the shared parameter via decay_copy just like std::async & std::thread do. This would be an extra step for implementors of simple executors such as sequential_executor.

function_with_shared_arguments in grid_executor.hpp needs simplification

This template is implemented via a bunch of specializations. It needs to be broken down into orthogonal components. It's getting too hard to modify as-is.

Consider returning a container of results from bulk_invoke & bulk_async

If bulk_invoke returned a container collecting the result of each invocation, algorithms like reduce would be really easy to write:

template<class Iterator, class T, class BinaryFunction>
T reduce(Iterator first, Iterator last, T init, BinaryFunction binary_op)
{
  using namespace agency;
  auto n = std::distance(first, last);

  auto partial_sums = bulk_invoke(par, [=](parallel_agent& g)
  {
    auto partition_size = n / g.size();
    auto i = g.index();

    auto partition_begin = first + partition_size * i;
    auto partition_end   = std::min(last, partition_begin + partition_size);

    return reduce(seq, partition_begin + 1, partition_end, *partition_begin, binary_op);
  });

  return reduce(seq, partial_sums.begin(), partial_sums.end(), init, binary_op);
}

The invoker wouldn't need to do any partitioning or sizing at all if it didn't want to.

Nested invocations would return nested containers, but sometimes that wouldn't be helpful. In these cases, we'd need a way for the lambda to indicate that its invocation group returns a single result, and a way to indicate which result is required.

Singleton form of then_execute should default to future.then()

If no executor member function named then_execute exists, the default implementation should be to try future.then().

single-agent executor operations should receive universal references to functions

We should enable move-only functions to execute when executing a single agent.

Introduce this_thread::vector_executor & this_thread::parallel_executor

Sometimes you want unsequenced semantics, but want to restrict execution to the current thread.

executor_array would use this_thread::parallel_executor to loop through an array of executors calling async_execute without introducing dependencies between each call. Even though the loop would just be a for loop, the execution of each executor would be parallel.

Another use case for this_thread::parallel_executor would be when you want to restrict execution to the current thread, but the order of execution is meaningless.

Agent indices might be backwards

Agency currently puts together agent indices by concatenating the outer indices before the inner indices.

By analogy to a doubly-nested for loop, it looks like this:

int rows = 3;
int cols = 3;

for(int i = 0; i < rows; ++i)
{
  for(int j = 0; j < cols; ++j)
  {
    std::cout << agency::int2{i,j} << std::endl;
  }
}

This produces the sequence

{0, 0}
{0, 1}
{0, 2}
{1, 0}
{1, 1}
{1, 2}
{2, 0}
{2, 1}
{2, 2}

Note this sequence is in lexicographic order.

That's fine, but if we were to compute the ranks of these indices using Agency's current rank computation, those ranks would not correspond to the index's position in the sequence.

Also, it's strange that the faster-changing dimension ('j') comes after the slower-changing dimension ('i'). If I had chosen the names 'y' & 'x' instead of 'i' & 'j', the ordering would seem backwards.

It might be a better idea to put the logical outer loop indices last and the inner indices first. This makes sense because the inner indices change faster than the outer indices.

int rows = 3;
int cols = 3;

for(int i = 0; i < rows; ++i)
{
  for(int j = 0; j < cols; ++j)
  {
    // note the order of i & j is now swapped
    std::cout << agency::int2{j,i} << std::endl;
  }
}

This produces the sequence

{0, 0}
{1, 0}
{2, 0}
{0, 1}
{1, 1}
{2, 1}
{0, 2}
{1, 2}
{2, 2}

It we don't make this change, then we should change the way the operator< is computed for Agency indices to match the order of their ranks. This would be weird though, because it wouldn't match the ordering produced by the underlying tuple (or array, or vector, or whatever) type. Also, it wouldn't match the order in which the agents are logically produced. For sequential executors, I think might be valuable to guarantee that agents are executed in lexicographical order of their indices.

The alternative would be to change the rank computation.

Consider moving bulk_invoke & bulk_async into their own header

They don't really require any of the definitions found inside execution_policy.hpp, and we may want overloads of them for bare executors.