
Comments (8)

mratsim commented on August 17, 2024

Some more readings:

Also, it seems like Cholesky decomposition would be a good candidate to test dataflow dependencies.


mratsim commented on August 17, 2024

The generated code of the following nested loop:

    proc main4() =
      init(Weave)

      expandMacros:
        parallelForStrided i in 0 ..< 100, stride = 30:
          parallelForStrided j in 0 ..< 200, stride = 60:
            captures: {i}
            log("Matrix[%d, %d] (thread %d)\n", i, j, myID())

      exit(Weave)

    echo "\n\nStrided Nested loops"
    echo "-------------------------"
    main4()
    echo "-------------------------"

is:

    proc parallelForSection() {.nimcall, gcsafe.} =
      let this`gensym453091 = localCtx.worker.currentTask
      var i = this`gensym453091.start
      inc this`gensym453091.cur, 1
      while i < this`gensym453091.stop:
        type
          CapturedTy = (typeof(i))
        discard
        proc parallelForSection(i: typeof(i)) {.nimcall, gcsafe.} =
          let this`gensym453805 = localCtx.worker.currentTask
          var j = this`gensym453805.start
          inc this`gensym453805.cur, 1
          while j < this`gensym453805.stop:
            discard "captures:\n  {i}"
            c_printf("Matrix[%d, %d] (thread %d)\n", i, j, localCtx.worker.ID)
            flushFile(stdout)
            j += this`gensym453805.stride
            this`gensym453805.cur += this`gensym453805.stride
            loadBalance(Weave)

        proc weaveParallelFor(param`gensym453408: pointer) {.nimcall, gcsafe.} =
          let this`gensym453409 = localCtx.worker.currentTask
          const
            expr`gensym454268 = "not isRootTask(this`gensym453409)"
          let data = cast[ptr CapturedTy](param`gensym453408)
          parallelForSection(data[])

        let task`gensym453410 = newTaskFromCache()
        task`gensym453410.parent = localCtx.worker.currentTask
        task`gensym453410.fn = weaveParallelFor
        task`gensym453410.isLoop = true
        task`gensym453410.start = 0
        task`gensym453410.cur = 0
        task`gensym453410.stop = 200
        task`gensym453410.stride = 60
        cast[ptr CapturedTy](addr(task`gensym453410.data))[] = i
        schedule(task`gensym453410)
        i += this`gensym453091.stride
        this`gensym453091.cur += this`gensym453091.stride
        loadBalance(Weave)

    proc weaveParallelFor(param`gensym453086: pointer) {.nimcall, gcsafe.} =
      let this`gensym453087 = localCtx.worker.currentTask
      const
        expr`gensym454834 = "not isRootTask(this`gensym453087)"
      parallelForSection()

    let task`gensym453088 = newTaskFromCache()
    task`gensym453088.parent = localCtx.worker.currentTask
    task`gensym453088.fn = weaveParallelFor
    task`gensym453088.isLoop = true
    task`gensym453088.start = 0
    task`gensym453088.cur = 0
    task`gensym453088.stop = 100
    task`gensym453088.stride = 30
    schedule(task`gensym453088)

It should be possible to create a waitable dummy future for each iteration of the outer loop and sync on it in the inner loop.
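
One possible reading of that idea, as a rough sketch rather than working Weave code (capturing a Flowvar in another task is hypothetical here, and `dummy` is a placeholder task whose only purpose is to complete):

    import weave

    proc dummy(): bool = true       # placeholder task; only its completion matters

    proc nestedSketch() =
      init(Weave)
      parallelFor i in 0 ..< 100:
        let outerReady = spawn dummy()   # waitable dummy future for this outer iteration
        parallelFor j in 0 ..< 200:
          captures: {i, outerReady}      # hypothetical: Flowvars cannot be captured today
          discard sync(outerReady)       # the inner loop syncs on the outer future
          discard (i, j)                 # the real loop body would go here
      exit(Weave)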


mratsim commented on August 17, 2024

Some more reflection on nestable barriers (#41).

It doesn't seem possible in the current design, or compatible with work-stealing in general.
What can be done is something similar to OpenMP's taskwait or Cilk's sync, which is to wait for all child tasks to finish.
However, a sync barrier for all threads in a nested context cannot be done: there is no guarantee that all threads will actually get a piece of the work, as tasks are lazily split, so such a barrier would deadlock.

A task-graph approach is possible; however, a lot of care must be given to reducing the overhead.
Note that the main motivation behind nestable barriers, being able to implement a state-of-the-art GEMM, is partially solved by #66.

However, in #66 the following construct prevents the Weave-based GEMM from being nested, e.g. to build an im2col + GEMM convolution parallelized at both the batch level and the GEMM level.

    # 2. for pc = 0,...,k−1 in steps of kc
    for pc in countup(0, K-1, tiles.kc):
      prefetch(tiles.b, Write, LowTemporalLocality)
      let kc = min(K - pc, tiles.kc) # Deal with edges  # A[0:M, pc:pc+kc]
      let kcncB = vB.stride(pc, 0)                      # B[pc:pc+kc, jc:jc+nc]
      pack_B_kc_nc[T, ukernel](tiles.b, kc, nc, kcncB)  # PackB panel [kc, nc] (nc is large or unknown)

      # First time writing to C, we scale it, otherwise accumulate
      let beta = if pc == 0: beta else: 1.T

      sync(Weave) # TODO: this cannot be nested
      # ####################################
      # 3. for ic = 0,...,m−1 in steps of mc
      parallelFor icb in 0 ..< tiles.ic_num_tasks:
        captures: {pc, tiles, nc, kc, alpha, beta, vA, vC, M}
        let packA = tiles.a + icb * tiles.upanelA_size
        prefetch(packA, Write, LowTemporalLocality)
        let ic = icb * tiles.mc
        let mc = min(M-ic, tiles.mc)  # C[ic:ic+mc, jc:jc+nc]
        let mckcA = vA.stride(ic, pc) # A[ic:ic+mc, pc:pc+kc]
        pack_A_mc_kc[T, ukernel](packA, mc, kc, mckcA) # PackA block [mc, kc]
        gebp_mkernel[T, ukernel](     # GEBP macrokernel:
          mc, nc, kc,                 #   C[ic:ic+mc, jc:jc+nc] =
          alpha, packA, tiles.b,      #     αA[ic:ic+mc, pc:pc+kc] * B[pc:pc+kc, jc:jc+nc] +
          beta, vC.stride(ic, 0)      #     βC[ic:ic+mc, jc:jc+nc]
        )
    sync(Weave) # TODO: this cannot be nested
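
Concretely, the nesting we would like, but that the global sync(Weave) above forbids, is something like the following sketch (`im2col` and `gemm_strided` are hypothetical helpers standing in for the convolution building blocks):

    # Desired but currently impossible: batch-level parallelism wrapping
    # the GEMM above. The GEMM internally calls sync(Weave), a global
    # barrier, which cannot run inside another parallel region.
    parallelFor n in 0 ..< batchSize:
      captures: {input, kernel, output}
      let colBuffer = im2col(input[n])             # hypothetical im2col helper
      gemm_strided(colBuffer, kernel, output[n])   # hypothetical wrapper over the #66 kernel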


mratsim commented on August 17, 2024

More support for abandoning fully nestable barriers: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3409.pdf. Instead, use the Cilk "fully strict" semantics (only await the tasks spawned by the current task) so that the barrier is composable.

That does not remove the need for task dependency graphs (implemented as message passing!).


mratsim commented on August 17, 2024

X10 introduces "terminally strict" computations, where a task can await any of its descendants, not just its direct children. Weave can actually do that. One interesting thing is that X10 decouples function scope from sync/awaitable scope. See this paper: https://www.semanticscholar.org/paper/Work-First-and-Help-First-Scheduling-Policies-for-Barik-Guo/2cb7ac2c3487aae8302cd205446710ea30f9c016

Closely related to our needs, the following paper presents Data-Driven Futures to express complex dependencies; this is equivalent to "andThen" continuations in async/await: https://www.cs.rice.edu/~vs3/PDF/DDT-ICPP2011.pdf. The model seems compatible with channel-based work-stealing. According to the paper, their Cholesky decomposition outperformed the state of the art at publication time, which is interesting as Cholesky is array-based.
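
To give the flavor of the model: a Data-Driven Future is an empty single-assignment container, and a task registered on it becomes runnable only once someone puts a value into it. A minimal Nim-flavored sketch (`DDF`, `newDDF`, `asyncAwait`, `put`, and `get` are transliterated from the paper's API, not an existing Nim library):

    # Data-Driven Future sketch (hypothetical API, after the DDT paper).
    var tile = newDDF[Matrix]()   # empty single-assignment future

    asyncAwait(tile):             # task declared with its data dependency;
      process(tile.get())         # its body runs only after `tile` is put

    tile.put(computeTile())       # fulfilling the DDF releases the task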


mratsim commented on August 17, 2024

Slides of Data-Driven Futures: http://cnc11.hpcgarage.org/slides/Tasirlar.pdf

Cholesky with Data-Driven Futures


This can be benchmarked against either LAPACK's Cholesky or a lower-level, OpenMP task-dependency-based Cholesky from http://www.netlib.org/utk/people/JackDongarra/PAPERS/task-based-cholesky.pdf
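
For reference, the tiled right-looking Cholesky that these papers parallelize has the following shape (a sketch in Nim-like pseudocode; `potrf`, `trsm`, `syrk`, and `gemm` are the usual LAPACK/BLAS tile kernels, and the tiles each call reads and writes are exactly the dependencies a dataflow runtime must track):

    # Tiled Cholesky factorization of an nt x nt grid of tiles A (sketch).
    for k in 0 ..< nt:
      potrf(A[k][k])                        # factorize diagonal tile
      for i in k+1 ..< nt:
        trsm(A[k][k], A[i][k])              # panel solve: reads A[k][k]
      for i in k+1 ..< nt:
        for j in k+1 ..< i:
          gemm(A[i][k], A[j][k], A[i][j])   # trailing update: reads two panel tiles
        syrk(A[i][k], A[i][i])              # symmetric update of the diagonal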

Here are LU decomposition and Cholesky decomposition in pseudocode, with their dependency graphs, from http://www.cs.ucy.ac.cy/dfmworkshop/wp-content/uploads/2014/07/Exploring-HPC-Parallelism-with-Data-Driven-Multithreating.pdf


Side thoughts

Stream/pipeline and graph parallelism

Data-driven futures seem suitable to express pipeline and graph parallelism.

Heterogeneous computing

Besides the distributed computing scenario, another thing to keep in mind is whether this helps or hinders heterogeneous (distributed?) CPU + GPU computing.


mratsim commented on August 17, 2024

#86 introduces awaitable parallel loops.

However, this is not sufficient to allow nesting matrix multiplications inside parallel regions (for example in a batched matrix multiplication, a convolution, or LU/QR/Cholesky factorizations).

So we really need to be able to express data dependencies (currently we only express control-flow dependencies via spawn/sync).

Research

Looking into other runtimes (TBB Flow Graph, XKaapi/libkomp, Cpp-Taskflow, Nabbit, ...), most use a concurrent hash table, which doesn't map well to message passing, distributed computing, or GPU computing.

StarPU might provide a way forward (a nice high-level overview of its history and challenges is given in "On Runtime Systems for Task-based Programming on Heterogeneous Platforms").

More Cholesky representations can be found in the StarPU paper.

Ideas

We could add support for delayed tasks and promises: each worker gets a separate queue of delayed tasks, and a delayed task is moved to the normal work-stealing scheduler only when its promise is fulfilled.
Promises can be implemented as a channel plus a delivered/fulfilled boolean (since a plain channel read consumes the message). They probably need to be threadsafe, reference-counted objects: the thread that allocates a promise, declares a deferred task, and returns is not necessarily the thread that fulfills the promise and then triggers the declared task.

For example

    type Promise = ptr object
      fulfiller: Channel[Dummy]
      refCount: Atomic[int32]
      isDelivered: Atomic[bool]

A task dependent on a promise would regularly check whether it was delivered and, if not, query the fulfiller channel. If the query succeeds, isDelivered is set to true (so that multiple tasks can depend on the same promise) and the task is scheduled.
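
A minimal sketch of that check, assuming the Promise layout above plus hypothetical `tryRecv`, `schedule`, and `Task` entry points for the channel, the scheduler, and the task object:

    import std/atomics

    # Poll a delayed task's promise; returns true if the task was handed
    # to the normal work-stealing scheduler (sketch, hypothetical API).
    proc tryRelease(p: Promise, task: Task): bool =
      if not p.isDelivered.load(moAcquire):
        var msg: Dummy
        if not p.fulfiller.tryRecv(msg):
          return false                      # promise not yet fulfilled
        # We consumed the fulfillment message: publish delivery so that
        # other tasks depending on the same promise also see it.
        p.isDelivered.store(true, moRelease)
      schedule(task)                        # enter the normal runtime
      true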

Pending issues:

For-loop dependencies

Supporting for-loop dependencies might require creating a matrix of 1000x1000 promises if we want to express dependencies at the cell level of a 1000x1000 matrix. Obviously this is a worst case, and:

  • tiling considerably reduces the number of promises: tiling by 32 along both dimensions requires only ⌈1000/32⌉ * ⌈1000/32⌉ = 32 * 32 = 1024 promises
  • the memory subsystem can handle trillions of tasks

Nonetheless, there might be more efficient data structures, like interval trees or segment trees. The channel could be replaced by an MPSC channel: a consumer would check the interval data structure to see whether its dependency is fulfilled; if not, it takes a lock, drains the channel, and updates the interval data structure with all the received chunks.
This would unfortunately add another synchronization data structure beyond channels and event notifiers.
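
A sketch of that consumer-side protocol, with hypothetical `IntervalSet` (with `covers`/`incl`) and `MpscChannel` types standing in for the interval structure and channel discussed above:

    import std/locks

    # Check whether `needed` is fulfilled; if the cached view says no,
    # drain the MPSC channel under a lock and retry (sketch, hypothetical types).
    proc isFulfilled(deps: var IntervalSet, needed: Slice[int],
                     updates: var MpscChannel[Slice[int]],
                     lock: var Lock): bool =
      if deps.covers(needed):
        return true
      withLock lock:
        var chunk: Slice[int]
        while updates.tryRecv(chunk):   # drain all fulfilled chunks
          deps.incl(chunk)              # record them in the interval structure
      deps.covers(needed)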

In distributed runtimes

Distributed runtimes are the most relevant here as they have to use message-passing across a cluster.

Kokkos seems to use an array of dependencies:
https://www.osti.gov/servlets/purl/1562647
StarPU and HPX remain to be checked.

Delayed task distribution

If we add support for delayed tasks, a thread might create a lot of delayed tasks that all sit in its own delayed-task queue. The proposed scheme only adds them to the normal runtime once their dependencies are fulfilled, meaning a single thread might end up holding all the tasks. This is similar to the load imbalance that the single-task-producer benchmark measures. It should be solved by adaptive steal-half.

Reducing latencies

In the proposed scheme, fulfilling a promise is done by user code, so the runtime has no knowledge of which tasks are actually blocking further work. AST analysis could be done to locate such tasks and place them in a "priority" deque (it wouldn't support real priorities; those tasks would just be run, or stolen from, first).


mratsim commented on August 17, 2024

Another extension of Cilk that adds dataflow-graph-aware scheduling: Swan.

In terms of benchmarks, besides Cholesky, a lot of publications reference the Smith-Waterman algorithm for DNA and protein sequence alignment; a computational-biology benchmark would be a nice change of pace.

See: https://uh-ir.tdl.org/bitstream/handle/10657/558/Thesis12.pdf?sequence=1&isAllowed=y

and https://www.cs.colostate.edu/~pouchet/doc/lcpc-article.15.pdf (polyhedral compiler generated)

Use of dataflow graph parallelism

Dataflow graph parallelism / stream parallelism / pipeline parallelism / data-driven task parallelism (AFAIK these are all the same thing) can express computations that depend on data readiness, while plain task parallelism and data parallelism are only aware of control flow.
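
As a tiny illustration of the difference: with spawn/sync we can only wait for a task to return, while with data-driven scheduling each stage can be declared up-front and released the moment its input is ready. A sketch of a per-chunk pipeline, using a hypothetical Promise/put/then API in the spirit of the "andThen" continuations mentioned earlier:

    # Hypothetical dataflow pipeline: later stages are declared up-front
    # and run as soon as the promise they depend on is fulfilled.
    for chunk in 0 ..< numChunks:
      let decoded  = newPromise[Frame]()
      let filtered = newPromise[Frame]()
      decoded.then:  filtered.put(filter(decoded.get()))       # stage 2
      filtered.then: writeOut(output[chunk], filtered.get())   # stage 3
      decoded.put(decode(input[chunk]))                        # stage 1 fires the chain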

This is needed for:

Benchmarking

All the Polybench iterative stencil benchmarks used for evaluating polyhedral compilers are relevant for benchmarking dataflow parallelism.
Writing a video encoder for benchmarking is probably a bit difficult, but who knows.

