shine's Issues

Unrolling private arrays to scalar variables.

float x37[3]; // private memory

would become:

float x37_0;
float x37_1;
float x37_2;

It should be possible to perform this unrolling on imperative DPIA.
Combinations of structs and arrays would also be deconstructed to scalar variables in a similar way.
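
To illustrate, here is a minimal sketch of the idea on a toy imperative AST (all names here are hypothetical, not the actual DPIA classes); it only applies when every index into the private array is statically known:

sealed trait Stmt
case class ArrayDecl(name: String, size: Int) extends Stmt             // float name[size];
case class ScalarDecl(name: String) extends Stmt                       // float name;
case class Assign(lhs: String, idx: Option[Int], rhs: String) extends Stmt

def unroll(s: Stmt): Seq[Stmt] = s match {
  // float x37[3]; becomes float x37_0; float x37_1; float x37_2;
  case ArrayDecl(name, size) => (0 until size).map(i => ScalarDecl(s"${name}_$i"))
  // x37[1] = e; becomes x37_1 = e; (requires a statically known index)
  case Assign(lhs, Some(i), rhs) => Seq(Assign(s"${lhs}_$i", None, rhs))
  case other => Seq(other)
}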

Keeping track of performance metrics

It would be nice to have a way to keep track of some performance metrics over time.
We care about rewriting, type inference, compilation and generated code performance.
Being able to monitor a set of benchmarks, for example with something like cargo benchcmp, could be useful.

Refactor DPIA traversals

Refactor DPIA traversals following the refactoring of rise traversals (#98). DPIA traversals are already a lot closer to the current rise traversals. We could however benefit from generalising Stop and Continue to a monad, and from invoking visitors directly: instead of returning Continue(t, visitor) and then invoking visitor.visit(t), we can just return visitor.visit(t). A minimal sketch of the idea follows.
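
For illustration, a monadic traversal on a toy AST (the Monad and Traversal interfaces here are hypothetical, not the actual #98 ones):

sealed trait Expr
case class Var(name: String) extends Expr
case class App(f: Expr, e: Expr) extends Expr

trait Monad[M[_]] {
  def pure[A](a: A): M[A]
  def bind[A, B](ma: M[A])(f: A => M[B]): M[B]
}

// The old Stop corresponds to a short-circuiting monad (e.g. Either),
// Continue to the identity monad; visitor results are returned directly.
abstract class Traversal[M[_]](val monad: Monad[M]) {
  import monad._
  def expr(e: Expr): M[Expr] = e match {
    case v: Var => pure(v)
    case App(f, a) =>
      bind(expr(f))(f2 => bind(expr(a))(a2 => pure(App(f2, a2))))
  }
}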

Pair assignment decision

In the TranslationContext, we can define assignment of pair types in two different ways:

  1. a single assignment: pAcc := pExp
  2. two element-wise assignments: pAcc.fst := pExp.fst; pAcc.snd := pExp.snd

We have some examples that are broken because they rely on the first solution, while we currently implement the second one (which is useful for other examples).

TODO Tensor Cores

TODO

  • Include headers

    Using Tensor Cores requires including the "mma.h" header. At the moment the include statement is printed in the kernel declaration. This is ugly.
    It would be better if the include statement were part of the C AST and were inserted only when this header is actually necessary. To do this, it makes sense to add an additional set of required headers to the Environment class, as sketched below.
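
    A minimal sketch of this idea, assuming a simplified Environment class (the real one has more fields):

    import scala.collection.mutable

    // Primitives that need a header record it; the printer then emits one
    // include per recorded header, so "mma.h" only appears when a Tensor
    // Core primitive actually requested it.
    case class Environment(requiredHeaders: mutable.Set[String] = mutable.Set.empty) {
      def requireHeader(name: String): Unit = requiredHeaders += name
    }

    def printIncludes(env: Environment): String =
      env.requiredHeaders.map(h => s"#include <$h>").mkString("\n")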

  • CUDA-Executor Include Headers

    The "mma.h" header is in some NVIDIA standard-include directory which we currently need to search and specify as argument for the CUDA-Executor. E. g. pass the "--include-path=/opt/cuda/targets/x86_64-linux/include" compiler option to the executor. The CUDA-Executor should be able to serach at standard include directories.

  • GlobalToShared primitive

    The GlobalToShared primitive copies data from global memory to shared memory without going through registers. Using it requires the Ampere architecture (or newer) and CUDA 11 (or newer).
    If the Turing architecture or older is used, the generated code is automatically rewritten by the CUDA compiler, so it makes sense to always use the GlobalToShared primitive instead of toLocal. Instead of a separate GlobalToShared primitive, it would be better to change the translation of toLocal whenever we generate CUDA code.

    // without GlobalToShared: copy via an intermediate register
    for ( i = ... ) {
        // { int tmp = gldata[i]; smem[i] = tmp; }
        smem[i] = gldata[i];
    }

    // with GlobalToShared: asynchronous copy, no intermediate register
    pipeline pipe;
    for ( i = ... ) {
        // initiate the async copy of the element into shared memory
        memcpy_async(smem[i], gldata[i], pipe);
    }
    // wait for the async copies to complete
    pipe.commit_and_wait();

    At the moment there is an additional datatype for the pipeline. It does not make sense to model the pipeline as its own datatype, because it can only be used within the GlobalToShared primitive. The pipeline datatype only exists because the translation of GlobalToShared uses a new primitive to declare a pipeline object.

    Additionally, it would be great if we could track accesses to shared memory, in order to optimize the placement of pipeline.wait() and improve performance.
    If we can track this, it could also be used to optimize the reuse of shared memory.

  • Refactor code generation for the mapFragment element primitive

    The code generation looks a bit ugly and could be refactored.

  • Refactor the CUDA backend and the OpenCL backend (and the shine.C.Program class, which allows running a C program using the CUDA-Executor)

  • High-level RISE and Rewrite rules stuff

Improve Performance

  • Synchronisations

    The generated kernel code contains more synchronizations than necessary; reducing their number could improve performance.
    For example, two synchronizations are generated for loading the two block-level tiles, although only one is necessary.
    Thomas introduced a new method for inserting barriers. This should work for CUDA too.

  • CUTLASS

    "CUTLASS is a collection of CUDA C++ template abstractions for implementing high-performance matrix-multiplication" (https://github.com/NVIDIA/cutlass)

    CUTLASS achieves over 80% of the performance of cuBLAS. It could be useful to understand how the CUTLASS kernels work in order to tune our own kernels.

  • Nvidia profiler

    The Nvidia profiler could also be useful for detecting bottlenecks and tuning the kernel.
    It is also useful for analyzing the parameters of other kernels (number of threads, blocks, warps, tiling sizes, shared memory, etc.).

  • Double Buffering

    This could improve performance by hiding the latency of loads from global memory.

  • PTX

    At the moment the WMMA API is used to target Tensor Cores. The WMMA API does not allow using Tensor Cores directly; instead, a WMMA API call is translated differently on different architectures so that multiple Tensor Cores are used cooperatively.
    The advantage is that the WMMA API does not depend on a specific architecture.

    In contrast, a PTX instruction directly targets a single Tensor Core. Because the Tensor Cores differ between architectures, the PTX instructions are architecture-specific. To generate the correct PTX instruction, we have to know the architecture of the GPU.
    On the Volta architecture, a Tensor Core calculates an MMA instruction on 4-by-4 matrices per quad-pair. On the Ampere architecture, a Tensor Core calculates an MMA instruction on 4-by-8(?) matrices per quad-pair.

    Using PTX could improve performance, but it is more difficult to generate the code and to express the kernel in low-level RISE.

Bugs

  • AdjustArraySizesForAllocations

    This does not work properly for the expression that stores the result matrix in shared memory and then in global memory.
    This occurs in the epilogV2 expression in gemmTensor.scala. At the moment there is a temporary bugfix that forces the ParallelismInfo of the whole expression to "Warp".
    The error occurs because the mapWarp inside the expression is nested within other expressions and is therefore never visited, so the detected ParallelismInfo is wrong.

  • TypeCheckBug

    Expected to find `(output : acc[n14478.n14479.f32])' in the environment: `HashMap(..., (output : acc[(2*n14478*n14479*(1/^((2*n14479)))).n14479.f32]) -> DeclRef(output), ...)'

    The output identifier's type is not simplified.
    The error only occurs when the whole test class is run; running the single test case alone does not trigger it.

  • Running the GEMM kernel with larger tiling sizes on the V100

    Since CUDA 11, some of the generated kernels do not run properly: a CUDA_ERROR_OUT_OF_RESOURCES error occurs.
    I suspect the kernels use too many registers.
    Specifying the --maxrregcount compiler option solved this problem on the Titan RTX.
    But the GEMM kernel with larger tiling sizes still does not run on the V100, again because of CUDA_ERROR_OUT_OF_RESOURCES.
    This does not really make sense, because the two GPUs have exactly the same number of registers.

  • oclReduceSeq allocates memory

    The oclReduceSeq primitive always allocates memory for the accumulator, and additionally the accumulator needs to be initialized. This is not always necessary.
    As a current workaround, for a few Tensor Core kernels the generated code is adjusted to remove the allocation and the initialization. This is done in an ugly way in the kernel class, for the case where two reduceSeqs use fragments as accumulators.

Clear distinction between C, OpenMP and OpenCL primitives

Currently it is possible to create Phrases that are going to be passed to the OpenCL backend but that include C and OpenMP primitives such as ReduceSeq, where OpenCLReduceSeq should be used instead.

The difference between the primitives is that OpenCLReduceSeq takes as an additional parameter the AddressSpace of the accumulator.

I see two angles from which this problem can be viewed at the moment.

One is that ReduceSeq should not be in core, which would then allow us to include only core and OpenCL to choose primitives from when creating an OpenCL expression, removing the need for the prefixes (however, the C primitives could still be used accidentally).

The reasoning would be:

The OpenCL prefixes in front of these primitives are mainly needed because there are other primitives like ReduceSeq in the core package, which will always be included but cannot be used in OpenCL. For convenient use, the OpenCL primitives therefore have completely different names and are not just placed in another package.

This shows that placing ReduceSeq in core is actually a design mistake: it cannot be used in OpenCL and can therefore not be part of a unifying core between the backends (there are more primitives with this problem, such as OpenCL-/SlideSeq or OpenCL-/Iterate).

The other angle this problem can be viewed from is that there should not be any difference between the primitives of the different backends. This would mean that there is only a single ReduceSeq in core, and this primitive takes the AddressSpace as a parameter.
Depending on the backend, the definitions of the Kinds would change, e.g., AddressSpace = Heap | Stack for C and AddressSpace = Private | Local | Global for OpenCL, as sketched below. However, at least currently, we would not make any use of the address space in C and would be carrying around unused information. Also, I do not immediately see how we can make sure that the correct AddressSpace is used while constructing expressions.
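
For illustration, the kinds of the second option could look roughly like this (a sketch, not the current shine definitions):

sealed trait AddressSpace
// C backend interpretation
object C {
  case object Heap  extends AddressSpace
  case object Stack extends AddressSpace
}
// OpenCL backend interpretation
object OpenCL {
  case object Private extends AddressSpace
  case object Local   extends AddressSpace
  case object Global  extends AddressSpace
}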

Do you have any thoughts or suggestions for this? Is there another angle that I missed? Which option would you choose?

Generic primitive translation from Rise to DPIA

I worked on formalising the translation and came up with this generic formalism:
[image: generic translation formalism]
We should be able to implement the translation in this generic style instead of the current boilerplate, by dealing with the {} -> function type and having a map from Rise primitives to DPIA primitives.

DPIA Macro replacement

#127 replaces the RISE @primitive macro and instead generates Scala code at Scala compilation time.

This has a number of advantages:

  • no weird macro behavior that confuses us and the IDE
  • better documentation of our primitives
  • more flexibility for the future to generate different front- and back-ends for our primitives
  • removing the macros enables a migration path towards Scala 3

For the reasons above, we should do something similar for the DPIA macros: @expPrimitive, @accPrimitive, and @comPrimitive.

Here are some considerations and points that we have to address for this.

Syntax for DPIA primitives
In #127 we have developed a parser for RISE type schemas. We can reuse parts of this but need to expand it for DPIA.
The syntax implemented in #127 for RISE primitives is:

def drop: (n: nat) -> {m: nat} -> {t: data} -> (n+m).t -> m.t

For DPIA I suggest the following syntax:

def drop(n: nat, m: nat, t: data, input: exp[n+m.t, read]): exp[m.t, read]

Here are some reasons for this design:

  • In RISE, primitives are functions, but not in DPIA, where drop represents a fully applied function. The syntax above makes this explicit.
  • By requiring a name for each argument, we don't have to invent names when generating the corresponding class.

We would then generate a class with the following interface from it:

final case class Drop(n: Nat,
                      m: Nat,
                      dt: DataType,
                      array: Phrase[ExpType]
                     ) extends ExpPrimitive {
  array :: expT((n + m)`.`dt, read)
  override val t: ExpType = expT(m`.`dt, read)

  // ...
}

Translation Boilerplate
Currently, primitives are translated from RISE to DPIA via the (massive) fromRise function, which internally uses InferAccessAnnotation. We should auto-generate this code!

I think that we need access to both the RISE and the DPIA types at the same time to generate this code. Maybe we can create a mapping file that associates the RISE and DPIA primitives (which will be defined in different files). Something like:

rise.core.primitives.drop => shine.DPIA.primitives.drop

We would then parse both definitions to get the types and generate the translation boilerplate code.

Refactorings
There are a number of refactorings that would be useful to do for this:

  1. factor the continuationTranslation, acceptorTranslation, fedeTranslation, and streamTranslation out of the DPIA Primitive classes. This would (as we discussed before) also better follow the functional pattern matching style that we have almost everywhere in the compiler.
  2. remove the eval function from the DPIA Primitive. Nobody uses it and it is barely implemented for any primitives, so that nobody could use it even if they wanted to.
  3. remove the xmlPrinter. Nobody uses it anymore.

I would suggest we do 1. before we replace the macros. Otherwise, we will have to integrate these with the generated code somehow. 2. and 3. are a bit orthogonal to this and just cleanups: we can also keep generating code for these.

Opinions @Bastacyclop, @umazalakain, and others?

Canonical form for OpenMP constructs

Currently, the generated code for mapPar fails to compile with some OpenMP implementations because it does not respect the canonical loop form.

Example:

#pragma omp parallel for
for (int i_244 = 0;(i_244 < n18);i_244 = (1 + i_244)) {
}

error: invalid controlling predicate

[FEATURE REQUEST] port benchmarks from LIFT to shine

Fantastic benchmarks and where to find them

A list of benchmarks (mostly from cgo18) which are not in the current shine repository, and the locations of their LIFT implementations.

Redefine equality of RISE primitives based on their names

Alpha equivalence judges two primitives equal iff they have the same name. The macros for primitive generation, however, override the .equals of primitives and judge two primitives equal if their types are alpha equivalent. We should either remove the .equals override and deal with the consequences, or make it compare names, in line with what alpha equivalence does.
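
The second option could be as simple as this sketch (assuming primitives expose a name):

abstract class Primitive {
  def name: String
  // in line with alpha equivalence: equal iff the names match
  override def equals(obj: Any): Boolean = obj match {
    case p: Primitive => p.name == this.name
    case _            => false
  }
  override def hashCode(): Int = name.hashCode
}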

Refactor OpenCL and CUDA backends

After #100, the code should be refactored to avoid code duplication between CUDA and OpenCL.
Some notes about that:

  • both backends implement HoistMemoryAllocations and AdaptKernelBody passes, we should look into merging them.
  • there is InjectWorkItemSizes for OpenCL and InjectThreadSizes for CUDA, they basically do the same thing.
  • CUDA has mapBlock/mapThreads/mapGlobal, OpenCL has mapWorkgroup/mapLocal/mapGlobal, we can have a common abstraction. mapWarp/mapLane are also available in CUDA, in the future it might also make sense to have these for OpenCL (e.g. to use AMD wavefronts), but this is a device-dependent feature (it makes no sense on a CPU device). This also impacts ParallelismLevel and imperative parFors.
  • OpenCL has an InsertMemoryBarriers pass, something similar has to be done for CUDA and the code could be shared

Continuous code style analysis?

I have recently introduced a basic scalastyle config (https://github.com/rise-lang/shine/blob/master/scalastyle_config.xml). IntelliJ should automatically pick up this file and check the code against some style rules. We can obviously think about progressively modifying the current rules according to our needs.

This can be compared to scalafmt, except that we had issues with autoformatting. This is a less invasive way to enforce some style rules, producing warnings and errors instead of changing the code.

Ideally, we would set up something like Codacy to integrate the scalastyle feedback into our github workflow and PR review process.

@rise-lang/core

Make `Expr.hashCode` follow `Expr.equals`

Expr.hashCode tests for syntactic equality, while Expr.equals tests for alpha-equivalence. Discriminating by Expr.hashCode is only ever used to then discriminate by Expr.equals, so the former should be more liberal.
A more comprehensive solution is discussed in #77.

Clear expression and type equality definitions

We need to properly discuss equality for expressions and types so that we have one or if necessary multiple clear definitions.
I believe that we need multiple definitions, depending on the context, but we should clarify them and discuss what should be the default one (==).

Explicit dependence of type variables in DepFuns.

Unification of dependent functions that contain type variables is not trivial - and it's a necessary step in the current type inference process.
Consider the unification of the following dependent functions

f1: n:nat => e1
f2: m:nat => e2

The naive algorithm proceeds by doing

1) introduce new nat k
2) n = k
3) m = k
4) unify(e1{k/n}, e2{k/m})
[...]

Superficially, this works fine. Consider however the case when e1 and e2 are just type variables t1 and t2.
Repeating the above...

1) introduce new nat k
2) n = k
3) m = k
4) unify(t1{k/n}, t2{k/m}), which is the same as
   unify(t1, t2)
5) t1 = t2

This only works if t1 and t2 stand in for types which are not dependent on n and m respectively! If they are, by the time
we have performed the substitution in step 4, this dependence is completely lost.

Without dependent arrays, situations of this sort seem to never happen (though I am not sure they are actually impossible). But with dependent arrays, a simple program such as depMapSeq(nFun(_ => mapSeq(fun(x => x + 1)))) xs runs into this very problem: when resolving the constraints for the type of nFun(_ => mapSeq(fun(x => x + 1))), we are dealing with type variables which are, in reality, dependent on the depFun's argument - we just don't know it yet.

A solution for tackling this problem is to make this dependence explicit. That is, changing the algorithm to do

unifying (n:nat => t1, m:nat => t2)
1) introduce new nat k
2) n = k
3) m = k
4) create fresh f1: nat => data
5) create fresh f2: nat => data
6) t1 = f1(k)
7) t2 = f2(k)
8) unify(t1{k/n, f1(k)/t1}, t2{k/m, f2(k)/t2}), which is the same as
   unify(f1(k), f2(k))
9) f1 = f2

This preserves the dependence: if we later resolve, for example, that t1 = k.int, and we know that f1(k) = t1, we can recover that f1 = (i:nat) => i.int, and likewise f2. Without this crucial step that would be impossible - which is exactly the problem that happens in practice with dependent arrays.

Doing this transformation is however very slow (lots of rebuilding of ASTs, big solutions) and invasive - and it seems to break inference in the most complex cases, because the arithmetic expression unifier was not necessarily developed to handle them.
So, for now, the feature is "gated" behind a flag that is passed to the type inference system.

Can we have a better "in principle" solution?
Are we sure that this problem only appears in programs with dependent arrays - or really we have just been "lucky" so far?

Add support for directed type inference constraints

We currently use freeze and unfreeze to make type inference directed: we convert implicit type variables into explicit type variables, solve constraints (which will substitute implicits with explicits but never the other way around), and convert back. This is very unintuitive and prone to errors. We could instead have explicit Undirected : Type -> Type -> Constraint and Directed : Type -> Type -> Constraint, as sketched below.
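
Sketched (hypothetical constructors, assuming rise.core.types.Type):

import rise.core.types.Type

// Undirected constraints may substitute either side; Directed constraints
// may only substitute the left-hand side by the right-hand side.
sealed trait Constraint
case class Undirected(a: Type, b: Type) extends Constraint
case class Directed(from: Type, to: Type) extends Constraint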

Rewrite arithexpr as a language

Using the current arithexpr library has a host of issues:

  • The subtyping integration with RISE and DPIA is prone to bugs. Identifiers in arithexpr are extended with explicitness information, which is sometimes lost when translating to arithexpr and back. On the flip side, range information in arithexpr is often lost when translating to RISE/DPIA and back.
  • Arithexpr offers no support for lambdas. This results in NatToNat constructs in both RISE and DPIA.

A new arithexpr library should:

  • Implement a type system that carries range information. A range is a minimum and a maximum bound, and a bound is an arithmetic expression (which may be +/- infinity). Note that these arithmetic expressions can contain identifiers too (a small sketch follows this list).
  • Support lambda abstraction.
  • Integrate well with RISE/DPIA. The most reasonable way to do this seems to not keep any metadata in the identifiers themselves. Instead, we keep metadata in an environment. Translating to and from RISE/DPIA needs to translate the environment as well. The addition of lambda abstractions might pose an issue here, as bindings might have to contain some metadata as well.
    • It should be easy to embed Rise/DPIA expressions into Nat expressions (with a dedicated Nat constructor)
  • [add yours]
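
A small sketch of what such range-carrying expressions could look like (all names hypothetical):

object arithexpr {
  // Identifiers carry no metadata themselves; ranges live in an environment.
  sealed trait ArithExpr
  case class Var(id: Int) extends ArithExpr
  case class Cst(value: Long) extends ArithExpr
  case class Sum(lhs: ArithExpr, rhs: ArithExpr) extends ArithExpr
  case class Lambda(param: Int, body: ArithExpr) extends ArithExpr // lambda support
  case object PosInf extends ArithExpr
  case object NegInf extends ArithExpr

  // Bounds are arithmetic expressions and may themselves contain identifiers.
  case class Range(min: ArithExpr, max: ArithExpr)
  type RangeEnv = Map[Int, Range]
}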

Improve @rule notation to wrap a partial function

This would avoid explicitly asserting the right hand side type (!: e.t) and explicitly constructing a Success.

// before
@rule def `*f -> S >> **f >> J`(n: Nat): Strategy[Rise] = {
    case e @ App(map(), f) => Success((split(n) >> map(map(f)) >> join) !: e.t)
}

// after
@rule def `*f -> S >> **f >> J`(n: Nat): Strategy[Rise] = rewrite {
    case e @ App(map(), f) => split(n) >> map(map(f)) >> join
}
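
A sketch of what the rewrite wrapper could do, assuming elevate's Strategy[P] = P => RewriteResult[P] with Success/Failure and rise's !: type assertion:

def rewrite(pf: PartialFunction[Rise, Rise]): Strategy[Rise] = e =>
  if (pf.isDefinedAt(e)) Success(pf(e) !: e.t) // assert the matched type
  else Failure(rewrite(pf))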

Distinguish between untyped and typed expressions at the type level

Currently, Exprs have Types, and Types can be untyped by being TypePlaceholders. Then we have ToBeTyped, which we use to erase all Types with TypePlaceholders and drive type inference. Places that assume an Expr is fully typed (i.e. has no TypePlaceholders) cannot express this through types.

How desirable would it be to take TypePlaceholder out of Type, turn Expr into Expr[T] where for e : Expr[T] we have e.t : T, and then handle Expr[Unit] for untyped expressions and Expr[Type] for fully typed ones?
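
Sketched, with hypothetical signatures:

// The type annotation's type becomes a parameter of Expr:
// Expr[Unit] is untyped, Expr[Type] is fully typed (no TypePlaceholder).
case class Expr[T](/* children, ... */ t: T)

def erase(e: Expr[Type]): Expr[Unit] = ??? // replaces every annotation by ()
def infer(e: Expr[Unit]): Expr[Type] = ??? // may only produce real Types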

`makeClosed` does not close expressions correctly

Currently makeClosed only closes the type variables in the type of the outer term: type variables that do not appear in the type of the outer term but do appear in the types of subterms will not be closed by makeClosed.

[BUG] Hypermapper client server

Preparation

  • Install HM using pip: pip install hypermapper
  • Check out the branch: git checkout autotune-jl

Main files

Tests are in src/test/scala/rise/autotune/hypermapper_client_server.scala.
Python scripts are in the autotuning folder.

Test minimal example

  • Starts sub_proc_example.py as a subprocess and performs some communication

Test hm client server [BUG]

  • Starts hm_wrapper.py as a subprocess, which performs tuning in client-server mode using convolution.json
  • hm_wrapper.py does not get a new request from HM, which means the str_to_hypermapper answer does not reach hm_wrapper.py.

Test communication to HM

Running python hm_wrapper.py --config_file convolution.json

  • type in: tile,vec,runtime,Valid
  • two random values are printed (here called x and y)
  • type in: x,y,1,True
  • type in: x,y,1,True
  • a new request is printed to the screen, which means the communication works fine

Add property based tests with scalacheck

The traversals PR (#98) has highlighted that the existing battery of tests sometimes misses (big) mistakes in the behavior of traversals. Property based testing can be of great help here, as it allows for a much bigger surface area to be tested, and also serves as (executable) documentation of some of the properties of Rise programs. Possible properties to investigate include (a sketch of the first one follows the list):

  • type-erase ∘ type-infer == id
  • type-infer expr == Success typedExpr => NoTypePlaceholders typedExpr
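
For example, the first property could look like this (a sketch; genExpr, typeErase, and typeInfer are hypothetical names):

import org.scalacheck.{Gen, Properties}
import org.scalacheck.Prop.forAll

object RiseProperties extends Properties("rise") {
  // hypothetical generator for well-typed Rise expressions
  val genExpr: Gen[Expr] = ???

  property("type-erase ∘ type-infer == id") = forAll(genExpr) { e =>
    typeErase(typeInfer(e)) == e
  }
}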

Better compilation messages

Most of our compilation messages come from fatal errors that throw exceptions, often with hard to understand messages. We should progressively improve on this.

I started adding WARNING: prints, but we could have a more serious logging system, or something like a CompilationResult ADT (sketched below) to gather information, warnings, and maybe errors during compilation.
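
Sketched (names hypothetical):

// Gather diagnostics alongside the compilation artefact instead of
// printing WARNING: lines and throwing on fatal errors.
case class CompilationResult[A](result: Either[Seq[String], A], // errors or artefact
                                info: Seq[String],
                                warnings: Seq[String])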

RISE Syntax Discussion

Hi everyone.

At the start let me point to Wadler's Law (before the fighting begins ;-)):

In any language design, the total time spent discussing
a feature in this list is proportional to two raised to
the power of its position.
    0. Semantics
    1. Syntax
    2. Lexical syntax
    3. Lexical syntax of comments

This issue should be a space to discuss syntax for RISE.

I believe we should aim for:

  1. consistency
  2. clarity and closeness to the formal syntax and to existing FP conventions (and yes, there are many to choose from ...)

Here is a summary of most of the syntax we have currently in RISE.
#66 would change some of this syntax (especially nFunT to forallNat).

Term Syntax

  • Lambda:
    • fun(x => e)
    • fun(t)(x => e)
  • DepLambda:
    • nFun(n => e) | nFun(r, n => e)
    • dtFun(dt => e)
  • App:
    • f(e)
    • e |> f
    • f $ e
  • DepApp:
    • f(n) | f(dt) | f(a) | f(n2n) | f(n2d)
  • Literal:
    • l(i) | l(f) | l(d)
    • lidx(i, n)
    • lvec(v)
    • larr(a)
  • Function composition:
    • f >> g
    • g o f

Type Syntax

  • FunType:
    • t1 ->: t2
  • DepFunType:
    • nFunT(n => t)
    • dtFunT(dt => t)
    • aFunT(a => t)
    • n2nFunT(n2n => t)
    • n2dFunT(n2d => t)
  • Inferred (implicit) type parameter:
    • implNat(n => t)
    • implDt(dt => t)
    • implType(ty => t)
    • implBT(bt => t)
    • implST(st => t)
    • implAddr(a => t)
    • implN2n(n2n => t)
    • implN2DT(n2d => t)

DataType Syntax

  • ScalarType:
    • bool
    • int
    • i8 | i16 | i32 | i64
    • u8 | ...
    • f16 | ...
  • VectorType:
    • vec(n, dt)
  • PairType:
    • t1 x t2
  • ArrayType:
    • n.t
  • DepPairType:
    • n2dPairT(n => t)
    • nats2dPairT(ns => t)
  • DepArrayType:
    • n..(i => t)
    • n..(n2d)
  • NatToDataApply:
    • n2d(n)

NatToData Type Syntax

  • NatToDataLambda:
    • n2dtFun(n => t) | n2dtFun(r)(n => t) | n2dtFun(u)(n => t)

NatToNat Type Syntax

  • NatToNatLambda:
    • n2nFun(n => t) | n2nFun(r)(n => t) | n2nFun(u)(n => t)

Address spaces for C

It would be very useful to have a toMem primitive for C with different address spaces, that is, toStack and toHeap. Maybe @bastian-koepcke can add that small feature on top of his read/write annotations work?

Traversals should visit Nats inside NatToNat and NatToData.

Currently, if you define a visitor such as:

    case class Visitor() extends traversal.Visitor {
      override def visitNat(in: Nat): Result[Nat] =
        Continue(substitute.natInNat(n, `for`, in), this)
    }

This will not visit Nats inside NatToNat or NatToData constructs, because the default implementations skip them:

    def visitN2N(n2n: NatToNat): Result[NatToNat] = Continue(n2n, this)
    def visitN2D(n2d: NatToData): Result[NatToData] = Continue(n2d, this)

This breaks code such as substitute.natInType:

def natInType[T <: Type](n: Nat, `for`: NatIdentifier, in: T): T = {
  case class Visitor() extends traversal.Visitor {
    override def visitNat(in: Nat): Result[Nat] =
      Continue(substitute.natInNat(n, `for`, in), this)
  }
  traversal.types.DepthFirstLocalResult(in, Visitor())
}

To fix this, the contained nats should be visited.
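
A sketch of what fixed defaults could look like, assuming NatToNatLambda and NatToDataLambda constructors with a binder and a body (hypothetical shapes):

def visitN2N(n2n: NatToNat): Result[NatToNat] = n2n match {
  case NatToNatLambda(x, body) => visitNat(body) match {
    // recurse into the contained Nat instead of skipping it
    case Continue(b, v) => Continue(NatToNatLambda(x, b), v)
    case Stop(b)        => Stop(NatToNatLambda(x, b))
  }
  case other => Continue(other, this)
}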

Remove type assertions: have rewrites explicitly pass FTVs to inference

Type assertions are currently only ever used by rewrite rules, to make sure that the free type variables of open terms are preserved by the rewrite once inference is run on the rhs. We could instead remove type assertions, collect the FTVs to preserve before the rewrite, and hand them to type inference directly, as sketched below.
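
Sketched (hypothetical helper names):

// instead of asserting types on the rewritten rhs:
val ftvs = freeTypeVariables(lhs)          // collected before the rewrite
val typedRhs = infer(rhs, preserve = ftvs) // inference must not solve these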

Change the pattern matching style in fromRise

We should change the pattern matching style in fromRise to avoid breaking unreachability analysis:

@silent("unreach")
private def primitive(p: r.Primitive,
t: PhraseType): Phrase[_ <: PhraseType] = {
import rise.openCL.{primitives => ocl}
import rise.openMP.{primitives => omp}
import shine.OpenCL.FunctionalPrimitives._
import shine.OpenMP.FunctionalPrimitives._
import shine.DPIA.Types.MatchingDSL._
((p, t): @unchecked) match {
case (core.PrintType(msg), expT(dt: DataType, w) ->: _) =>
fun[ExpType](expT(dt, w), e => PrintType(msg, dt, w, e))
case (core.NatAsIndex(),
nFunT(n, expT(`NatType`, `read`) ->:

Instead of matching over the (p, t) pair, we could first match over p before matching over t, as sketched below. Macros or other helpers could help to refactor this without too many boilerplate changes.
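
Sketched on the case above (reusing its names; error is a hypothetical helper):

p match {
  case core.PrintType(msg) => t match {
    case expT(dt: DataType, w) ->: _ =>
      fun[ExpType](expT(dt, w), e => PrintType(msg, dt, w, e))
    case _ => error(s"unexpected type for PrintType: $t") // hypothetical
  }
  // ... one case per primitive, each with a total match on the type
}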

Reduce boilerplate in definition of fully applied DPIA primitives

When defining DPIA primitives, we have to write a lot of boilerplate, including the types of arguments both in Scala and in the DPIA world, and definitions of visitAndRebuild, the xmlPrinter, etc.
This is especially painful when refactoring (#37), when dealing with merge conflicts, or when new boilerplate needs to be added. For example, defining a dot printer or another pretty printer, as done in rise (@bastianhagedorn, @roger-uw), would require much more boilerplate for DPIA expressions.

We could make this easier either by defining primitives as functions as done in rise (which does not prevent us from maintaining a fully applied normal form where practical, or even from defining convenient extractors), and/or by factorising code, using macros, etc.

One old improvement idea was to make traversal more generic than with visitAndRebuild.

abstract class PrimitiveExpr {
  def children: List[Node] // e.g. List(f, init, array) for a reduce
  def rebuild(t: Option[Type]): List[Node] => PrimitiveExpr
}

This would basically allow us to recover the benefits of defining primitives as functions, by giving access to the implicit application nodes hidden inside our fully applied primitives.

Host code generation and multiple hardware devices

It would be very useful to be able to generate host code orchestrating multiple kernels, potentially on multiple hardware devices. This requires thinking about how the Rise programs would look, and how to generate code for them.

Determine if enforcing unique names is required (and implement it) or not

After switching to the unified DSL with #76 we no longer enforce unique names in Rise.

Do we still need to do this? Currently, all tests pass without enforcement of this.

Can we find a counterexample that shows that this is still required?

If we need to enforce unique names for binders, we need to implement this at the value and type level.

multiple test_util packages

In each of our shine, rise, and elevate repositories we have a test_util package at the top level. Supposedly this causes confusion in the IntelliJ IDE, such that no tests relying on it can be executed. Moving them one level in (e.g. shine.test_util) should remove the ambiguities.

Simplify code generation pass?

Our current code generation pass is becoming more and more complex. It would be desirable to factor out some logic such as the resolution of indexing primitives (split, take, ..) to simple idx chains.

Remove global counters

I think that we should remove all global counters except the ones used in the rise DSL. For example, during codegen we should use a program-wide counter instead of a compiler-wide counter. Currently, generating code multiple times keeps using bigger and bigger numbers, which is annoying for inspection and reproducibility and serves no purpose.

@rise-lang/core

Use integer-based variable names

I think that whichever way we go (names, indices, or pointers), we should go towards only using strings for debugging purposes and using integers for identification purposes, since that would speed up the compiler (a bit like string interning, but without using strings in the first place). Comparing two variables should then only cost a quick integer comparison instead of the current costly string comparison. For example, if we keep a "name-based" representation, an identifier could be:

case class Identifier(tag: String, id: Int) {
  override def toString: String = s"$tag$id"
  override def equals(other: Any): Boolean = other match {
    case o: Identifier => this.id == o.id
    case _ => false
  }
  override def hashCode(): Int = id.hashCode()
}

So when you look at a program as a human you might see variables tile0, tile1 and x2, but our compiler would only really care about variables 0, 1 and 2 for processing.

Originally posted by @Bastacyclop in #81 (comment)

Bring the documentation into the codebase

The repository in https://github.com/rise-lang/doc contains documentation about the DPIA phases. This documentation often becomes outdated whenever new primitives are added to DPIA. We should bring the documentation into the codebase, close to where the primitives are defined, and defer the compilation of these documents to scaladoc or something similar.

fromRise fails when a type variable should be a basic type (or other constraint)

Example:

  // average two positive values rounding up
  val avg: Expr = dtFun(dt => dtFun(compute_dt => fun(
    dt ->: dt ->: dt
  )((a, b) => {
    val cast_ = fun(a => cast(a) :: compute_dt)
    cast((cast_(a) + cast_(b) + cast_(l(1))) / cast_(l(2))) :: dt
  })))

  test("failure") {
    shine.DPIA.fromRise(infer(avg(i16)(i32)))
  }

More precisely, the pattern match in fromRise cannot succeed, since a TypeVariable is not a BasicType:

      case (core.Cast(), lt.FunType(la: lt.BasicType, lb: lt.BasicType))
      =>
        val a = basicType(la)
        val b = basicType(lb)
        fun[ExpType](ExpType(a, read), x =>
          Cast(a, b, x))

This is closely related to rise-lang/rise#11.

Gaussian algorithm [BUG]

Reproduce
Run test "baseline" here

Implementation
The implementation is inspired by this LIFT implementation:

  val gaussWeights = List(
    List(2,4,5,4,2),
    List(4,9,12,9,4),
    List(5,12,15,12,5),
    List(4,9,12,9,4),
    List(2,4,5,4,2)
  )

  val N = 1024
  val M = 1024


  val gauss: Rise = //infer(
    fun(ArrayType(N, ArrayType(M, f32)))(in =>
      fun(ArrayType(5, ArrayType(5, f32)))(weights =>
        in |> padClamp2D(2) // in: NxM -> (N+4) x (M+4)
          |> slide2D(5,2) // -> MxN of 5x5 slides
          |> map(map(fun(sector => // sector:5x5
            zip(sector)(weights) //(v,w)
            |> map(fun(rowPair =>
              zip(fst(rowPair))(snd(rowPair))
              |> map(fun(pair =>
                fst(pair) * snd(pair)
              )) |> reduce(add)(l(0.0f))
            )) |> reduce(add)(l(0.0f))
        )))
      )
    )

Error Message

[BASELINE] rewrite time: 0s
skipping type inference
free variable _n53


expression is not in closed form:  λe1 λe2 mapSeq
         ┝ λη142 mapSeq
         │       ┝ λe16 let
         │       │      ┝ reduceSeq <λη150. λη158. add η150 η158> (<λe166. e166> 0.0000)
         │       │      │ ┕ mapSeq
         │       │      │   ┝ λe20 let
         │       │      │   │      ┝ reduceSeq <λη146. λη154. add η146 η154> (<λe162. e162> 0.0000)
         │       │      │   │      │ ┕ mapSeq <λe28. mul (fst e28) (snd e28)>
         │       │      │   │      │   ┕ zip (fst e20) (snd e20)
         │       │      │   │      ┕ λe116. e116
         │       │      │   ┕ zip e16 e2
         │       │      ┕ λe129. e129
         │       ┕ η142
         ┕ map transpose
           ┕ slide 5 2
             ┕ map (slide 5 2)
               ┕ padClamp 2 2 (map (padClamp 2 2) e1)

 with type (1024.1024.f32 -> (5.5.f32 -> _n50._n53.f32))
java.lang.Exception: expression is not in closed form: (same expression and type as printed above)
	at shine.DPIA.fromRise$.apply(fromRise.scala:17)
	at util.gen$.toDPIA(gen.scala:10)
	at util.gen$.CProgram(gen.scala:14)
	at rise.elevate.gauss.run(gauss.scala:108)
	at rise.elevate.gauss.$anonfun$new$1(gauss.scala:64)
	... (ScalaTest and IDE runner frames omitted)



Process finished with exit code 0
