kindelia / hvm
A massively parallel, optimal functional runtime in Rust
Home Page: https://higherorderco.com
License: Apache License 2.0
HVM's runtime only handles direct, non-nested matches. So, for example, this is a valid rule:
(Foo (Bar xs)) = (A xs)
But this is not:
(Foo (Bar (Cons x xs))) = (A (Cons x xs))
Because there is a nested match. For these cases to work, we must split the left-hand side of nested patterns into several non-nested equations. Currently, the language has a flattener, but it is limited and will fail in some cases. For example, the function below:
Fn (Bar (Cons ...)) = A
Fn (Bar Nil) = B
Fn x = C
Should be compiled to:
Fn (Bar _0) = (Fn.0 _0)
(Fn.0 (Cons x y)) = A
(Fn.0 Nil) = B
(Fn.0 x) = C
Fn x = C
But the current flattener can't deal with that.
Consider the program calculating `sum (map head (tails list))`. Due to laziness, even though the normal form of `tails list` has O(n^2) elements, I expect the whole expression to run in linear time: we never need to fully evaluate each lazy sub-list, so the number of elements actually traversed is just n.
But unfortunately, HVM takes quadratic time and space in this scenario, indicating that it is not lazy on sub-lists and that all sub-list elements are actually generated.
HVM:
(Gen 0) = Nil
(Gen n) = (Cons n (Gen (- n 1)))
(Foldr f z Nil) = z
(Foldr f z (Cons x xs)) = (f x (Foldr f z xs))
(Tails Nil) = (Cons Nil Nil)
(Tails (Cons x xs)) = (Cons xs (Tails xs))
(Map f Nil) = Nil
(Map f (Cons x xs)) = (Cons (f x) (Map f xs))
(Head Nil) = 0
(Head (Cons x xs)) = x
(Main n) = (Foldr @a@b(+ a b) 0 (Map @l(Head l) (Tails (Gen n))))
Haskell:
import Prelude hiding (head, foldr, tails, map)
import Data.Word (Word32)
import System.Environment (getArgs)
data List a = Nil | Cons a (List a)
gen 0 = Nil
gen n = Cons n (gen (n - 1))
foldr f z Nil = z
foldr f z (Cons x xs) = f x (foldr f z xs)
tails Nil = Cons Nil Nil
tails (Cons x xs) = Cons xs (tails xs)
map f Nil = Nil
map f (Cons x xs) = Cons (f x) (map f xs)
head Nil = 0
head (Cons x xs) = x
main = do
n <- read . (!! 0) <$> getArgs :: IO Word32
print $ foldr (\a b -> a + b) 0 (map (\l -> head l) (tails (gen n)))
Run time:
| n | HVM (clang, -O2, single thread) | GHC, -O2 |
|---|---|---|
| 1000 | 131ms | 10ms |
| 2000 | 316ms | 10ms |
| 4000 | 1082ms | 12ms |
| 8000 | 5.039s (1.6G mem) | 15ms (5.0M mem) |
| 16000 | 31.018s (5.9G mem) | 20ms (5.6M mem) |
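For reference, the expected linear-time behavior can be sketched in plain Python (illustrative only, using tuples as cons cells): because the tails share structure rather than being copied, summing their heads touches O(n) cells, even though the tails collectively contain O(n^2) elements.

```python
def gen(n):
    # (Gen n) builds the list [n, n-1, ..., 1] as nested cons pairs
    xs = None                       # Nil
    for i in range(1, n + 1):
        xs = (i, xs)                # Cons i xs
    return xs

def tails(xs):
    # (Tails (Cons x xs)) = (Cons xs (Tails xs)); (Tails Nil) = (Cons Nil Nil)
    while xs is not None:
        xs = xs[1]
        yield xs                    # each tail is shared, not copied
    yield None

def head(xs):
    # (Head Nil) = 0; (Head (Cons x xs)) = x
    return 0 if xs is None else xs[0]

# Only the head cell of each tail is ever inspected: O(n) work overall.
assert sum(head(t) for t in tails(gen(10))) == 45
```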
I'd think this is a serious issue, since it's quite a common pattern to generate lazy data structures and share their parts everywhere (e.g. B-Trees, Finger Trees).
I've just noticed the recently Kickstarted SimulaVR "virtual reality computer". Which is cool on its own, but what's cooler is that the UX software is written in Haskell.
They seem to be planning a second version of the software, whose design and workings they are still contemplating. Wouldn't it be cool to write this next-gen UX software in something that targets HVM (e.g. Kind)?
It doesn't seem to me that there is extreme pressure to ship the next-gen version; it should rather be based on the experience and usage data gathered from the current version, which on its own will take months at least.
@georgewsinger if you find some time, feel free to chime in here 😉.
ToT (61a331f) doesn't appear to compile with gcc version 11.2.0 (from Ubuntu 21.10):
$ hvm c bench/Fibonacci/main.hvm
Compiled to 'bench/Fibonacci/main.c'.
$ gcc -O2 bench/Fibonacci/main.c
bench/Fibonacci/main.c:932:5: error: variably modified ‘normal_seen_data’ at file scope
932 | u64 normal_seen_data[NORMAL_SEEN_MCAP];
| ^~~~~~~~~~~~~~~~
Digging in, I see:
#define NORMAL_SEEN_MCAP (HEAP_SIZE/sizeof(u64)/(sizeof(u64)*8))
const u64 HEAP_SIZE = 8 * U64_PER_GB * sizeof(u64);
(Aside: in C, unlike C++, a `const`-qualified variable is not an integer constant expression, so `NORMAL_SEEN_MCAP` makes `normal_seen_data` a variably modified type, which is invalid at file scope. clang apparently tolerates this, but gcc does not; defining `HEAP_SIZE` and friends as `#define` or `enum` constants instead of `const u64` should fix the build.)
The following program is accepted, runs, and outputs (Foo 0 (Succ (Zero))):
(Foo Zero) = None
(Foo a (Succ pred)) = (Some a)
(Main) = (Foo 0 (Succ Zero))
I expected it to give an error because of the inconsistent number of parameters, or else for the output to be (Some 0).
On the other hand, this produces an error:
(Foo Zero) = None
(Foo a (Succ pred)) = (Some pred)
(Main) = (Foo 0 (Succ Zero))
I was implementing some lambda-encoded list functions when I ran across this error. Running the following code:
Pair = λx λy λp (p x y)
Bimap = λf λg λp λq (p (λx λy (q (f x) (g y))))
Uncurry = λf λp (p (λx λy (f x y)))
Inl = λx λl λr (l x)
Inr = λx λl λr (r x)
Nil = ((Inl) (λx x))
Cons = λx λxs ((Inr) ((Pair) x xs))
Y = (λ f ((λ x (f (x x))) (λ x (f (x x)))))
// ListFMap : (X -> Y) -> 1 + A x X -> 1 + A x Y
ListFMap = λf λlf λl λr (lf l (λp (r ((Bimap) (λx x) f p))))
// ListCata : (1 + A x S -> S) -> [A] -> S
ListCata = λa ((Y) (λf λl (a ((ListFMap) f l))))
// ListAlg : B -> (A -> B -> B) -> (1 + A x B -> B)
ListAlg = λn λr λlf (lf (λu n) ((Uncurry) r))
Foldr = λn λr ((ListCata) ((ListAlg) n r))
Append = λl1 λl2 ((Foldr) l2 Cons l1)
Concat = ((Foldr) Nil Append)
Main = ((Concat) ((Cons) ((Cons) 1 Nil) Nil))
causes the following error:
Reducing.
Expected Term:
0 | λ_ λx1 (x1 λx2 ((x2 1) (^19 ^60)))
thread 'main' panicked at 'No parse.'
with a red underline between `^19` and `^60`.
I modified the example program:
// Creates a tree with `2^n` elements
(Gen 0) = (Leaf 1)
(Gen n) = (Node (Gen(- n 1)) (Gen(- n 1)))
// Adds all elements of a tree
(Sum (Leaf x)) = x
(Sum (Node a b)) = (+ ((+ (Sum a) (Sum b))) ((- (Sum a) (Sum b))))
// Performs 2^n additions in parallel
(Main n) = (Sum (Gen n))
It often outputs 256, but sometimes outputs an unevaluated expression, and the leftover expressions differ between runs:
$ time ./main 8
Rewrites: 304812 (2.60 MR/s).
Mem.Size: 198923 words.
((((64+((32+((16+((8+((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))+(8-((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))))+(16-((8+((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))+(8-((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))))))+(32-((16+((8+((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))+(8-((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))))+(16-((8+((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))+(8-((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))))))))+(64-((32+((16+((8+((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))+(8-((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))))+(16-((8+((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))+(8-((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))))))+(32-((16+((8+((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))+(8-((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 
1))+(1-(Sum 1)))-2)))))))+(16-((8+((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))+(8-((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))))))))))+128)+0)
real 0m0.146s
user 0m0.052s
sys 0m0.111s
Rewrites: 305163 (2.14 MR/s).
Mem.Size: 200042 words.
(256+(((64+((((16+((((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))+8)+(((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))-8)))+(16-((((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))+8)+(((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))-8))))+32)+(((16+((((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))+8)+(((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))-8)))+(16-((((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))+8)+(((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 
1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))-8))))-32)))+(64-((((16+((((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))+8)+(((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))-8)))+(16-((((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))+8)+(((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))-8))))+32)+(((16+((((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))+8)+(((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))-8)))+(16-((((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))+8)+(((((((1+(Sum 1))+(1-(Sum 
1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))-8))))-32))))-128))
real 0m0.164s
user 0m0.045s
sys 0m0.130s
The `Fold`:
// Folds over a list
(Fold Nil c n) = n
(Fold (Cons x xs) c n) = (c x (Fold xs c n))
The `2`:
let g = λf(λx(f (f x)))
(g g)
How is it possible that `(g g)` is a no-no, but `Fold`, supposedly, can be used twice?
P.S.: Are the following scenarios possible and correct?
(Fold lst (\x y. ... (Fold ...) ...) 5)
(Fold lst Plus (... (Fold ...) ...))
The limitation seems to be there to prevent two non-annihilating `Dup` nodes from passing through one another and becoming four, isn't it?
Compiling this:
(Main) = (* 2147483648 2147483648)
Outputs this C code:
case _MAIN_: {
if (1) {
inc_cost(mem);
u64 ret_0;
if (get_tag(U_32(2147483648)) == U32 && get_tag(U_32(2147483648)) == U32) {
ret_0 = U_32(get_val(U_32(2147483648)) * get_val(U_32(2147483648)));
inc_cost(mem);
} else {
u64 op2_1 = alloc(mem, 2);
link(mem, op2_1 + 0, U_32(2147483648));
link(mem, op2_1 + 1, U_32(2147483648));
ret_0 = Op2(MUL, op2_1);
}
u64 done = ret_0;
link(mem, host, done);
clear(mem, get_loc(term, 0), 0);
init = 1;
continue;
}
break;
};
Where
ret_0 = U_32(get_val(U_32(2147483648)) * get_val(U_32(2147483648)));
the result of the multiplication can exceed 32 bits. This can cause type confusion, and code execution if you were to run arbitrary HVM code.
The same goes for (<< 3 60).
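A sketch of the arithmetic (plain Python, just to illustrate the ranges involved): both operands are valid u32 values, but their product needs 63 bits, so storing it unmasked spills into whatever bits sit above the 32-bit value field. The usual fix is to reduce the result mod 2^32:

```python
U32_MASK = 0xFFFFFFFF          # the 32-bit value field

a = b = 2147483648             # 2^31: each operand fits in a u32
product = a * b                # 2^62: far larger than 32 bits

assert product > U32_MASK      # the unmasked result overflows the field

wrapped = product & U32_MASK   # wrap around mod 2^32 instead
assert wrapped == 0            # 2^62 mod 2^32 == 0

# The shift example from the report overflows the same way:
assert (3 << 60) > U32_MASK
assert (3 << 60) & U32_MASK == 0
```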
As of commit 22b0019 and using main.hovm from the readme as the input source, running the compilation step:
clang -O2 main.hovm.out.c -o main.hovm.out
fails with:
main.hovm.out.c:964:23: warning: incompatible pointer to integer conversion assigning to 'Thd' (aka 'unsigned long') from 'void *' [-Wint-conversion]
workers[t].thread = NULL;
^ ~~~~
1 warning generated.
/nix/store/cdm6zywd51mbabxhklsixwcskv4n70s3-binutils-2.35.2/bin/ld: /tmp/main-a67557.o: in function `ffi_normal':
main.hovm.out.c:(.text+0x32fb): undefined reference to `pthread_create'
/nix/store/cdm6zywd51mbabxhklsixwcskv4n70s3-binutils-2.35.2/bin/ld: main.hovm.out.c:(.text+0x3311): undefined reference to `pthread_create'
/nix/store/cdm6zywd51mbabxhklsixwcskv4n70s3-binutils-2.35.2/bin/ld: main.hovm.out.c:(.text+0x3327): undefined reference to `pthread_create'
/nix/store/cdm6zywd51mbabxhklsixwcskv4n70s3-binutils-2.35.2/bin/ld: main.hovm.out.c:(.text+0x333d): undefined reference to `pthread_create'
/nix/store/cdm6zywd51mbabxhklsixwcskv4n70s3-binutils-2.35.2/bin/ld: main.hovm.out.c:(.text+0x3353): undefined reference to `pthread_create'
/nix/store/cdm6zywd51mbabxhklsixwcskv4n70s3-binutils-2.35.2/bin/ld: /tmp/main-a67557.o:main.hovm.out.c:(.text+0x3369): more undefined references to `pthread_create' follow
clang-7: error: linker command failed with exit code 1 (use -v to see invocation)
Looking in main.hovm.out.c, I can see that there's indeed a call to a function pthread_create
on line 971, but this function doesn't seem to be defined anywhere in the rest of the file.
System info:
$ clang --version
clang version 7.1.0 (tags/RELEASE_710/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /nix/store/ass1sf1bx07qvlrg02nymxnyzp1cpxz7-clang-7.1.0/bin
$ uname -a
Linux a 5.15.16_1 #1 SMP Thu Jan 20 17:19:29 UTC 2022 x86_64 GNU/Linux
Are there any other requirements aside from clang that I'm supposed to have? I also tried installing llvm and lld, but these didn't help.
(A likely culprit: `pthread_create` lives in libpthread, which many toolchains don't link by default; adding `-lpthread` to the clang invocation should resolve the undefined references.)
Hi,
First of all, implementing evaluation under lambdas is really cool! Thank you.
Second, could you please add some references or links to either HOW.md or a new Background.md document? I see that you mentioned some links in issue #60:
Right now the `src/runtime.c` template is not a valid C file before the substitutions are made. This makes developing on that file unnecessarily hard.
Some solutions to this are discussed in #32, e.g.:
Sure, or variations of `int value = /* START */ 0 /* END */;` expressions where it is important, and replace both tokens and the content they box.
Cloning the repo and doing `nix build` has the same result. `nix develop` also errors out with the same error.
$ nix build github:Kindelia/HVM
error: builder for '/nix/store/bgqdhks6dikjfbqmhc7iknx5pflgiar0-hvm-crate2nix.drv' failed with exit code 3;
last 10 log lines:
> drwxr-xr-x 11 nixbld nixbld 4096 Jan 1 1970 bench
> -rw-r--r-- 1 nixbld nixbld 320 Jan 1 1970 default.nix
> drwxr-xr-x 2 nixbld nixbld 4096 Jan 1 1970 examples
> -rw-r--r-- 1 nixbld nixbld 3408 Jan 1 1970 flake.lock
> -rw-r--r-- 1 nixbld nixbld 2961 Jan 1 1970 flake.nix
> -rwxr-xr-x 1 nixbld nixbld 624 Jan 1 1970 prep.sh
> -rw-r--r-- 1 nixbld nixbld 62 Jan 1 1970 rustfmt.toml
> -rw-r--r-- 1 nixbld nixbld 318 Jan 1 1970 shell.nix
> drwxr-xr-x 2 nixbld nixbld 4096 Jan 1 1970 src
> == ls -la (END)
For full logs, run 'nix log /nix/store/bgqdhks6dikjfbqmhc7iknx5pflgiar0-hvm-crate2nix.drv'.
(use '--show-trace' to show detailed location information)
The output of `nix log /nix/store/bgqdhks6dikjfbqmhc7iknx5pflgiar0-hvm-crate2nix.drv` is:
@nix { "action": "setPhase", "phase": "unpackPhase" }
unpacking sources
unpacking source archive /nix/store/h3g8skxxp8kazfzk8bbk92v8bryx3wfv-rh2xy9kpb76jc79mlc7gs83qxqhgfras-source
source root is rh2xy9kpb76jc79mlc7gs83qxqhgfras-source
@nix { "action": "setPhase", "phase": "buildPhase" }
building
++ crate2nix generate -f ./Cargo.toml -o Cargo-generated.nix -h /nix/store/rk2yq19vkhgx09fswn2gsm2zyzgmfy34-hvm-crate2nix/crate-hashes.json
Error: while retrieving metadata about ./Cargo.toml: `cargo metadata` exited with an error: error: the lock file /build/rh2xy9kpb76jc79mlc7gs83qxqhgfras-source/Cargo.lock needs to be updated but --locked was passed to prevent this
If you want to try to generate the lock file without accessing the network, remove the --locked flag and use --offline instead.
crate2nix failed.
== cargo/config (BEGIN)
[source.crates-io]
replace-with = "vendored-sources"
[source.vendored-sources]
directory = "/nix/store/5wivq9n9cy4fpcywkiir70p7bkkkabyc-deps"
== cargo/config (END)
== crate-hashes.json (BEGIN)
{}
== crate-hashes.json (END)
== ls -la (BEGIN)
total 132
drwxr-xr-x 7 nixbld nixbld 4096 Jan 1 1970 .
drwx------ 3 nixbld nixbld 4096 Mar 31 21:03 ..
drwxr-xr-x 3 nixbld nixbld 4096 Jan 1 1970 .github
-rw-r--r-- 1 nixbld nixbld 97 Jan 1 1970 .gitignore
drwxr-xr-x 2 nixbld nixbld 4096 Jan 1 1970 .vscode
-rw-r--r-- 1 nixbld nixbld 266 Jan 1 1970 BUILDING.md
-rw-r--r-- 1 nixbld nixbld 8695 Jan 1 1970 Cargo.lock
-rw-r--r-- 1 nixbld nixbld 612 Jan 1 1970 Cargo.toml
-rw-r--r-- 1 nixbld nixbld 35535 Jan 1 1970 HOW.md
-rw-r--r-- 1 nixbld nixbld 1068 Jan 1 1970 LICENSE
-rw-r--r-- 1 nixbld nixbld 802 Jan 1 1970 NIX.md
-rw-r--r-- 1 nixbld nixbld 10783 Jan 1 1970 README.md
drwxr-xr-x 11 nixbld nixbld 4096 Jan 1 1970 bench
-rw-r--r-- 1 nixbld nixbld 320 Jan 1 1970 default.nix
drwxr-xr-x 2 nixbld nixbld 4096 Jan 1 1970 examples
-rw-r--r-- 1 nixbld nixbld 3408 Jan 1 1970 flake.lock
-rw-r--r-- 1 nixbld nixbld 2961 Jan 1 1970 flake.nix
-rwxr-xr-x 1 nixbld nixbld 624 Jan 1 1970 prep.sh
-rw-r--r-- 1 nixbld nixbld 62 Jan 1 1970 rustfmt.toml
-rw-r--r-- 1 nixbld nixbld 318 Jan 1 1970 shell.nix
drwxr-xr-x 2 nixbld nixbld 4096 Jan 1 1970 src
== ls -la (END)
I want to make an interpreter - so it can fail with a trace of what it tried to do, and one I can play with.
I also want to make a HM-like typechecker (probably with some extensions).
For that, I need a formal definition of what's going on. This looks like Lafont's interaction nets, but with "superposition nodes", which - as I understand it - is a good way to avoid needing an oracle (which ruins everything). I kinda understand how they work from HOW.md, but not quite.
How are the `s` and `r` from `{s r}` linked to `\x.`, where `s` and `r` are clones of `b` from `\x.b`?
I also don't get how `Era` nodes reduce. Do they replace everything they interact with by `Era` nodes, while returning themselves to the freelist?
In the same vein as Wasm, adding a simple binary format might make HVM an easier/more compact target for language backends. Startup times may also be faster, as there's no need to parse Unicode and build up the syntax tree. Thoughts?
As an aside, are you planning on documenting a public interface and releasing on crates.io for use in other Rust projects?
$ cat /etc/os-release
PRETTY_NAME="Raspbian GNU/Linux 10 (buster)"
NAME="Raspbian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=raspbian
ID_LIKE=debian
HOME_URL="http://www.raspbian.org/"
SUPPORT_URL="http://www.raspbian.org/RaspbianForums"
BUG_REPORT_URL="http://www.raspbian.org/RaspbianBugs"
$ cat main.hvm
// Creates a tree with `2^n` elements
(Gen 0) = (Leaf 1)
(Gen n) = (Node (Gen(- n 1)) (Gen(- n 1)))
// Adds all elements of a tree
(Sum (Leaf x)) = x
(Sum (Node a b)) = (+ (Sum a) (Sum b))
// Performs 2^n additions in parallel
(Main n) = (Sum (Gen n))
$ RUST_BACKTRACE=1 hvm r main 3
thread 'main' panicked at 'capacity overflow', library/alloc/src/raw_vec.rs:509:5
stack backtrace:
0: rust_begin_unwind
at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panicking.rs:498:5
1: core::panicking::panic_fmt
at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/panicking.rs:107:14
2: core::panicking::panic
at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/panicking.rs:48:5
3: alloc::raw_vec::capacity_overflow
at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/alloc/src/raw_vec.rs:509:5
4: hvm::builder::eval_code
5: hvm::run_code
6: hvm::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
$ git rev-parse HEAD
ff9ed47b163d6d24fdc6f968d35b95b81deb0969
$ cargo build
Finished dev [unoptimized + debuginfo] target(s) in 0.08s
$ cd bench/Fibonacci
$ ../../target/debug/hvm c main.hvm
Compiled to 'main.c'.
$ clang -g -O2 main.c -o main -lpthread
$ ./main 10
main: main.c:1337: int main(int, char **): Assertion `mem.node' failed.
Aborted
$
The message is from:
https://github.com/Kindelia/HVM/blob/ff9ed47b163d6d24fdc6f968d35b95b81deb0969/src/runtime.c#L1229
But I think the change in 99e3286 is ultimately what's causing it.
Hi there,
this is a really interesting project. I was wondering why the target runtime is obtained by transpiling to a C template, instead of directly using Rust itself (e.g. using Rayon); another option could be targeting something like LLVM.
Thank you
In HVM, an interesting way to represent numbers is to use λ-encoded digit-strings of varying bases, parametrized by a list of numbers (a base-list). For example, by using `[2,3,5]`, we represent numbers in base 2-3-5:
n = a + b * 2 + c * (2 * 3), where a < 2, b < 3, c < 5
Moreover, since `2 * 3 * 5 = 30`, numbers larger than `30` in base 2-3-5 will wrap around. That means operations on them will be mod `30`, with no need for a modulus operation. This may help implement `mod N` algorithms that fuse, by avoiding calls to `mod`, which could interrupt fusion.
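As a plain-Python sanity check of the encoding (illustrative helpers, not part of the HVM code below): digits are stored least-significant first, and any number is automatically reduced mod the product of the bases.

```python
def to_digits(n, bases):
    # n -> [n % b0, (n // b0) % b1, ...], least-significant digit first
    ds = []
    for b in bases:
        ds.append(n % b)
        n //= b
    return ds

def from_digits(ds, bases):
    # inverse of to_digits: n = d0 + d1*b0 + d2*b0*b1 + ...
    n, m = 0, 1
    for d, b in zip(ds, bases):
        n += d * m
        m *= b
    return n

bases = [2, 3, 5]
assert to_digits(7, bases) == [1, 0, 1]            # 7 = 1 + 0*2 + 1*6
assert from_digits([1, 0, 1], bases) == 7
# numbers wrap around mod 2*3*5 == 30:
assert from_digits(to_digits(37, bases), bases) == 37 % 30
```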
The base-list can be anything. For example, `[100]` would give us operations mod `100`; but, in turn, each number would use a lot of space, since each digit has 100 possible values. A better idea is to factorize `100` and use `[2,2,5,5]` instead, which also gives us operations mod `100`, but using much less space. Of course, if we use a base-list with identical numbers, say `[2,2,2,2]`, we'll just have binary numbers, or N-ary numbers for whatever the `N` is. Here is the code:
// Applies `f` `n` times to `x`, directly
(Repeat 0 f x) = x
(Repeat n f x) = (f (Repeat (- n 1) f x))
// Given a base-list, applies `f` `n` times to `x`,
// in such a way that is optimized for that base-list
(Apply Nil n f x) = x
(Apply (Cons b bs) n f x) = (Apply bs (/ n b) λk(Repeat b f k) (Repeat (% n b) f x))
// Given a base-list, applies `f` `n` times to `x`,
// in such a way that is optimized for that base-list
(Times Nil n f x) = x
(Times (Cons b bs) n f x) = ((TimesGo b b bs n) f x)
(TimesGo 0 b bs n) = (n λfλx(x))
(TimesGo i b bs n) = ((TimesGo (- i 1) b bs n) λpλfλx(Times bs p λk(Repeat b f k) (Repeat (- i 1) f x)))
// Given a base, ends a digit-string
(E base) = λend (EGo base end)
(EGo 0 end) = end
(EGo base end) = λctr (EGo (- base 1) end)
// Given a base, appends `digit` to a digit-string
(D base digit pred) = λend (DGo base digit pred λx(x))
(DGo 0 n pred ctr) = (ctr pred)
(DGo base 0 pred ctr) = λctr (DGo (- base 1) (- 0 1) pred ctr)
(DGo base n pred ctr) = λera (DGo (- base 1) (- n 1) pred ctr)
// Given a base-list, converts a digit-string to a list
(ToList Nil xs) = Nil
(ToList (Cons b bs) xs) = (ToListGo b bs xs)
(ToListGo 0 bs xs) = (xs Nil)
(ToListGo b bs xs) = ((ToListGo (- b 1) bs xs) λp(Cons (- b 1) (ToList bs p)))
// Given a base-list, converts a digit-string to a number
(ToU32 bases xs) = (ToU32Go bases (ToList bases xs) 1)
(ToU32Go Nil Nil m) = 0
(ToU32Go (Cons b bs) (Cons x xs) m) = (+ (* x m) (ToU32Go bs xs (* m b)))
// Given a base-list, returns the number 0
(Zero Nil ) = End
(Zero (Cons b bs)) = (D b 0 (Zero bs))
// Given a base-list and a u32 `i`, returns the number `i` as a digit-string
(Number bases i) = (Apply bases i λx(Inc bases x) (Zero bases))
// Given a base, applies a function to the predecessor
// (Inj [3] f λeλaλbλc(a pred)) == λeλaλbλc(a (f pred))
(Inj base f xs) = λen (InjGo base f (xs λf(en)))
(InjGo 0 f xs) = (xs f)
(InjGo b f xs) = λv (InjGo (- b 1) f (xs λpλf(v (f p))))
// Given a base-list, increments a digit-string
(Inc Nil xs) = Nil
(Inc (Cons b bs) xs) = λen λn0 (IncMake (- b 1) bs (xs en) λp(n0 (Inc bs p)))
(IncMake 0 bs xs ic) = (xs ic)
(IncMake n bs xs ic) = λv (IncMake (- n 1) bs (xs v) ic)
// Given a base-list, increments `b` a total of `a` times
// This is equivalent to addition, and is fast due to fusion
(Add bases a b) = (Times bases a λx(Inc bases x) b)
// Given a base-list, creates an adder for two digit-strings, with carry bits
(AdderCarry Nil xs) = λys(ys)
(AdderCarry (Cons b bs) xs) = (AdderCarryGo b b bs xs)
(AdderCarryGo 0 b bs xs) = (xs λys(ys))
(AdderCarryGo i b bs xs) = ((AdderCarryGo (- i 1) b bs xs) (λxs λys (Repeat (- i 1) λx(Inc (Cons b bs) x) (Inj b (AdderCarry bs xs) ys))))
// Given a base-list, adds two bit-strings, with carry bits
(AddCarry bases xs ys) = ((AdderCarry bases xs) ys)
// FIXME: this is wrong, only works if all bases are the same
(Mul bases xs ys) = (MulGo bases xs λk(Add bases ys k))
(MulGo Nil xs add) = End
(MulGo (Cons b bs) xs add) = (MulDo b b bs xs add)
(MulDo b 0 bs xs add) = (xs End)
(MulDo b i bs xs add) = ((MulDo b (- i 1) bs xs add) λp(Repeat (- i 1) add (D b 0 (MulGo bs p add))))
(Main x) =
let bases = (Cons 2 (Cons 3 (Cons 5 (Cons 7 Nil))))
let to_list = λx(ToList bases x)
let to_u32 = λx(ToU32 bases x)
let times = λnλfλx(Times bases n f x)
let apply = λnλfλx(Apply bases n f x)
let zero = (Zero bases)
let inc = λx(Inc bases x)
let add = λaλb(Add bases a b)
let addc = λaλb(AddCarry bases a b)
let mul_a = λaλb(Times bases a (add b) zero) // mul by repeated add by repeated inc
let mul_b = λaλb(Times bases a (addc b) zero) // mul by repeated add with carry
let mul_c = λaλb(Mul bases a b) // mul using the incorrect algorithm
let num = λi(Number bases i)
(to_u32 (add (num 123456789) (num 987654321)))
This snippet includes an example of computing `123456789 + 987654321` mod `210`, and it efficiently outputs `60`; no call to `mod` is needed. Problems:
How do we implement `mul` and `exp` efficiently? The code above has a `mul`, but it is wrong (see the FIXME).
Given a function `f(x) = a^x mod N`, what is the optimal algorithm for finding its period?
Awesome project, got to it via HN.
I thought I'd try to port microKanren to HVM (why not), but it requires matching on a lambda, or a slight change in the semantics.
Here's what I came up with so far; it's likely I missed a few things or got them wrong:
(Car (Cons x y)) = x
(Cdr (Cons x y)) = y
(Vareq (Var v1) (Var v2)) = (== v1 v2)
(ExtS x v s) = (Cons (Cons x v) s)
(EgInner sc NIL) = MZERO
(EgInner sc s) = (Unit (Cons s (Cdr sc)))
(Eg u v) =
λsc(EgInner sc (Unify u v (Car sc)))
(Unit sc) = (Cons sc MZERO)
MZERO = NIL
(UnifyInner u v NIL) = NIL
(UnifyInner (Var v) (Var v) s) = s
(UnifyInner (Var u) v s) = (ExtS (Var u) v s)
(UnifyInner u (Var v) s) = (ExtS (Var v) u s)
(UnifyInner (Cons hu tu) (Cons hv tv) s) =
(let s = (Unify hu hv s);
(Unify tu tv s))
(UnifyInner u u s) = s
(Unify u v s) =
(UnifyInner (Walk u s) (Walk v s) s)
(Walk u NIL) = u
(Walk (Var u) (Cons (Cons (Var u) v) s)) = (Walk v s)
(Walk (Var u) .) = (Var u)
(CallFresh f) = @sc(let c = (Cdr sc);
((f (Var c)) (Cons (Car sc) (+ c 1))))
(Disj g h) = @sc(Mplus (g sc) (h sc))
(Conj g h) = @sc(Bind (g sc) h)
(Mplus NIL y) = y
(Mplus (Cons h t) y) = (Cons h (Mplus t y))
(Mplus fn x) = @(Mplus x (fn))
(Bind NIL g) = MZERO
(Bind (Cons h t) g) = (Mplus (g h) (Bind t g))
(Bind fn g) = @(Bind (fn) g)
EMPTY_STATE = (Cons NIL 0)
// uK ends here
(Zzz g) = @sc(@(g sc))
(ConjL (Cons g NIL)) = (Zzz g)
(ConjL (Cons g gs)) = (Conj (Zzz g) (ConjL gs))
(DisjL (Cons g NIL)) = (Zzz g)
(DisjL (Cons g gs)) = (Disj (Zzz g) (DisjL gs))
(Conde (Cons l alts)) = (DisjL (Cons (ConjL l) (Conde alts)))
(Fresh NIL goals) = (ConjL goals)
(Fresh (Cons v lvars) goals) = @v(Fresh lvars goals)
(Pull NIL) = NIL
(Pull (Cons x y)) = (Cons x y)
(Pull f) = (Pull (f))
(Take 0 f) = NIL
(Take n NIL) = NIL
(Take n (Cons x y)) = (Cons x (Take (- n 1) y))
(Take n f) = (Take n (Pull f))
//
(Fives x) = (Disj (Eg x 5) (Fives x))
(Main) = (let f = (CallFresh Fives);
(f EMPTY_STATE))
I was playing around with lambda-encoded bitstrings and came across this weird behavior. Take the following functions:
Empty = λe λo λi e
(Zero s) = λe λo λi (o s)
(One s) = λe λo λi (i s)
(Inc b) =
λe λo λi
(b (o Empty)
i
(λx (o (Inc x))))
If we set `Main = (One Empty)` we get
λ_ λ_ λx1 (x1 λx2 λ_ λ_ x2)
If we set `Main = (Inc (Inc Empty))` we get
λ_ λx1 λx2 (x2 λx3 λ_ λ_ x3)
which, while printed a bit differently, is clearly the same term.
If we set `Main = (Inc (One Empty))` we get
λ_ λx1 λ_ (x1 λ_ λx2 λ_ (x2 λx3 λ_ λ_ x3))
which is what I expect. If we set `Main = (Inc (Inc (Inc Empty)))` we should get something alpha-equivalent, but instead we get a stack overflow.
The same technique also applies for Inc. We start with the usual definition:
(Inc E) = E
(Inc (O x)) = (I x)
(Inc (I x)) = (O (Inc x))
I don't understand this. It looks a lot like you're defining `Inc 0` to be `0` in the first line. But surely if you increment 0 you get 1, right? Shouldn't it be
(Inc E) = (O E)
If so, it looks a lot like defining numbers in a bijective base-2 scheme. While that is absolutely great and makes perfect sense to me, is it the usual definition? I thought that in lambda calculus numbers are commonly defined in unary, i.e. just as repeated application.
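One way to read the definition (an interpretation, not an authoritative answer): `(Inc E) = E` drops the final carry, so a digit-string of k bits behaves like a k-bit machine word, wrapping around mod 2^k rather than growing. A Python sketch of the three rules:

```python
# Bitstrings, least-significant bit first: E ends the string,
# (O x) has value 2*val(x), (I x) has value 2*val(x) + 1.
E = ("E",)
def O(x): return ("O", x)
def I(x): return ("I", x)

def inc(b):
    if b[0] == "E": return E           # (Inc E) = E  -- the carry out is dropped
    if b[0] == "O": return I(b[1])     # (Inc (O x)) = (I x)
    return O(inc(b[1]))                # (Inc (I x)) = (O (Inc x))

def val(b):
    if b[0] == "E": return 0
    return 2 * val(b[1]) + (1 if b[0] == "I" else 0)

zero3 = O(O(O(E)))                     # a 3-bit string holding 0
x = zero3
for n in range(1, 8):
    x = inc(x)
    assert val(x) == n                 # counts 1..7 as expected
x = inc(x)
assert val(x) == 0                     # the 8th increment wraps: mod 2**3
```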
Currently, the C allocator just reserves 8 GB of memory when the program starts and, if there are multiple threads, that memory is split between them. So, for example, if there are 8 threads, each one gets 1 GB to work with. A worker can read from other workers' memories (which may happen for shared data, i.e., when they pass through a DP0/DP1 pointer), but only the owner of the slice can allocate memory from it. Each worker also keeps its own freelist.
As such, a huge improvement would be thread-safe, global, possibly arena-based, alloc and free.
Right now, the task scheduler just partitions threads evenly among the normal form of the result. So, for example, if the result is `(Pair A B)` and there are 8 threads, 4 will be assigned to reduce A and 4 to reduce B. This is obviously very limited: if A reduces quickly, the scheduler won't re-assign those threads to help reduce B, so the program will only use half of the cores from that point on. And so on. This is why algorithms that return lists are often slower: they aren't really using the threads at all. In the worst case, it just falls back to being single-threaded.
A huge improvement would be a proper scheduler, that adds potential redexes to a task pool. When I attempted to do that, the cost of synchronization added too much overhead, ruining the performance. Perhaps a heuristic to consider would be to limit the depth of the redexes for which global tasks are emitted; anything below, say, depth=8, would just be reduced by the thread that reaches it. Many improvements can be done, though.
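The depth heuristic mentioned above can be sketched as follows (illustrative Python with a hypothetical `SPAWN_DEPTH` cutoff; a real implementation would need a lock-free or low-contention pool):

```python
from collections import deque

SPAWN_DEPTH = 8        # below this depth, emit global tasks; past it, stay local
pool = deque()         # shared task pool that idle workers would steal from

def on_redex(redex, depth):
    if depth < SPAWN_DEPTH:
        pool.append((redex, depth))   # shallow redex: worth the sync cost
        return "global"
    return "local"                    # deep redex: reduce on this thread

assert on_redex("near the root", 2) == "global"
assert on_redex("deep inside", 20) == "local"
assert len(pool) == 1
```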
Right now, HVM only supports U32. The numeric types above should be supported too. I32 and F32 should be easy to add, since they are unboxed, like U32. I64, U64 and F64 require implementing boxed numbers, since they don't fit inside a 64-bit Lnk (due to the 4-bit pointer), but shouldn't be hard. Discussion on whether we should have unboxed 60-bit variants is valid.
On #81.
On #54.
Using HVM as a pure Rust lib inside other projects should be very easy, especially considering the interpreter has fairly good performance (only about 50% slower than the single-threaded compiled version). We must think of the right API to expose these features in a Rust-idiomatic, future-proof way. Input from the Rust community is appreciated.
Right now, HVM compiles to C. That means a C compiler like clang or gcc is necessary to produce executables, which means that it isn't portable when used as a lib. Of course, libs can just use the interpreter, but it is ~3x slower than the compiler, and is not parallel. Ideally, we'd instead JIT-compile HVM files using WASM or perhaps Cranelift, allowing lib users to enjoy the full speed of HVM in a portable way.
Adding IO should be easy: just write an alternative version of normalize
that, instead of just outputting the normal form, pattern-matches it against the expected IO type, and performs IO operations as demanded.
The compiler doesn't support Windows yet. The use of -lpthread may be an issue. The interpreter should work fine, though.
On #52.
Compiling to the GPU would be amazing, and I see no reason this shouldn't be possible. The only complications I can think of are:
How do we alloc inside a GPU work unit?
Reduce() is highly branching, so this may destroy the performance. But it can be altered to group tasks by category and merge all branches in one go. The reduce function essentially does this: branch -> match -> alloc -> subst -> free. The last 4 steps are very similar regardless of which branch it falls into. So, instead, the branch part can just prepare some data, and the match -> alloc -> subst -> free passes can all be moved to the same branch, just with different indices and parameters. This will avoid thread divergence.
GPGPU is a laborious mess. I'm on OSX, so I can't use CUDA; OpenCL is terrible; WebGPU is too new (and, from what I could see, doesn't have the ability to read and write the same buffer in the same pass!?). Regardless, we'd probably need different versions: Metal for OSX, CUDA or OpenCL for Windows depending on the GPU, etc. So that's a lot of work that I can't do myself; it would require expensive engineers.
HVM is meant to be a "low-level target", which is kinda confusing because it is actually very high-level, but in the sense that users shouldn't be writing it directly. Ideally, Kind will compile to HVM, but it needs a lot of adjustments before that is possible.
See this issue for more information, as well as the superposed duplication section on HOW.md.
On #61.
This project has basically no tests. Write tests!
Will edit the thread as I think of more things. Feel free to add to this thread.
https://github.com/Kindelia/HVM/blob/master/HOW.md#hvms-core-language-overview
term ::=
| λvar. term # a lambda
| (term term) # an application
| (ctr term term ...) # a constructor
| num # a machine int
| (op2 term term) # an operator
| let var = term; term # a local definition
rule ::=
| term = term
file ::= list<rule>
The grammar in the link above exacerbates confusion about the issue in the link below:
Also, regardless of this syntax issue, isn't the grammar still technically incorrect, because for ctr applications you need a rule, not just a term? Maybe like one of these, depending on whether you adopt []?
| (term term) # a lambda application
| (Rule term) # a rule application
or
| (term term) # a lambda application
| [rule term] # a rule application
The following program:
(Sum Nil ) = 0
(Sum (Cons x xs)) = (+ x (Sum xs))
(Mul Nil ) = 1
(Mul (Cons x xs)) = (* x (Mul xs))
(Main x) =
let xs = (Cons 1 (Cons 2 (Cons 3 Nil)))
(Pair (Sum xs) (Mul xs))
Sometimes gives the wrong result when evaluated in parallel, due to a synchronization issue with dup nodes. As I mention on HOW.md, the fact that values exist in only one place greatly simplifies parallelism, and most rewrite rules are thread-safe by design, since the thread has full ownership of every memory location it reads/writes. There is an exception to that rule, though: dup nodes. For example, consider the program below:
dup a b = (Cons 1 (Cons 2 (Cons 3 Nil)))
...
It can happen that two different threads arrive at the same dup node (which is impossible for all other nodes): one coming from the a binder, and the other coming from the b binder. This will cause both threads to attempt to apply the DUP-CTR rule at the same time, which may result in errors. To avoid this issue, we must make sure that, whenever a thread accesses a dup node from one of its binders (either a or b here), it locks the area, preventing outside threads from accessing it. This was (naively) prevented by using atomic_inc as a lock, but that is a temporary solution, which was expected to break in some cases, as depicted above. A more robust solution must be carefully designed to ensure that two threads won't access the same dup node at the same time; i.e., once a thread accesses a dup node, it must lock any subsequent access to it, and a thread that attempts to do so must spinlock until the other is done.
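The locking discipline described above can be modeled in miniature. This is a toy Haskell sketch, not the runtime's actual C code: the dup node lives behind an MVar used as a lock, so at most one thread can apply the rewrite, and a second accessor blocks until the first is done and then sees the already-rewritten node. Node and reduceDup are illustrative names.

```haskell
import Control.Concurrent.MVar

-- A dup node that is either still unreduced or already rewritten.
data Node = Unreduced Int | Reduced Int

reduceDup :: MVar Node -> IO Int
reduceDup node = do
  n <- takeMVar node               -- acquire: at most one thread inside
  case n of
    Reduced v   -> do putMVar node (Reduced v); pure v
    Unreduced x -> do
      let v = x + 1                -- stand-in for the DUP-CTR rewrite
      putMVar node (Reduced v)     -- release with the rewritten node
      pure v
```

Whichever thread arrives second simply waits on the MVar instead of racing on the same memory, which is the invariant the issue asks for.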
Some notes I took while reading your introduction to your exciting project from my perspective as someone familiar with performant GHC-Haskell code. Of interest to me is how exactly everything is mapped to hardware, at least to the point of heap object layout and memory reads/writes. Not everybody might need that much detail, but perf seems like the main selling point of overhauling the machine model of LC languages, so that's what I expect to take away from reading. Later... Looking forward to more details on that in the dedicated section. :)
dup and reference counted thunks
Once I read (at the end of this section)
Collection is orchestrated by variables that go out of scope.
I immediately thought "Ah, so dup nodes are like reference counted thunks". Is that true? Why isn't it true?
I presume it has something to do with reduction of open expressions (under lambdas). Later... Indeed, superposition means I need to remember which "reference count" referenced which entry in the superposition. So there is more structure. I wish that was mentioned earlier, so that I knew where to look for the difference.
I find the example at the end of https://github.com/Kindelia/HVM/blob/master/HOW.md#lambda-duplication a bit unconvincing, because no optimising compiler would leave behind an obvious redex like (λx. ...) 2
. GHC would start with
dup f g = λy (Pair 4 y)
(Pair (f 10) (g 20))
to begin with. I know that there are probably better examples with open expressions (otherwise, what's the point?), perhaps one with an expression under a non-redex lambda that can't be floated out because it mentions the lambda-bound variable or some such (no idea, really). Giving one of those would be much more informative!
In other words, I think this section should mention the "killer example".
What follows is more like a few loose connections I can make as someone familiar with the way GHC-compiled programs work.
The takeaway of this section for me seems to be that we want to eta-expand all definitions as much as possible. In GHC that is generally true, too, but problematic if doing so destroys sharing (for example, when expanding over a shared let binding, effectively floating it inside a lambda). Are there similar cases in which sharing is destroyed, due to eagerly dup-ped lambdas?
For example, a clever way to implement a Data.List would be to let all algorithms operate on λ-encoded Church Lists under the hoods, converting as needed. This has the same "deforestation" effect of Haskell's rewrite pragmas, without any hard-coded compile-time rewriting, and in a more flexible way.
In fact, I've argued that GHC's rewrite RULEs from [a]
to fusion helpers should be made more explicit by introducing a data type
newtype FB a = FB { unFB :: forall b. (a -> b -> b) -> b -> b }
mapFB :: (a -> b) -> FB a -> FB b
...
rather than having the helpers scattered about like this. Indeed, if you only used FB
everywhere we use lists today, we wouldn't need any rewrite rules at all.
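The FB sketch above can be completed along these lines (my own filling-in of the "..."; fromList/toList are assumed helper names, not from GHC):

```haskell
{-# LANGUAGE RankNTypes #-}

-- Church-encoded lists: a list is its own foldr.
newtype FB a = FB { unFB :: forall b. (a -> b -> b) -> b -> b }

fromList :: [a] -> FB a
fromList xs = FB (\c n -> foldr c n xs)

toList :: FB a -> [a]
toList (FB k) = k (:) []

-- mapFB fuses by construction: mapFB f (mapFB g xs) composes the two
-- transformations inside the continuation and never builds an
-- intermediate list, with no rewrite rules involved.
mapFB :: (a -> b) -> FB a -> FB b
mapFB f (FB k) = FB (\c n -> k (c . f) n)
```

This is exactly the deforestation-by-representation idea the comment describes: if FB were used everywhere lists are used today, the fusion would fall out of the encoding.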
For example, using map in a loop is "deforested" in HVM. GHC can't do that, because the number of applications is not known statically.
Indeed, mapFB f . mapFB g
is optimised to mapFB (f . g)
, too by GHC. I presume that you mean that Interaction Nets support these optimisations dynamically, for example if mapFB g
is passed as some function parameter x
and we have mapFB f . x
in the body. That indeed seems like it has an edge over what GHC does!
Note that too much cloning will often make your normal forms large, so avoid these by keeping your programs linear.
Yes, this seems like the loss of sharing due to excessive eta-expansion I talk about above. Note also that transforming the arity 2 function into the arity 1 function is unsound, as it makes Add
stricter: Add undefined `seq` ()
diverges after the transformation. So the programmer has to write the arity 1 definition from the beginning.
Judging from what reduce does, it seems to me that the implementation is more like a bytecode interpreter because of the lack of a compile-time mapping from Lambdas to actual machine code. Do you have ideas for how such a mapping might work? Similar to how "graph reduction" transitioned to today's "supercombinator" encodings (which is just the familiar lambda=<static link, code ptr> encoding again).
I saw the implementation of u64 normal_join(u64 tid) {...}
in runtime.c
and its use and wonder whether the HVM language (or is it HOVM language?) guarantees no deadlocks/livelocks in this case?
If so, how? If not, then this should be taken into consideration if I understood the inner workings correctly.
I'm unable to do any mathematical proofs of this, so I've turned to asking about this 😉.
It would be nice to have some form of debugger for HVM - with ability to stop at breakpoints to see what expressions are the names bound to at the moment.
I appreciate the work and thought you put into this project! It got me quite excited, to be honest; now I really want to know if/how interaction nets can actually be mapped to stock hardware efficiently.
But it appears to me that the benchmarks comparing to GHC-compiled Haskell aren't entirely faithful. You say
Since lists are sequential, and since there are no higher-order lambdas, HVM doesn't have any technical advantage over GHC.
That's technically true, but it's also not the kind of code you'd write in Haskell. For one, the tail-recursive nature of range
means that you really have to build up a list that has to be copied during GCs. But Haskell (and HVM) is a lazy language and I'd rather write a generative definition of range
, like
range 0 = Nil
range n = Cons n (range (n-1))
And with this definition, the entire program operates in constant space (modulo GC) due to the perks of laziness. Here's the complete program:
import System.Environment
import Data.Word
data List a = Nil | Cons a (List a)
-- Folds over a list
fold Nil c n = n
fold (Cons x xs) c n = c x (fold xs c n)
-- A list from 0 to n
range 0 = Nil
range n =
let m = n - 1
in Cons n (range m)
-- Sums a big list with fold
main = do
n <- read.head <$> getArgs :: IO Word32
let size = 1000000 * n
let list = range size
print $ fold list (+) 0
This program takes about 1s for n=30
, the same as the original program takes for n=10
for me (and copies 1.6GB during GC). So about a factor 3 speedup.
Interaction nets might enjoy similar speedups, but I don't know that for sure. So it seems like a bit of an apples to oranges comparison.
Ultimately, making use of foldl'
and list fusion on [0..size-1::Int]
(Word32
is broken) means another factor 100 of a speedup, of which I similarly don't know how it carries over to interaction nets.
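For reference, the strict-accumulator variant alluded to above might look like this (a sketch; the factor-100 figure is the commenter's measurement, not mine):

```haskell
import Data.List (foldl')

-- Sums 0..size-1 with a strict left fold. With -O2, GHC's list fusion
-- eliminates the enumeration entirely, compiling this to a tight,
-- allocation-free counting loop.
sumTo :: Int -> Int
sumTo size = foldl' (+) 0 [0 .. size - 1]
```

The point of the comparison is that this version never materializes a list at all, so it sidesteps both GC copying and laziness overhead.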
I think we could get a lot of mileage out of HVM already by exposing stdio through a minimalist line-oriented IO primitive that might look like this.
(IO Input (State s) (Line c s)) = (IO Output (YourProgram (State s) (Line c s)))
(YourProgram state inputline) = ((State ...) (Line ...))
Here I write out (Line c s)
to show it is a cons list of characters (atoms). (State s)
is I guess a state burrito.
A line is a zero-terminated byte list representing an ASCII string.
Then we can pipe into stdin and out of stdout and have some useful programs.
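The proposed loop can be modeled in Haskell (a toy sketch of the idea, not HVM's API; step is an illustrative stand-in for YourProgram): the runtime reads a line, feeds it with the current state to a pure step function, emits the resulting line, and continues with the new state.

```haskell
-- Pure step function: takes the state and an input line, returns the
-- new state and the output line.
step :: Int -> String -> (Int, String)
step s line = (s + 1, show s ++ ": " ++ line)

-- Drive the step function over a stream of lines, threading the state.
run :: Int -> [String] -> [String]
run _ []       = []
run s (l : ls) = let (s', out) = step s l in out : run s' ls

-- Wiring this to real stdio could be as simple as:
--   main = interact (unlines . run 0 . lines)
```

With that, piping into stdin and out of stdout gives useful programs, as the comment suggests.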
I wonder what's the motivation behind rewriting the VM in Rust? Don't you lose one of the major selling points (verifiable VM)?
There's no reason you'd have to write a full scheduler yourself.
Crossbeam can do at least some of the work for you.
It has a work stealing deque and message passing essentials.
It would probably need better error handling. The ideal would be something like Rust's error messages (or better).
In my view there are two approaches for an LSP: creating something on top of an exposed API for handling errors, or creating a checker from zero. The latter is not great, because it means more code to maintain.
Right now the generated C code can't be compiled on Windows due to the use of <pthread.h> etc.
One way it could be done would be to compile it using a compatibility layer like Cygwin, but I never did this before.
Help from some Windows programmer would be appreciated. :3
I'd like to confirm the theoretical background of HOVM.
I've read;
#14 (comment) @leonardohn
But a target is in fact a machine. Take for instance the LLVM compilation targets like X86, ARM, SPIR-V or even WASM; They all resemble computer architectures that are either real hardware (as in X86 and ARM) or a virtual hardware (like SPIR-V or WASM). I agree that historically virtual machines are usually related to traditional hardware (sharing characteristics like using von Neumann-like architectures), but in this particular case, we are talking about an abstraction that doesn't have memory at all, just like original lambda calculus when compared to Turing machines. So, it doesn't even make sense talking about garbage collection inside this machine, unless when considering the implementation of this abstraction over bare metal hardware, which in turn also don't require garbage collection.
This is extremely interesting, especially;
in this particular case, we are talking about an abstraction that doesn't have memory at all, just like original lambda calculus when compared to Turing machines.
My assumption right now is that this VM is a successor of:
parallel_lambda_computer_tests
optlam -> Absal -> Symmetric-Interaction-Calculus (The Symmetric Interaction Calculus)
-> EA-NET + Elementary Affine Core (EA-CORE) (Specification)
-> FM-NET + FM-CORE (Formality-Core) (Formality-Core/Whitepaper.md, Soonad/Whitepaper)
-> HOVM
Am I correct?
Please let me know.
Thanks.
Python script to run blackbox tests on HVM programs.
e.g.
https://github.com/quleuber/HVM/tree/tests/test
Here's a lambda term I have used in the past as a benchmark for my own lambda calculus interpreter. My interpreter runs it in 3.5s, GHC runs it in 2.5s, and neither goes past the minimum heap size of a few megabytes.
When I run it in HVM, it gradually allocates more and more memory, finishing up with around 8.5 GB and then segfaulting.
Steps to reproduce:
$ cat > bench.hvm <<EOF
// Computes 0 by subtracting a number from itself
Main = ((λbig λsub (sub big big))
// the big number, 43'046'721
((λs λz (s(s(s(s z))))) (λs λz (s(s z))) (λs λz (s(s(s z)))))
// the subtraction algorithm
(λn λm λs λz
(n (λy λk (k (s (y (λa λb a))) y))
(λk (k z (λ_ z)))
(m (λk λa λb (b k)) (λa λb a)))))
EOF
$ cargo run -- c bench.hvm
Finished dev [unoptimized + debuginfo] target(s) in 0.01s
Running `target/debug/hvm c bench.hvm`
Compiled to 'bench.c'.
$ gcc -o test -pthread -O2 bench.c
$ ./test
Segmentation fault (core dumped)
First of all, thanks for this very interesting project!
I tried some of the examples, and the last one about lambda arithmetic:
// Increments a Bits by 1
(Inc xs) = λex λox λix
let e = ex
let o = ix
let i = λp (ox (Inc p))
(xs e o i)
// Adds two Bits
(Add xs ys) = (App xs λx(Inc x) ys)
// Multiplies two Bits
(Mul xs ys) =
let e = End
let o = λp (B0 (Mul p ys))
let i = λp (Add ys (B0 (Mul p ys)))
(xs e o i)
// Squares (n * 100k)
(Main n) =
let a = (FromU32 32 (* 100000 n))
let b = (FromU32 32 (* 100000 n))
(ToU32 (Mul a b))
just gives me a segfault:
$ hvm c main && clang -O2 main.c -o main -lpthread && ./main 25
Compiled to 'main.c'.
Segmentation fault (core dumped)
Tested with commit 1da2875.
Furthermore, the QuickSort example just outputs:
(Sum (QSort (Pivot) (Pivot) (Randoms 1 2500000)))
Probably because none of Randoms, Sum, or Pivot is defined ...
Now back to studying HOW.md ... ;)
If a lambda that clones its argument is itself cloned, then its clones aren't allowed to clone each-other.
See #44 for more information, as well as the superposed duplication section on HOW.md.
Example of what is not allowed:
let f = λs λz (s (s z)) // a lambda that clones its argument
let g = f // a clone of that lambda
(g f) // the clone will clone the original
These aren't problems with this implementation. They're just ideas that I wanted to dump down somewhere. Feel free to close this after reading it.
e.g.
hvm compile -M 2G main.hvm
To quote from the README as of b95388f:
This chart isn't wrong: HVM is exponentially faster for function composition, due to optimality, depending on the target function. There is no parallelism involved here. In general, if the composition of a function f has a constant-size normal form, then f^(2^N)(x) is constant-time (O(N)) on HVM, and exponential-time (O(2^N)) on GHC. This can be taken advantage of to design novel functional algorithms. I highly encourage you to try composing different functions and watching how their complexity behaves. Can you tell if it will be linear or exponential? Or how recursion will affect it? That's a very insightful experience!
The bolded section of text, "constant-time (O(N))", doesn't make sense - I think it's meant to say linear time.
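To experiment with the quoted claim, one can build the 2^N-fold composition by repeated self-composition (my own illustration; pow2 is a hypothetical name):

```haskell
-- pow2 n f is f composed with itself 2^n times: each level composes the
-- previous level with itself. GHC performs all 2^n applications when the
-- result is forced; HVM, per the README's claim, can keep the cost
-- proportional to n when f's self-composition has a small normal form.
pow2 :: Int -> (a -> a) -> (a -> a)
pow2 0 f = f
pow2 n f = let g = pow2 (n - 1) f in g . g
```

Applying pow2 n succ to 0 yields 2^n, making it easy to watch how each runtime's cost grows with n.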
Just to be sure, does it all work fine there?
Hmm, I'm actually not sure. I tried one of the benchmarks which ended up compiling ok, with only the following warning:
bash-5.1$ clang -O2 compose_id.hovm.out.c -o main -lpthread
compose_id.hovm.out.c:974:23: warning: incompatible pointer to integer conversion assigning to 'Thd' (aka 'unsigned long') from 'void *'
[-Wint-conversion]
workers[t].thread = NULL;
^ ~~~~
1 warning generated.
But when I try to run the resulting binary, it doesn't seem to finish running, instead seemingly getting stuck:
bash-5.1$ ./main
Reducing.
It hasn't printed anything after that new line for > than 30 minutes (I ended up killing it). Not sure if it's actually not working properly or if it's just my slow CPU though. When I looked at my process manager, the process didn't seem to be using any CPU cycles either.
What is your CPU?
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 40 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: AuthenticAMD
Model name: HP Hexa-Core 2.0GHz
CPU family: 22
Model: 48
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Stepping: 1
Frequency boost: enabled
CPU max MHz: 2000.0000
CPU min MHz: 1000.0000
BogoMIPS: 3992.48
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_op
t pdpe1gb rdtscp lm constant_tsc rep_good acc_power nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3
cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefet
ch osvw ibs skinit wdt topoext perfctr_nb bpext ptsc perfctr_llc cpb hw_pstate ssbd vmmcall bmi1 xsaveopt arat npt lbrv svm_lo
ck nrip_save tsc_scale flushbyasid decodeassists pausefilter pfthreshold overflow_recov
Virtualization features:
Virtualization: AMD-V
Caches (sum of all):
L1d: 128 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 2 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-3
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Full AMD retpoline, STIBP disabled, RSB filling
Srbds: Not affected
Tsx async abort: Not affected
Originally posted by @nothingnesses in #15 (comment)
Right now, HVM only supports U32. The numeric types above should be supported too. I32 and F32 should be easy to add, since they are unboxed, like U32. I64, U64 and F64 require implementing boxed numbers, since they don't fit inside a 64-bit Lnk (due to the 4-bit pointer), but shouldn't be hard. Discussion on whether we should have unboxed 60-bit variants is valid.
I'd be interested in adding support for at least the unboxed numeric types as this seems like a low hanging fruit. I just wanted to check what the concensus was between implicit/explicit casting and how operators should function in regards to mixed input types. My instinct would just be to cast up to the 'bigger' type but maybe operators should only operate on the same type - combined with explicit casts.
@VictorTaelin what do you think?
Originally posted by @samtoth in #38 (comment)
I32, F32, U64, I64, F64
HVM produces incorrect results on some inputs, one of which being:
(Main arg) = ((λa (a (λb ((λc (λd (b (c d)))) a)))) (λe (e e)))
On this input, HVM produces the result λx1 λx2 ((x1 x2) (x1 x2))
while the correct result is λx1 λx2 ((x1 x1) (x2 x2))
(as verified by https://crypto.stanford.edu/~blynn/lambda/, with the corresponding input syntax by adding dots after the variable of each lambda: ((λa. (a (λb. ((λc. (λd. (b (c d)))) a)))) (λe. (e e)))
).
Besides, on other inputs such as:
((λa λb ((b a) λc λd (λe (d a) a)) λf (f f)) λz z)
HVM produces the result λx1 (x1 λx2 (x2 _))
which is not even well-formed.
https://github.com/Kindelia/HVM/blob/14a02de71bd244247abc12749bbc37461feee33d/HOW.md?plain=1#L572
When introducing Dup-Sup, at one point you use the old notation for superposition (<r s>), but everywhere else the other style is used. This should read dup a b = {r s}, if I'm not mistaken.
(App f x) = (f x)
(Succ x) = (+ x 1)
(Main) = (Pair (App Succ 0) (App @x(Succ x) 0))
This HVM code reduces (Main)
to (Pair ((Succ) 0) 1)
. I would intuitively expect (Pair 1 1)
.
My understanding for this is that in (App Succ 0)
, the expression Succ
is actually shorthand for (Succ)
, so we have (App (Succ) 0)
which reduces to ((Succ) 0)
which does not reduce.
Would this be considered an issue, or just a gotcha for newcomers?
As of now, compiling the generated code with clang on Linux or with the -pedantic
flag gives some warnings, mostly related to constants, e.g.
./main.c:341:10: warning: expression is not an integer constant expression; folding it to a constant is a GNU extension [-Wgnu-folding-constant]
case DP0: {
^~~
Also, compiling with tinycc fails, as per the quoted comment below.
Fixing this we can move forward to try embedding a C compiler (clang? tcc? zig cc?).
Please feel free to comment other platforms/compilers that you think should be considered.
I think that I can program anything using just type constructors (and pattern matching / destructuring) and not use lambdas or lambda application at all.