Comments (7)
I disagree that this isn't the kind of code you write in Haskell in general. Yes, for the specific problem of folding over a list, it is possible to optimize it to a constant-space loop... in some cases. But that doesn't apply to most other Haskell functions, which rely heavily on deep recursion. Even fold itself sometimes can't be turned into a constant-space loop. For example, if the operator were subtraction, we couldn't use `foldl`, because `-` isn't associative.
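To make the associativity point concrete, here is a minimal illustration (plain GHC Haskell, not code from the benchmark) of why a right fold over a non-associative operator can't simply be swapped for a left fold:

```haskell
main :: IO ()
main = do
  -- (+) is associative, so the two folds agree:
  print (foldl (+) 0 [1, 2, 3])  -- 6
  print (foldr (+) 0 [1, 2, 3])  -- 6
  -- (-) is not associative, so a right fold cannot be
  -- replaced by a (strict) left fold:
  print (foldl (-) 0 [1, 2, 3])  -- ((0 - 1) - 2) - 3 = -6
  print (foldr (-) 0 [1, 2, 3])  -- 1 - (2 - (3 - 0)) = 2
```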
That said, the purpose of this benchmark was to stress-test how efficient each engine is at recursing, pattern-matching and allocating. If you replace that with a function that does no recursion, then it wouldn't be benchmarking how well the engine recurses, but how well it deals with tight loops. I've tested this, and HVM performs about 3x slower than GHC in that case, which I guess is expected, because HVM does nothing to optimize these tight loops... yet! But it absolutely could, in which case both would obviously perform the same, since they'd generate the same low-level code.
Remember, HVM is not optimized! It is the first version of a prototype, compared against a mature state-of-the-art compiler. It should be much slower than GHC, since devs have put decades of effort into optimizing it. For example, GHC has an arena-based allocator, which makes great use of the cache; HVM doesn't. Yet it performs as well as GHC on recursive functions. That's actually surprising. Think of how well it would perform if it had even a fraction of GHC's total man-hours dedicated to optimizing it. That's the message I'm trying to pass.
---
What I'm saying is: HVM is not an optimizer, it is just the runtime. So, if you want a fair comparison, it is STG vs HVM, not GHC vs HVM. A (kinda) reasonable way to do it is to compile Haskell with GHC, and then run the resulting program on both GHC and HVM. Are we on the same page here?
Yes, I agree with that. I'd even be fine with looking at the STG code, identifying code patterns that you wouldn't generate if HVM were the target, and replacing them with more perf-aware patterns, playing the role of an HVM backend for GHC.
With that in mind, your comparison isn't entirely faithful again, because you are using `Word32` instead of `Int`, which, as I said in the OP, has a broken `Enum` instance that doesn't allow fusion:

> Ultimately, making use of `foldl'` and list fusion on `[0..size-1::Int]` (`Word32` is broken) means another factor 100 of a speedup

For `Word32` and `Int32` you get STG that is very different to the `Go` function I gave.
Fortunately, the issue is already fixed on GHC master (and I believe will also be fixed in GHC 9.0.3 and 9.2.2). So there are multiple ways in which you can arrive at good code:

- Use GHC master
- Replace `Word32` with `Word` (which seems to be even faster than `Int`)
- Replace `Word32` with `Int`
So, for example

```haskell
import Data.List
import Data.Word

-- Sums a big list with fold
main = do
  let size = 100000000 :: Word
  let list = [0..size-1]
  print $ foldl' (+) 0 list
```

executes in 50ms on my machine.
Heck, we can simply skip the whole list fusion shenanigans and benchmark a counting loop in Haskell, too. I'll translate your `Go` function back to Haskell:
```haskell
import Data.Word

go :: Word32 -> Word32 -> Word32 -> Word32
go size i s = aux (i == size) size i s
  where
    aux True  size i s = s
    aux False size i s = go size (i + 1) (s + i)

main = do
  print $ go 100000000 0 0
```

Which takes 72ms on 8.10.7 (which has pretty bad sized `Word` support because of the recently added sized primops) and 26ms on GHC master (which is the same perf as with `Word`).
If I want to evaluate this 8 times, like

```haskell
import Data.Word

go :: Word32 -> Word32 -> Word32 -> Word32
go size i s = aux (i == size) size i s
  where
    aux True  size i s = s
    aux False size i s = go size (i + 1) (s + i)

main = do
  print $ go 100000000 0 0
  print $ go 100000000 0 1
  print $ go 100000000 0 2
  print $ go 100000000 0 3
  print $ go 100000000 0 4
  print $ go 100000000 0 5
  print $ go 100000000 0 6
  print $ go 100000000 0 7
```
it takes 580ms on 8.10.7 and 200ms on GHC master. That's still way faster than your parallelised program. Now you say
> I wouldn't even care if it was 10x slower in single-thread, no-allocations, non-optimal scenarios, like that one. HVM is cool because it allows automatic parallelism, no garbage-collection and beta-optimality.
If you want to focus on auto parallelisation, you might like https://wiki.haskell.org/GHC/Data_Parallel_Haskell where they tried something very similar. Meanwhile, all the DPH stuff has been removed from GHC again, because nobody was there to use or maintain it. As always, the cases where you can't make use of your cores through (superior, because it requires less sync) data/task parallelism are indeed rare, and at some point the synchronisation overhead dominates. I wouldn't expect HVM to reach 200ms on this benchmark even with 80 cores, simply because of contention issues. So "50x slower" indeed matters.
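For reference, the eight independent sums can already be run on separate cores in plain GHC Haskell using nothing but base's concurrency primitives. This is a sketch, not the benchmark code: `sumFrom` is a hypothetical name, and the size is deliberately smaller than the benchmark's so it runs quickly even unoptimised. Compile with `-threaded` and run with `+RTS -N` to actually use multiple cores.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM, forM_)

-- The strict counting loop, as a plain function: sums 0..size-1, starting from s0.
sumFrom :: Word -> Word -> Word
sumFrom size s0 = loop 0 s0
  where
    loop i s | i == size = s
             | otherwise = loop (i + 1) (s + i)

main :: IO ()
main = do
  -- Fork one thread per independent sum; ($!) forces the result
  -- on the worker thread rather than in the main thread.
  vars <- forM [0 .. 7] $ \k -> do
    v <- newEmptyMVar
    _ <- forkIO (putMVar v $! sumFrom 1000000 k)
    pure v
  -- Collect the eight results, in order.
  forM_ vars $ \v -> takeMVar v >>= print
```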
I don't buy the No GC claim, because effectively the `dup` nodes take the role of reference counters (generalised, of course!) and that is a form of GC.
Beta optimality is indeed a nice, flat-out win. But I wonder on what workloads it really matters compared to the STG-compiled programs that GHC produces (e.g., making use of transformations such as FloatOut/full laziness, simplification, eta expansion, ...). And obviously, I wonder what the costs are that pay for the win. It seems there are quite severe losses for sequential code.
This is actually the question that matters most to me: Is Beta optimality worth the losses? Will the losses get significantly smaller over time?
> That said, I'm not sure why you find this benchmark useful at all. It does no allocations, but allocations and garbage collection play a major role in functional runtimes. I don't know of any Haskell program that doesn't do a lot of allocation, so this example isn't representative of real-world usage.
That allocations play a major role in functional programs is true; but we have to compare apples to apples somehow, right?
And it seems we just established that a simple counting loop will regress by a factor of ~50 when switching to a hypothetical HVM backend for GHC (as implemented at the moment! Not saying that won't change. But I still wonder how). This is the kind of hard and honest data I'm interested in! At the same time, there are cases and programs where GHC performs notoriously badly because of churn on the GC during build-up of huge data structures. In these cases, HVM seems to be nearly on par. Your main README makes it sound like that is the common case for functional programs. I claim it is not; at some point the data structure is built up and ends up in Gen 1, where we will seldom trace it during Gen 0 GC.
Besides, Haskell often has "fast non-allocating hot loops" such as repeated `(Int)Map.lookup` over the same map. And I think due to the `dup`ping that must be involved (e.g., for N lookups over the same map, the root would have to be `dup`ped N times, correct?), that'd suffer quite badly when compiled to HVM.
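As a concrete (hypothetical) shape of such a hot loop, here is a sketch using `Data.Map.Strict` from the `containers` package that ships with GHC; the map, key range, and loop count are made up for illustration:

```haskell
import qualified Data.Map.Strict as Map

main :: IO ()
main = do
  let m = Map.fromList [(k, 2 * k) | k <- [0 .. 999 :: Int]]
  -- 100000 lookups over the same, unchanged map: no allocation in the
  -- loop body for GHC, but presumably N duplications of the map's root
  -- when compiled to HVM.
  print $ sum [Map.findWithDefault 0 (i `mod` 1000) m | i <- [0 .. 99999]]
```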
Indeed, all benchmarks you give build up a data structure and then consume it exactly once. Maybe compare examples where the same list or tree is traversed N times and benchmark that.
I stand by my initial claim that the README suggests a false balance, namely that HVM
- is almost on par with GHC's (STG to native code) backend for sequential workloads
- faster when autoparallelisation is on
- and even faster when it manages to make an asymptotic difference due to beta optimality
(2) doesn't excite me very much TBH and we can have it (had it) in Haskell, too (others might be excited by it), (3) is nice to have, just as list fusion, and I think (1) is simply not true (at the moment!), as the counting loop example demonstrates.
---
I'm not sure if we're talking about the same thing, but seems like we are. What I'm saying is: HVM is not an optimizer, it is just the runtime. So, if you want a fair comparison, it is STG vs HVM, not GHC vs HVM. A (kinda) reasonable way to do it is to compile Haskell with GHC, and then run the resulting program on both GHC and HVM. Are we on the same page here? If so, you said that the following Haskell program:
```haskell
import Data.List
import Data.Word

-- Sums a big list with fold
main = do
  let size = 100000000
  let list = [0..size-1] :: [Word32]
  print $ foldl' (+) 0 list
```
is compiled to the equivalent of:
```
(Go size i s) = (Go.aux (== i size) size i s)
(Go.aux 1 size i s) = s
(Go.aux 0 size i s) = (Go size (+ i 1) (+ s i))
(Main n) = (Go 100000000 0 0)
```
Correct? So that'd be a fair benchmark. On my machine, this takes 1.631s on HVM (and a total of just 72 bytes allocated), and 0.934s on GHC. So, yes, in this case, GHC is 1.74x faster, which isn't surprising given that it is, obviously, considerably more optimized, right? So, if these are the numbers you're interested in, there they are.
That said, I'm not sure why you find this benchmark useful at all. It does no allocations, but allocations and garbage collection play a major role in functional runtimes. I don't know of any Haskell program that doesn't do a lot of allocation, so this example isn't representative of real-world usage. `foldr` is more representative precisely because it involves allocations and garbage collection, which aren't always avoidable. If you argue that `foldr` is unfair to Haskell just because GHC spends 88% of the time collecting garbage, then this just reinforces the importance of not needing a garbage collector!
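For completeness, the `foldr` variant being referred to looks something like this. This is a sketch, not the README benchmark itself, and the size is reduced so it finishes quickly even uncompiled:

```haskell
-- foldr-based sum: consumes the list through a chain of lazy (+)
-- frames, exercising allocation and the GC rather than compiling
-- to a tight register loop.
main :: IO ()
main = print (foldr (+) 0 [0 .. 99999 :: Int])
```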
Even if HVM somehow could not be optimized to be on par with GHC on this example, which is a very unrealistic assumption, keep in mind the main point of HVM isn't to beat GHC in single-core, but to do so when either parallelism or optimality can be exploited. For example, if the algorithm above is repeated 8 times:
```haskell
import Data.List
import Data.Word

-- Sums a big list with fold
main = do
  let size = 100000000
  let list = [0..size-1] :: [Word32]
  print $
    ( foldl' (+) 0 list
    , foldl' (+) 1 list
    , foldl' (+) 2 list
    , foldl' (+) 3 list
    , foldl' (+) 4 list
    , foldl' (+) 5 list
    , foldl' (+) 6 list
    , foldl' (+) 7 list
    )
```
VS
```
(Go size i s) = (Go.aux (== i size) size i s)
(Go.aux 1 size i s) = s
(Go.aux 0 size i s) = (Go size (+ i 1) (+ s i))
(Main n) = (Tuple
  (Go 100000000 0 0)
  (Go 100000000 0 1)
  (Go 100000000 0 2)
  (Go 100000000 0 3)
  (Go 100000000 0 4)
  (Go 100000000 0 5)
  (Go 100000000 0 6)
  (Go 100000000 0 7)
)
```
Then it takes just 2.359s on HVM, versus 7.656s on GHC, making HVM 3.24x faster.
This isn't even the point though, and I am not selling HVM as faster than GHC in general (even though many times it is). That is written right on the README. I wouldn't even care if it was 10x slower in single-thread, no-allocations, non-optimal scenarios, like that one. HVM is cool because it allows automatic parallelism, no garbage-collection and beta-optimality. These things alone make it a very powerful tool to the functional programming community, and I'm not sure why you seem upset with it?
---
> Think of how well it would perform if it had a fraction of GHC's total man-hours dedicated to optimizing it?
Well, that is actually what I'm trying to find out. TBH, I could speed up the program by another factor of 10 if I made use of list fusion, entirely eliminating all sources of allocation. I wonder if these factors carry over to HVM, too, if someone put in the work to implement a few foldr/build fusion primitives (which shouldn't be too hard). It's hard to tell, because as you say it obviously hasn't seen as much careful work put into it.
But at the same time, you can write programs in basically any language and instantly be in the realm of "a factor of 30 within the perf of C/GHC/whatever". The question is whether you can ultimately deliver on the remaining constant factors. Pretending those constant factors don't exist and saying "we can do the same later on" is quite hand-wavy. Plus, speedups in GHC Haskell that are simply due to less GC pressure can't possibly carry over to HVM, because it doesn't do any GC (although it does a variant of reference counting, so that might count as GC after all). So it's a bit unproductive to do the comparison with a program that has a huge and continually growing memory residency, which is kind of a worst case for a non-incremental, generational GC, and still claim that the program is a workload that GHC should excel at.
I would much rather see "This is the most efficient way in GHC Haskell to solve this problem, and this is the most efficient way to solve it as an HVM program". And if GHC Haskell is 10 times faster in some cases, so be it! You can continually chip off the delta as you improve your VM implementation.
---
> Well, that is actually what I'm trying to find out. TBH, I could speed up the program by another factor of 10 if I made use of list fusion, entirely eliminating all sources of allocation. I wonder if these factors carry over to HVM, too, if someone put in the work to implement a few foldr/build fusion primitives (which shouldn't be too hard). It's hard to tell, because as you say it obviously hasn't seen as much careful work put into it.
Note that compile-time foldr/build fusion primitives are not required to have fusion in HVM. It happens at runtime. Now, compile-time inlining is a different thing, and, yes, aggressive inlining can result in faster programs, essentially because you moved work to compile time. HVM, once mature enough, could optimize `foldl'` exactly like GHC. There is no reason not to. Simple proof: compile HVM to Haskell, compile Haskell to Core (with GHC), compile Core back to HVM. That is why benchmarking `foldl'` is kinda pointless. GHC is the master of compile-time computation. HVM does none of that. It isn't a whole compiler. It is just the runtime. Technically it replaces STG, not GHC. In fact, compiling Haskell to HVM via GHC is one way to use it!
> I would much rather see "This is the most efficient way in GHC Haskell to solve this problem, and this is the most efficient way to solve it as an HVM program".
I think you do not realize, though, that HVM operates basically identically to GHC until higher-order lambdas are involved. For programs like `sum (range 0 n)`, optimal evaluation and lazy evaluation are the same, because this program makes no use of runtime closures. It can perfectly compile to a direct C loop, and both GHC and HVM can do that.

Does that make sense?
---
> HVM, once mature enough, could optimize foldl' exactly like GHC. Simple proof: compile HVM to Haskell, compile Haskell to Core (with GHC), compile Core back to HVM.
That's not a proof at all. But we agree on the property to assert! Note that GHC doesn't end with Core.
> I think you do not realize though that HVM operates basically identically to GHC until there are higher-order lambdas involved. For programs like sum (range 0 n), optimal evaluation and lazy evaluation are the same, because this program makes no use of runtime closures. It can perfectly compile to a direct C loop, and both GHC and HVM can do that.
`sum (range 0 n)` doesn't have any "higher-order lambdas" (not sure what that means); just regular lambdas. It is a recursive consumer `sum` called on a recursive producer `range`, the latter of which has to build up the whole data structure before it can do anything useful. At least that's the way GHC compiles it; it makes no attempt to turn a tail-recursive function into a generative corecursion (simply because that has different perf trade-offs and is not always beneficial).
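Spelled out, the producer/consumer pair has roughly this shape. This is a sketch with hypothetical names (`range`, `sum'`), not the benchmark's exact code, and a small `n` for illustration:

```haskell
-- Recursive producer: must materialise cons cells as it goes.
range :: Word -> Word -> [Word]
range i n
  | i == n    = []
  | otherwise = i : range (i + 1) n

-- Recursive consumer: one frame per element, no tail call.
sum' :: [Word] -> Word
sum' []       = 0
sum' (x : xs) = x + sum' xs

main :: IO ()
main = print (sum' (range 0 100000))
```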
As I said, your benchmark capitalises on that by a factor of 3. Perhaps because HVM is able to do the necessary conversion somehow. Or, much more likely, it is faster because it doesn't have a tracing GC. Note the amount of garbage your benchmark produces:
```
$ ./slow 10 +RTS -s
2280707264
     726,224,016 bytes allocated in the heap
   1,539,414,664 bytes copied during GC
     321,709,520 bytes maximum residency (10 sample(s))
      55,102,000 bytes maximum slop
             730 MiB total memory in use (0 MB lost due to fragmentation)

                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       683 colls,     0 par    0.405s   0.406s     0.0006s    0.0015s
  Gen  1        10 colls,     0 par    0.567s   0.567s     0.0567s    0.2281s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.131s  (  0.130s elapsed)
  GC      time    0.973s  (  0.973s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    1.104s  (  1.104s elapsed)

  Alloc rate    5,550,184,596 bytes per MUT second

  Productivity  11.9% of total user, 11.8% of total elapsed
```
12% productivity. Or, in other words, 88% of the execution time is spent doing garbage collection! You are merely benchmarking the garbage collector!
I agree that GHC is miles ahead in terms of possible optimisations. But note that the argument I gave is quite similar to your argument of "But GHC does so many more optimisations, it's unfair to judge HVM by comparing to a Haskell program that is written with performance in mind".
I don't think we'll get to an agreement in the argument above. But maybe we can agree on the following:
Given a faithful reproduction of optimised STG/Core of program X.hs produced by GHC in HVM code, how much faster/slower is the HVM code to execute compared to the GHC-compiled code?
And then in theory, we wouldn't care about whether X.hs would be written with perf in mind or not, right? Since perf is important, I'd like to write X.hs in a way that is particularly efficient and not dominated by allocations. Hence I pick
```haskell
import System.Environment
import Data.List

-- Sums a big list with fold
main = do
  n <- read . head <$> getArgs :: IO Int
  let size = 1000000 * n
  let list = [0..size]
  print $ foldl' (+) 0 list
```
and see

```
$ ./fast 10 +RTS -s
50000005000000
          57,288 bytes allocated in the heap
           3,312 bytes copied during GC
          44,408 bytes maximum residency (1 sample(s))
          25,224 bytes maximum slop
               2 MiB total memory in use (0 MB lost due to fragmentation)

                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0002s    0.0002s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.012s  (  0.012s elapsed)
  GC      time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.012s  (  0.012s elapsed)

  %GC     time    0.0%  (0.0% elapsed)

  Alloc rate    4,909,298 bytes per MUT second

  Productivity  95.3% of total user, 95.4% of total elapsed
```
which is 100 times as fast as the previous example and produces the correct 64-bit result. Now I want to take the inner loop there and translate it to HVM code. Just FTR, `-O2 -ddump-final-stg` produces something like the below for me.
```
Main.main1
  :: GHC.Prim.State# GHC.Prim.RealWorld
  -> (# GHC.Prim.State# GHC.Prim.RealWorld, () #)
[GblId, Arity=1, Str=<L,U>, Unf=OtherCon []] =
\r [void_0E]
  case
    Foreign.Marshal.Alloc.allocaBytesAligned
      Foreign.Storable.$fStorableBool7
      Foreign.Storable.$fStorableBool7
      System.Environment.getArgs1
      GHC.Prim.void#
  of
  {
  Unit# ipv1_s390 [Occ=Once1!] ->
    let {
      sat_s39v [Occ=Once1, Dmd=<L,1*U>] :: GHC.Base.String
      [LclId] =
        \s []
          let {
            sat_s394 [Occ=Once1] :: GHC.Base.String
            [LclId] =
              \u []
                case ipv1_s390 of {
                  [] -> GHC.List.badHead;
                  : x_s392 [Occ=Once1] _ [Occ=Dead] -> x_s392;
                };
          } in
          case
            Text.ParserCombinators.ReadP.run Main.main4 sat_s394
          of
          sat_s395 [Occ=Once1]
          {
          __DEFAULT ->
            case Text.Read.readEither8 sat_s395 of {
              [] -> Main.main3;
              : x_s398 [Occ=Once1!] ds1_s399 [Occ=Once1!] ->
                case ds1_s399 of {
                  [] ->
                    case x_s398 of {
                      GHC.Types.I# y_s39c [Occ=Once1] ->
                        case *# [1000000# y_s39c] of y1_s39d {
                        __DEFAULT ->
                          case ># [0# y1_s39d] of {
                          __DEFAULT ->
                            let-no-escape {
                            Rec {
                            $wgo_s39f [InlPrag=NOUSERINLINE[2], Occ=LoopBreakerT[2]]
                              :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Base.String
                            [LclId[JoinId(2)],
                             Arity=2,
                             Str=<L,U><L,U>,
                             Unf=OtherCon []] =
                              \r [w_s39g ww_s39h]
                                case ==# [w_s39g y1_s39d] of {
                                __DEFAULT ->
                                  case
                                    +# [ww_s39h w_s39g]
                                  of
                                  sat_s39k [Occ=Once1]
                                  {
                                  __DEFAULT ->
                                    case +# [w_s39g 1#] of sat_s39j [Occ=Once1] {
                                    __DEFAULT -> $wgo_s39f sat_s39j sat_s39k;
                                    };
                                  };
                                1# ->
                                  case
                                    +# [ww_s39h w_s39g]
                                  of
                                  sat_s39l [Occ=Once1]
                                  {
                                  __DEFAULT ->
                                    case
                                      GHC.Show.$wshowSignedInt
                                        0# sat_s39l GHC.Types.[]
                                    of
                                    {
                                    (#,#) ww5_s39n [Occ=Once1]
                                          ww6_s39o [Occ=Once1] ->
                                      : [ww5_s39n ww6_s39o];
                                    };
                                  };
                                };
                            end Rec }
                            } in $wgo_s39f 0# 0#;
                          1# ->
                            case GHC.Show.$wshowSignedInt 0# 0# GHC.Types.[] of {
                            (#,#) ww5_s39q [Occ=Once1] ww6_s39r [Occ=Once1] ->
                              : [ww5_s39q ww6_s39r];
                            };
                          };
                        };
                    };
                  : _ [Occ=Dead] _ [Occ=Dead] -> Main.main2;
                };
            };
          };
    } in
    GHC.IO.Handle.Text.hPutStr'
      GHC.IO.Handle.FD.stdout sat_s39v GHC.Types.True GHC.Prim.void#;
  };
```
Which is quite close to the inner loop

```
(Go size i s) = if i == size then s else Go size (i+1) (s+i)
(Main n) =
  let size = (* n 1000000)
  (Go size 0 0)
```

modulo a bit of unrolling that can't contribute more than a factor 2 of speedup in favor of GHC.
This kind of comparison is what I'm interested in. Now you might say "This is a toy benchmark!", but then so was the original one. Plus, I really don't want to give up on C-level performance for tight loops (or at least staying within a factor of 5 of it) when I write such code.
Having numbers for this benchmark would say a lot to me. The benchmark from the main README, not so much.
---
Hello again! I missed this comment. To answer (if a little late):
> This is actually the question that matters most to me: Is Beta optimality worth the losses? Will the losses get significantly smaller over time?
My argument is that there are no losses. There is nothing in HVM that makes it inherently slower than GHC. In some cases, today, it will be, because GHC has a constellation of micro-optimizations that HVM doesn't. Sometimes GHC will go as far as compiling a long loop to a constant, so obviously it will be faster. But that's all. One is a 3-decades-old compiler and runtime, improved by major universities and companies; the other is a new runtime written by a single dev. If you want a fair comparison, take a Haskell program, compile it to GHC Core, and run it on both HVM and GHC's own runtime. You'll notice they're very close together.
> And it seems we just established that a simple counting loop will regress by a factor of ~50 when switching to a hypothetical HVM backend for GHC
No, that's not the case. Take GHC Core and run it on HVM.
> Besides, Haskell often has "fast non-allocating hot loops" such as repeated (Int)Map.lookup over the same map.
No, a suitable optimizer should replace that with a linear version, so there will be no `dup` at all. Kind should take care of that kind of situation.
> I stand by my initial claim that the README suggests a false balance, namely that HVM
>
> - is almost on par with GHC's (STG to native code) backend for sequential workloads
> - faster when autoparallelisation is on
> - and even faster when it manages to make an asymptotic difference due to beta optimality
I do not suggest that. My claim is that HVM is getting closer and closer (but not almost on par) with GHC in general, and will be significantly faster than it soon (in some cases, already) for real applications due to parallelism and GC-freedom. If you see any sentence that suggests otherwise, please let me know.