Hello, I am looking into this project to learn what you have done in this work. I

OpenCL failed to compile/vectorvisor stall

samginzburg commented on May 27, 2024

from vectorvisor.

Comments (4)

sjlcwn commented on May 27, 2024

When I run the following command to start VectorVisor, with the same testing command:

cargo run --release -- --ip=0.0.0.0 --heap=3145728 --stack=262144 --hcallsize=131072 --partition=false --serverless=true --vmcount=1 --vmgroups=1 --interleave=1 --pinput=true --fastreply=true --lgroup=1 --nvidia=true --input=benchmarks/scrypt/target/wasm32-wasi/release/scrypt.wasm
go run run_scrypt.go 127.0.0.1 8000 1 1 300 256

VectorVisor complains about CL_NV_INVALID_MEM_ACCESS. The full log is in the attachment.
invalid_memory.txt

Do you have any idea why this happens, or any advice on debugging the generated kernel?

SamGinzburg commented on May 27, 2024

Hi,

Thanks for taking the time to check out VectorVisor! Those are all great questions.

First, I should probably explain the "partitions" concept, as it wasn't included in the final paper (and it was also excluded from the evaluation). Early on (and without the partitioner in general) I ran into problems with register spilling and was looking for ways to reduce that overhead for the comparatively large programs I was trying to run. The idea is that calling into other kernels via the CPS transform is cheaper in some cases than the extra register spilling, and we did see positive results.

The solution I came up with was to partition the resulting OpenCL kernels by function size/register usage based on the control flow graph (which functions call which other functions + some other heuristics). The partitioner does actually work despite not being in the paper, but when the --partition=false flag is present, it needs to be paired with a --maxdup=0 flag to ensure that functions aren't duplicated (as they would be if partitioning were enabled). Alternatively, you can enable partitioning and play around with the maxdup value. The default value is "1" (1 extra include per function), which is why this error is popping up.

I think you brought up a good point though, which is that this mismatch between --partition=false and maxdup could be checked in src/main and an error returned at that point (I was the only user for quite some time, so I never ran into this haha).
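The check suggested above could look roughly like the following. This is only a minimal sketch: the function name `validate_partition_flags`, its signature, and the error message are hypothetical and are not taken from the VectorVisor source. It assumes the two flags have already been parsed into a `bool` and an integer.

```rust
/// Hypothetical helper (not part of VectorVisor): rejects the flag
/// combination described above before any kernels are generated.
///
/// `partition` mirrors --partition and `maxdup` mirrors --maxdup.
fn validate_partition_flags(partition: bool, maxdup: u32) -> Result<(), String> {
    // With partitioning disabled, any duplication of functions is invalid,
    // so the only accepted value for maxdup is 0.
    if !partition && maxdup != 0 {
        return Err(format!(
            "--partition=false requires --maxdup=0 (got --maxdup={})",
            maxdup
        ));
    }
    Ok(())
}
```

Called early in startup, a check like this would turn the confusing OpenCL compile failure into an immediate, readable error when --partition=false is used with the default maxdup of 1.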

For the invalid memory access, I suspect it is a result of the lack of compiler optimizations applied to the WASM binary, which causes extra memory usage in the final program. It's also possible that commenting those lines out caused weird behavior/bugs elsewhere. The set entry point debug line only prints the expected value when you are running 1 function per partition (an artifact of a debug configuration I used often). You can run with the debugcallprint flag enabled, which will correctly log the entry point in any configuration, although it will dump a large volume of data to stdout (all function calls, etc.).

If you are testing locally, you can make use of the run_cached_bin.sh script (poorly named, I admit) in the benchmarks dir, which will run all of the compiler optimizations we ran in the final paper + invoke VectorVisor with the correct CLI arguments.

To run the scrypt benchmark locally, you would add the following line to the script:

# format is:
# command, benchmark, heap, stack, hypercall buffer, vmcount, ignore last arg
comp "scrypt" "3145728" "131072" "131072" "VMCOUNT" "5120"

and replace VMCOUNT with a VM count that fits your local GPU (ignore the last argument, as the script was copy-pasted from the one I used in the final evaluation).

After the benchmark has been run at least once, you can replace the comp command with runbin and VV should load much faster.

I just reran the benchmark on an RTX 2080 Ti + 2048 VMs and it works for me. Let me know if it still doesn't work after.

sjlcwn commented on May 27, 2024

With the help of run_cached_bin.sh, I could run the scrypt benchmark and get some results.

#added to run_cached_bin.sh
comp "scrypt" "3145728" "131072" "131072" "64" "5120"

go run run_scrypt.go 127.0.0.1 8000 64 1 300

server is active... starting benchmark
Benchmark complete: 232249 requests completed
duration: 300.000000
Total RPS: 774.163333
On device execution time: 39445040.440148
Average request latency: 82649467.696640
queue submit time: 9007.574870
submit count: 1.000000
unique fns: 1.000000
Request Queue Time: 3900.823155
Device Time: 82170664.866686
overhead: 488422.273129
compile time: 326291776113.782898
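
As a quick sanity check on the output above, the reported Total RPS matches the completed request count divided by the duration. The sketch below assumes that this is how the figure is derived and that the duration is in seconds; the function name `total_rps` is hypothetical and is not taken from run_scrypt.go.

```rust
/// Hypothetical helper: requests per second over the whole benchmark run,
/// assuming Total RPS = completed requests / duration in seconds.
fn total_rps(requests_completed: u64, duration_secs: f64) -> f64 {
    requests_completed as f64 / duration_secs
}
```

With the values from the log, `total_rps(232249, 300.0)` agrees with the reported 774.163333 to the printed precision.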

And I realized that some parameters of the testing script are related to the VV settings.

Thanks for the explanation and suggestions.

SamGinzburg commented on May 27, 2024

No problem. I'll close this issue for now as it seems to be resolved.
