Comments (24)

user18081972 commented on June 12, 2024
AvalonCode.mp4

Here's a video example. Hopefully it helps.

FireNX70 commented on June 12, 2024

@nadiaholmquist I've tested with a Vega 8 (3500U) on Win10 with an old driver from 2023 and it works fine even at 16x, although attempting to switch resolutions while the game was running completely froze the renderer. I've also tested with LLVMpipe in a VM and it works fine even at 16x as well. It looks like this is a bug specific to Nvidia. I'll test Nvidia on Linux with the proprietary driver; I expect it will behave just like it does on Win10.

Edit: Nvidia on Linux with the proprietary driver gives the exact same result as on Win10.

RSDuck commented on June 12, 2024

oh no, there still might be undefined behaviour somewhere in the shaders. I'm going to run the shaders with Mesa and a debug context eventually to see whether it warns about anything.
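
For what it's worth, a minimal sketch of what such a debug setup might look like (assuming a context created with the debug flag; GLEW and the function names here are just for the example, this isn't melonDS code):

    #include <GL/glew.h>
    #include <cstdio>

    // Forward every message the driver emits to stderr. Requires a current
    // GL 4.3+ context that was created with the debug flag.
    static void GLAPIENTRY DebugCallback(GLenum source, GLenum type, GLuint id,
                                         GLenum severity, GLsizei length,
                                         const GLchar* message, const void* userParam)
    {
        fprintf(stderr, "GL debug [type 0x%X, id %u]: %s\n", type, id, message);
    }

    void EnableGLDebugOutput()
    {
        glEnable(GL_DEBUG_OUTPUT);
        glEnable(GL_DEBUG_OUTPUT_SYNCHRONOUS); // report at the offending call, not later
        glDebugMessageCallback(DebugCallback, nullptr);
    }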

user18081972 commented on June 12, 2024

I decided to downgrade my GPU driver (Nvidia RTX 3080) to the oldest supported one (version 456.71) to see if it might be a 'regression'.
The same issue occurs on this driver version from October 7, 2020.

FireNX70 commented on June 12, 2024

glDispatchComputeIndirect will not log OpenGL errors even if the compute work group counts are too high, and checking these would be annoying since I'd have to read the buffer back from VRAM. However, RenderDoc did capture the values used for these. At 6x none of the calls ever go above the minimum of 65535. At 7x there is one call that does exceed that minimum (glDispatchComputeIndirect(<1, 1, 79414>)). This exceeds the maximum supported by my 1080Ti with the 552 driver (it sticks to the minimum of 65535 in the Z axis). I tried replacing the glDispatchComputeIndirect call in the loop with glDispatchCompute(1, 1, 65535). This DID have an effect on the garbage, but it did not fix it. It's likely this is the problem on Nvidia. The local sizes all seem fine.
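
For reference, a small sketch of querying those per-axis limits (plain GL, not melonDS code; the function name is made up):

    #include <cstdio>

    // Assumes a current GL 4.3+ context with loaded function pointers.
    // The spec only guarantees 65535 work groups per axis; drivers may report
    // more (Nvidia typically raises only the X axis).
    void PrintComputeDispatchLimits()
    {
        GLint maxCount[3];
        for (GLuint axis = 0; axis < 3; axis++)
            glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_COUNT, axis, &maxCount[axis]);

        printf("max work group count: X=%d Y=%d Z=%d\n",
               maxCount[0], maxCount[1], maxCount[2]);
    }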

Edit: Here's a gist on all the testing I've been doing https://gist.github.com/FireNX70/10c72169ea0105f998bab6f51443d42e

RSDuck commented on June 12, 2024

thank you for debugging that.

Regarding the barriers, that is obviously a mistake. GL_SHADER_STORAGE_BUFFER is defined as 0x90D2, which should result in a GL error. Though if the driver accepts it and only uses the valid bits, most GPUs don't have barriers that fine grained anyway, so there's probably at least one bit set which will insert a correct barrier.
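
To make the mix-up concrete, a sketch of the two calls (hypothetical call site, presumably glMemoryBarrier; this is not the actual melonDS line):

    // GL_SHADER_STORAGE_BUFFER (0x90D2) is a buffer binding target, not a
    // barrier bit, so per spec glMemoryBarrier should raise GL_INVALID_VALUE.
    // A driver that silently masks out the undefined bits still sees a few
    // valid barrier bits in 0x90D2, which on hardware with coarse-grained
    // barriers is likely enough to act like the intended barrier.
    glMemoryBarrier(GL_SHADER_STORAGE_BUFFER);      // the mistaken value

    // The intended flag when SSBO writes must be visible to later SSBO reads:
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);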

Huh, I always assumed you could dispatch as many work groups as fit inside a 32-bit int (as the count is stored in one). There could be multiple reasons it didn't fix it. If you don't change any of the bindings you will end up just doing the same work multiple times (possibly at the same time).

I need to look into what parallel rdp does again; maybe it has a solution to this. I simplified some things, e.g. we always process all polygons at once, because they are limited to a fixed amount on the DS. One possible easy out would probably be to just increase the tile size to 16x16 at some point. That would also solve the possible issue of the tile buffer index overrunning.

FireNX70 commented on June 12, 2024

> thank you for debugging that.

No problem, once this is fixed I can fully switch to MelonDS.

> Huh, I always assumed you could dispatch as many work groups as fit inside a 32-bit int (as the count is stored in one). There could be multiple reasons it didn't fix it. If you don't change any of the bindings you will end up just doing the same work multiple times (possibly at the same time).

There's a required minimum limit of 65535, and implementations are free to establish higher limits. I haven't checked what limits Mesa and Radeon report; Nvidia has a higher limit on the X dimension but sticks to 65535 for the Y and Z dimensions.

> I need to look into what parallel rdp does again; maybe it has a solution to this. I simplified some things, e.g. we always process all polygons at once, because they are limited to a fixed amount on the DS. One possible easy out would probably be to just increase the tile size to 16x16 at some point. That would also solve the possible issue of the tile buffer index overrunning.

I'm not well versed in graphics or OpenGL, certainly not enough to understand what the compute renderer is doing. Having said that, it looks to me as though the right thing to do would be to cap VariantWorkCount's Z component at the corresponding GL_MAX_COMPUTE_WORK_GROUP_COUNT limit and do multiple passes per variant if needed. Easy to say, probably a pain in the ass to implement. I don't think larger tiles would guarantee the work group count stays below the limit at higher internal resolutions.
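
A rough sketch of that capping idea, assuming for illustration that the per-variant group count were known on the CPU (in melonDS it isn't, since these dispatches are indirect; all names here are made up):

    #include <algorithm>

    // Hypothetical: split one oversized dispatch into several that respect the
    // per-axis limit. The shader would have to add a base offset (here a made-up
    // uniform location, uBaseGroupZ) to gl_WorkGroupID.z.
    void DispatchInChunks(GLuint totalGroupsZ, GLint uBaseGroupZ)
    {
        GLint maxZ;
        glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_COUNT, 2, &maxZ);

        for (GLuint base = 0; base < totalGroupsZ; base += (GLuint)maxZ)
        {
            GLuint count = std::min(totalGroupsZ - base, (GLuint)maxZ);
            glUniform1ui(uBaseGroupZ, base); // offset consumed by the shader
            glDispatchCompute(1, 1, count);
        }
    }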

RSDuck commented on June 12, 2024

The problem is that the amount of work done per variant is not known to the CPU at the time it issues the indirect compute calls (that's the reason they are indirect in the first place). Letting the GPU do the binning, then waiting on that and then dispatching as many compute jobs as necessary would work, but would just waste CPU time on waiting.

After sleeping on this, there is technically another solution: use the other two axes as well and then put all the axes together into one index in the shader.
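
Roughly, that means spreading the job count over more than one axis and recombining it inside the shader. A CPU-side sketch of the math (illustrative only; in the renderer this computation would have to happen in the shader that fills the indirect dispatch parameters):

    #include <algorithm>

    // Spread `totalGroups` over X and Y so neither axis exceeds the 65535
    // minimum (works for any total up to 65535 * 65535 groups). The shader
    // rebuilds a single linear index:
    //     uint job = gl_WorkGroupID.y * gl_NumWorkGroups.x + gl_WorkGroupID.x;
    // and must return early when job >= totalGroups, since x*y can overshoot.
    void DispatchLinear(GLuint totalGroups)
    {
        if (totalGroups == 0)
            return;

        const GLuint kMaxPerAxis = 65535;
        GLuint x = std::min(totalGroups, kMaxPerAxis);
        GLuint y = (totalGroups + x - 1) / x;
        glDispatchCompute(x, y, 1);
    }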

I quite like the idea of increasing the tile size at very high resolutions though. The increased tile size results in a lower number of tiles still filling the same area. As long as the total number of tiles is less than 2^16, no single variant can result in a compute job larger than that.
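
As a rough back-of-the-envelope check of that 2^16 condition (my numbers, assuming a single 256x192 screen scaled by the resolution factor):

    // Total tile count for one 256x192 screen at a given scale and tile size.
    // As long as this stays below 2^16, no single variant can produce a bigger
    // compute job than that:
    //     8x8 tiles:   768 * scale^2  -> exceeds 65535 from 10x upwards
    //     16x16 tiles: 192 * scale^2  -> 49152 at 16x, still fine
    //     32x32 tiles:  48 * scale^2  -> 12288 at 16x
    int TotalTiles(int scale, int tileSize)
    {
        return (256 * scale / tileSize) * (192 * scale / tileSize);
    }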

What of course remains a potential issue is that the tile buffer can overflow, as it has a fixed size. It would be possible to conservatively estimate the size necessary each frame by using the bounding boxes of each polygon, but with some polygon shapes that might result in excessive allocations.

The two solutions don't exclude each other; I guess I can also just implement them both.

FireNX70 commented on June 12, 2024

Scaling the tile size with the internal resolution sounds good. Do keep in mind there's a limit on local size too, although the minimum of 1024 should be fine (1024x1024 for a single tile would be huge). As for the tile buffer: IMO it's better to be safe than sorry, especially if it doesn't overestimate by a lot.

Edit: 11x (almost 4K on a single screen) uses about 2650 MB of VRAM right now. It's not nothing, but there's some margin for the tile buffer to grow.

FireNX70 commented on June 12, 2024

I've been messing around with scaled tiles. 16x16 tiles work fine; 32x32 tiles (the max size, since the tile size is used for the X and Y local sizes in certain shaders and I'm sticking to the minimum GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS of 1024) present a couple of problems.

glDispatchCompute(TilesPerLine*TileLines/32, 1, 1), used for ShaderClearCoarseBinMask, is a problem because the total tile count (TilesPerLine*TileLines) no longer divides evenly by 32 with size-32 tiles. I think we divide TilesPerLine*TileLines by 32 because of the shader's local size. The division works fine with both 16 and 48, but neither is optimal for Nvidia (warp size of 32) or Radeon (RDNA wave size is 32, GCN wave size is 64). I'm not sure if the divisor for CoarseBinStride has to match the local size too; if so, this only works with 16, which is as far from optimal as 48 for Nvidia and RDNA but probably worse for GCN.

glDispatchCompute(((gpu.GPU3D.RenderNumPolygons + 31) / 32), ScreenWidth/CoarseTileW, ScreenHeight/CoarseTileH), used for ShaderBinCombined, is also a problem. With 32x32 tiles, CoarseTileH would end up being 128. At lower tile sizes it's 32 or 64, both of which 192 is a multiple of, so the vertical resolution was guaranteed to be a multiple of CoarseTileH. 192 is not a multiple of 128, so a CoarseTileH of 128 becomes a problem at certain resolutions (specifically, odd multiples of the base resolution). This, I think, is what causes a blank strip at the bottom of the screen with odd scales.

Making CoarseTileCountY variable should fix it, but it introduced more bands (the number of bands matches the scale). With 6-tile-tall coarse tiles (when using 32x32 tiles) the height of a coarse tile is 192, which ensures the internal resolution is a multiple of CoarseTileH, so the number of coarse tile lines matches the scale. What I think is really happening is that there's one two-tile-thick band per coarse tile, because there's a hardcoded 4 somewhere, so 2 rows of tiles get ignored per coarse tile.
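
A quick check of that divisibility problem, assuming a coarse tile spans 4 tiles vertically (so 128 pixels with 32x32 tiles):

    #include <cstdio>

    // 192*scale is always a multiple of 32 and 64, but only a multiple of 128
    // for even scales; the uncovered remainder is presumably the blank strip
    // at the bottom.
    int main()
    {
        for (int scale = 1; scale <= 16; scale++)
        {
            int height = 192 * scale;
            printf("scale %2d: %4d %% 128 = %3d\n", scale, height, height % 128);
        }
        return 0;
    }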

Just in case here's what I've done: https://github.com/FireNX70/melonDS/tree/try-fixing-the-compute-renderer

user18081972 commented on June 12, 2024

I built your branch, and it fixes it until 10x IR, where static blue lines appear in ChibiRobo:
[screenshot]

And in Avalon Code, static stripes of missing pixels:
[screenshot]

FireNX70 commented on June 12, 2024

Yeah, those are the stripes I was talking about. Starting at 10x it uses 32x32 tiles.

RSDuck commented on June 12, 2024

Sorry, I have an Intel iGPU, so I can't debug these things.

They probably just need even more work tiles, or some other index starts overflowing. I need to come up with a smarter way of allocating the work tile amount, or at least give an option to override it.

FireNX70 commented on June 12, 2024

I tried doubling MaxWorkTiles and it didn't fix it. Increasing the size of FinalTileMemory didn't do anything either.

takeshineale128 commented on June 12, 2024

No issues on my GTX 1660 GPU at 16x IR on Win 11 with driver version v552.44.

FireNX70 commented on June 12, 2024

I've tested both updating the driver and using a T1000. Neither fixed this. So it's not the driver version or the architecture. I really don't think the OS would make a difference, especially if it's Win10 vs Win11. Were you really testing with the compute renderer?

takeshineale128 commented on June 12, 2024

opengl

FireNX70 commented on June 12, 2024

There are two OpenGL renderers. Are you testing with a dev version, or are you using 0.9.5?

takeshineale128 commented on June 12, 2024

0.9.5

RSDuck commented on June 12, 2024

> Increasing the size of FinalTileMemory didn't do anything either.

FinalTileMemory isn't dynamically allocated, so it cannot overflow.

Not really sure what could be the issue here. It definitely seems to be tile-related, and the index should not overflow (yet).

nadiaholmquist commented on June 12, 2024

Can't seem to reproduce this on an AMD GPU (Steam Deck). Will try on other stuff later.

user18081972 commented on June 12, 2024

> opengl

@takeshineale128
This is the OpenGL renderer we are talking about, i.e. the compute shader one, not the classic one:
[screenshot]

takeshineale128 commented on June 12, 2024

Not using dev builds; my 0.9.5 build has no OpenGL option for classic or compute shader. It's probably classic.
Screenshot 2024-05-18 051021

KostaSaizo7 commented on June 12, 2024

Blue Dragon - Awakened Shadow
Windows 10
Nvidia GPU
Compute shader x8 (x7 and below is fine)
https://www.youtube.com/shorts/TX5Nq3RdX5Q
