Comments (17)

dnystrom1 commented on September 3, 2024

Yes, I think you are correct. Thanks for pointing that out.

Dave

from vpic.

MoZeWei commented on September 3, 2024

Maybe you can test the whole program again? I ran into a bus error under V8_AVX. I didn't get that error in a small-scale simulation, but I did in a large-scale one. What's the reason? If it were caused by data misalignment, it should also appear at small scale.
Can you give me some advice? That would help a lot. Thank you.

dnystrom1 commented on September 3, 2024

Thanks for the suggestion. We are preparing to make a release soon and will be doing more testing. I am not aware of there being a misalignment of data for either V4_AVX or V8_AVX. There is a misalignment in the field solver for V16_AVX512 and this will cause a run time error when using the GNU compiler. There is not a run time error when using recent versions of the Intel compiler, i.e. versions from the last year or two, because Intel silently replaces requests for aligned load and store intrinsics with the unaligned versions. If the data is properly aligned, you get the same performance as when using the aligned intrinsics. If the data is not properly aligned, the code runs with some performance degradation and does not cause a run time error. This is an issue I would like to fix soon. However, it is not clear to me that alignment is causing your bus error. Do you get the same error at scale when using V4_AVX?
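
To make the alignment point concrete, here is a minimal sketch of how one might check whether a pointer meets the alignment that an aligned vector load expects. The names `is_aligned`, `checked_for_aligned_load`, and `AlignedBlock`, and the 32-byte figure (the width of a 256-bit AVX register), are illustrative assumptions and are not taken from the VPIC sources.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of VPIC): true if the address held in p
// is a multiple of `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical debug guard: call before handing p to an aligned load.
// In a debug build the assert fires immediately for a misaligned pointer,
// which is easier to diagnose than a fault inside the vector code.
inline const float* checked_for_aligned_load(const float* p) {
    assert(is_aligned(p, 32));
    return p;
}

// A block whose storage is forced onto a 32-byte boundary.
struct alignas(32) AlignedBlock {
    float data[16];
};
```

With an `AlignedBlock b`, `b.data` and `b.data + 8` pass the check, while `b.data + 1` sits 4 bytes past a 32-byte boundary and does not.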

Thanks,

Dave

MoZeWei commented on September 3, 2024

I got this error at around step 118000. The padding of all data structs is the same as in the original code under V8, so I doubt that data misalignment caused the error. If not, what could the reason be? It will be tough to track down the bug using gdb.
FYI, I was using code in which advance_p_pipeline is v8 and the other functions are v4. Is using v8 and v4 functions at the same time a problem? Or is it something else?
Thanks for your suggestions. I will try them later.

MoZeWei commented on September 3, 2024

Also, I want to ask a question. If I change the padding of a data struct, which files, other than the header files where the struct is declared, also depend on the padding and need to be changed? I am afraid that I only changed the headers and missed some code that should have been changed too.

dnystrom1 commented on September 3, 2024

There should not be a problem using v4 in some parts of the code and v8 in other parts. I'm not sure what is involved in terms of resource consumption for you to debug your problem at scale. Certainly, debugging at scale is much more difficult for many reasons including the resource consumption required. One strategy that I use if possible is to run the code in various configurations that should give the same physics result and see if the problem can be isolated to a subset of the code. Here are ideas.

  1. Run the code with and without sorting of the particles. Sorting the particles is just an optimization to give better cache utilization. Also, in the current production version of VPIC, there is the ability to use a legacy, thread serial sort or a newer thread parallel sort. So, you can see if the error depends on whether you are sorting and which sort you are using. Also, for the legacy sort, you can sort either in-place or out-of-place. Each of these options exercises different chunks of code.

  2. Run with only one thread per MPI rank. Does that make the problem go away? If so, maybe there is a problem with the code threading. Also, with threading, you can configure the code to either use Pthreads or OpenMP threads. Does the error isolate to one of the thread models?

  3. Turn off all diagnostics including the energy calculation. This should remove chunks of code. Does the error still persist with no diagnostics and no I/O?

  4. Run without the V4 or V8 vector intrinsics implementation, i.e. using only the scalar version that is not vectorized. If the error exists with the scalar version of the code, that is a much smaller and easier subset of code to debug. If the error does not exist with the scalar version of the code, does it exist with both V4 and V8 versions or just one? If the error is with either or both of the V4 and V8 versions of the code, try running with the V4_PORTABLE or V8_PORTABLE versions of the code. If that makes the error go away, then you can make a custom version of the V4 or V8 intrinsics header where some of the functions are implemented with the portable versions and some with the intrinsics versions and use a bisection approach to try to identify which intrinsics function or functions are causing the error. The portable vector implementations come in two variants: one uses a for loop and the other uses the equivalent unrolled loop. Additionally, it is easy to edit the code and run the scalar version for some functions and a vector version for other functions. This could be used to try isolating the problem to the vector version of a particular function.

  5. Turn off divergence cleaning and calculation of divergence errors.

  6. Try building and running with different compilers, different versions of compilers, less aggressive compiler optimization flags, etc. Does that make the error go away?

Depending on your problem scale and how much run time is required to reproduce the error, this could be a very resource intensive and time consuming process to try and isolate the problem. If that is the case, then you probably want to start running the simplest possible code configuration i.e. scalar version without particle sorting using a single pthread per MPI rank. If that makes the error go away, then you can add back in optimizations like particle sorting and vectorization until the problem returns.
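
The bisection approach from item 4 can be sketched as follows. This is a minimal illustration, assuming a single culprit function and a hypothetical `run_fails` callback that builds and runs the code with the first `k` functions in some fixed list using the intrinsics implementation and the rest using the portable one. None of these names come from VPIC.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// run_fails(k) is a hypothetical callback: it returns true if the run fails
// when the first k functions use the intrinsics implementation and the
// remaining ones use the portable implementation.
// Assumptions: run_fails(0) is false, run_fails(n) is true, and exactly one
// function is at fault, so the outcome flips once as k grows.
// Returns the zero-based index of the function whose intrinsics version
// triggers the failure.
inline std::size_t bisect_culprit(std::size_t n,
                                  const std::function<bool(std::size_t)>& run_fails) {
    assert(n > 0);
    std::size_t lo = 0;  // known good: run passes with the first lo functions vectorized
    std::size_t hi = n;  // known bad: run fails with the first hi functions vectorized
    while (hi - lo > 1) {
        const std::size_t mid = lo + (hi - lo) / 2;
        if (run_fails(mid)) {
            hi = mid;
        } else {
            lo = mid;
        }
    }
    return hi - 1;  // adding function hi-1 is what turned a passing run into a failing one
}
```

Each call to `run_fails` stands for a full build and run, so the number of runs grows with the logarithm of the number of functions rather than linearly, which matters when one run takes hours.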

At what scale are you running? Are you running with a single thread per MPI rank or multiple threads per rank? How much run time is required to get to your bus error? Have you tried to reproduce the error at smaller scales? What compiler and version are you using? Are you using any special code features like collision operators, dielectrics, particle emission or injection, etc?

Thanks,

Dave

MoZeWei commented on September 3, 2024

Thanks for your responses. I am running the program using 2000 MPI ranks, with 2 cores and 2 threads per rank. It took about 8 hours to hit the bus error. I haven't tried to reproduce it at a smaller scale; that's what I am going to do next. I was using icc-14.0.2 and gcc-7.2 and didn't use any special code features.
Thanks a lot.

MoZeWei commented on September 3, 2024

It just occurred to me that the reason may be that my disk was full and no more files could be added. When I checked the dump files after the run went down, the number of dump file directories didn't correspond to the step it had reached. And someone on a forum said that a full disk can cause a bus error.
I haven't written much C++, but I know that if a bus error is caused by data misalignment, it should not take so many steps to occur. Maybe the full disk is the reason. I will turn off diagnostics and try again. I am also looking forward to your opinion. Thank you.
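
One way to make an I/O failure such as a full disk show up as an explicit error, rather than as a crash later on, is to check the stream state after every dump write. The sketch below uses only the C++ standard library. `write_dump_checked` is a hypothetical helper and not VPIC's dump code, and whether a full disk was the actual cause here is unconfirmed.

```cpp
#include <cassert>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper (not VPIC's dump routine): writes `data` to `path` as
// raw binary and returns false if opening, writing, or flushing failed.
inline bool write_dump_checked(const std::string& path,
                               const std::vector<float>& data) {
    assert(!path.empty());
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) {
        return false;  // could not open, e.g. missing directory or no permission
    }
    out.write(reinterpret_cast<const char*>(data.data()),
              static_cast<std::streamsize>(data.size() * sizeof(float)));
    out.flush();        // push buffered bytes to the OS so a failed write is reflected in the stream state
    return out.good();  // false if the write or the flush reported an error
}
```

A caller can then stop the run or skip the dump with a clear message when the function returns false.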

dnystrom1 commented on September 3, 2024

The version of icc that you are using is quite old. icc-19 is in production and icc-20 is in beta. The comments I made about icc replacing aligned load requests with unaligned loads would not apply for icc-14. You should get a run time error right away if trying to do aligned loads and stores for data that is not properly aligned.

If you cannot reproduce at a smaller scale, another idea to consider is whether you could do a restart dump close to the time of failure and then determine if you can reproduce the error when you begin from that restart dump. If you can, then using the restart dump as a starting point could speed up the debugging process. You could not change the number of threads/rank after the restart dump but I think you could try all of the other configuration change suggestions.

You are correct that you will get the run time error right away if you are trying to use instructions that require aligned data on data that is not properly aligned.

MoZeWei commented on September 3, 2024

Thanks for that.
I removed the diagnostics functions, and everything works fine when the step count reaches 118000.
Now I am trying to optimize the boundary_p part, which becomes the most time-consuming part as the scale gets larger. I wonder why you don't need to compare the particle's position in the voxel and its momentum against the boundary condition, like

if( (dx == -1) && (ux < 0) )
{
    // load particle into send buffer
}

I am pretty confused about it, because my original code does this and it makes sense. I don't see why you can do it without the comparison.

dnystrom1 commented on September 3, 2024

When you refer to your original code, which code do you mean? An earlier version of VPIC? Or some other code?

In the production version of VPIC, particles which move to another cell are first processed by calls to move_p in the appropriate advance_p_pipeline_xxx function. Within move_p, it is determined whether a particle hits a processor boundary and needs further processing by boundary_p. So, it seems to me that particles that are still on the list for processing by boundary_p will need to be moved to another processor domain.
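
The division of labour described above can be illustrated with a deliberately simplified 1D sketch. It is not VPIC's `move_p` or `boundary_p`: the `Particle` struct and the `advance` and `boundary` functions are hypothetical stand-ins. The point is only that the boundary test happens once, during the move, so the later stage can ship every listed particle without re-checking position or momentum.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 1D particle: a cell index local to this domain and the number
// of cells it wants to move this step (-1, 0, or +1).
struct Particle {
    int cell;
    int step;
};

// Stage 1 (stand-in for the move step): move each particle. A particle whose
// target cell lies outside the local range [0, ncell) is left in place and
// its index is recorded, because it has hit the domain boundary.
inline std::vector<std::size_t> advance(std::vector<Particle>& p, int ncell) {
    assert(ncell > 0);
    std::vector<std::size_t> movers;
    for (std::size_t i = 0; i < p.size(); ++i) {
        const int next = p[i].cell + p[i].step;
        if (next < 0 || next >= ncell) {
            movers.push_back(i);
        } else {
            p[i].cell = next;
        }
    }
    return movers;
}

// Stage 2 (stand-in for boundary handling): every listed particle goes into
// the send buffer. No second comparison is needed, since stage 1 already
// decided that these particles leave the domain.
inline std::vector<Particle> boundary(const std::vector<Particle>& p,
                                      const std::vector<std::size_t>& movers) {
    std::vector<Particle> send;
    send.reserve(movers.size());
    for (std::size_t i : movers) {
        send.push_back(p[i]);
    }
    return send;
}
```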

Does the problem you are trying to model have a streaming velocity? If so, it may be useful to think about whether you can do a processor domain decomposition that minimizes crossing into other processor domains i.e. by having pencil shaped domains where the long dimension is in the streaming direction. If boundary_p is dominating your run time for larger scale, it sounds like you must have a large fraction of your particles changing processor domains each time step. Is that the scenario you are dealing with?
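
A rough way to see why pencil-shaped domains help: under the simplifying assumption that particles are spread uniformly and all stream a distance `d` (in cells) along one axis per step, the fraction leaving a domain through the downstream face each step is about `d / L`, where `L` is the local domain length in cells along that axis. The function below is only this back-of-the-envelope estimate, with hypothetical names. It is not a model of VPIC's actual communication cost.

```cpp
#include <algorithm>
#include <cassert>

// Estimated fraction of a domain's particles that cross into the next domain
// per step, assuming uniform density and a common streaming displacement.
// Both arguments are measured in cells; the result is capped at 1.
inline double crossing_fraction(double local_cells_along_stream,
                                double cells_per_step) {
    assert(local_cells_along_stream > 0.0);
    assert(cells_per_step >= 0.0);
    return std::min(1.0, cells_per_step / local_cells_along_stream);
}
```

For example, with `d = 0.5` cells per step, a domain 32 cells long along the stream loses about 1.6% of its particles per step, while a domain 256 cells long loses about 0.2%, which is eight times fewer.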

MoZeWei commented on September 3, 2024

I have an early version of VPIC called vpic-407, which you mentioned before is out of date.
You are right, the condition check is unnecessary here. Thanks for your explanation.
The decomposition approach sounds good, but when we have to simulate a very large-scale scenario, only adjusting topology_x/y/z relative to nx/y/z (I think this is what you mean) to put the long dimension in the streaming direction is not good enough.
As the scale gets larger, g_time, which counts the time spent in boundary_p as well as synchronize_jf, gets larger. Maybe this is something to optimize in the next release. I am working on this too and look forward to your opinion.
Thank you so much.

MoZeWei commented on September 3, 2024

Also, I found that as the step count grows within one simulation, the time spent on communication such as boundary_p() per 1000 steps increases massively. What's the reason? Or is this related to the model I am simulating?

dnystrom1 commented on September 3, 2024

Sorry to take so long to respond. I have been in training sessions most of the last two days.

Are you able to send me a copy of the input deck which you are having problems with? I don't think I can comment intelligently without knowing more details about your problem.

rfbird commented on September 3, 2024

Closing due to being stale

dnystrom1 commented on September 3, 2024

FWIW, I am pretty sure I fixed that bug.

rfbird commented on September 3, 2024

Yep, I checked before closing. You guys can feel free to reopen if your conversation brings up other changes
