Comments (20)

peastman commented on June 2, 2024

About 1/3 of the time is spent creating the integrators. The good news about that is that it shouldn't depend much on system size. I think that's mostly building and compiling the kernels. But you also could probably make it faster. You have four integrators, but the last two are nearly identical to the first two. The only difference is that a couple of constants are divided by two. If you create a global parameter that you set to either 0.5 or 1.0, you should be able to eliminate two of the integrators.
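Roughly like this (just a sketch of the idea, not your actual MM4 integrator; the "timeScale" variable and the update expressions are only illustrative):

#include "OpenMM.h"
using namespace OpenMM;

CustomIntegrator* createScalableIntegrator(double stepSize) {
    CustomIntegrator* integrator = new CustomIntegrator(stepSize);
    // Set to 1.0 for the full-step variants and 0.5 for the half-step ones.
    integrator->addGlobalVariable("timeScale", 1.0);
    integrator->addUpdateContextState();
    integrator->addComputePerDof("v", "v + timeScale*dt*f/m");
    integrator->addComputePerDof("x", "x + timeScale*dt*v");
    integrator->addConstrainPositions();
    return integrator;
}

// Before stepping, pick the variant instead of switching integrators:
//   integrator->setGlobalVariableByName("timeScale", 0.5);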

Most of the rest of the time is spent initializing the two CustomNonbondedForces. Would it be possible to merge them into a single force? That would likely speed it up at least a bit. For example, the single most expensive part is validating the exclusions in these lines:

for (int i = 0; i < owner.getNumExclusions(); i++) {
    int particle1, particle2;
    owner.getExclusionParticles(i, particle1, particle2);
    if (particle1 < 0 || particle1 >= owner.getNumParticles()) {
        stringstream msg;
        msg << "CustomNonbondedForce: Illegal particle index for an exclusion: ";
        msg << particle1;
        throw OpenMMException(msg.str());
    }
    if (particle2 < 0 || particle2 >= owner.getNumParticles()) {
        stringstream msg;
        msg << "CustomNonbondedForce: Illegal particle index for an exclusion: ";
        msg << particle2;
        throw OpenMMException(msg.str());
    }
    if (exclusions[particle1].count(particle2) > 0 || exclusions[particle2].count(particle1) > 0) {
        stringstream msg;
        msg << "CustomNonbondedForce: Multiple exclusions are specified for particles ";
        msg << particle1;
        msg << " and ";
        msg << particle2;
        throw OpenMMException(msg.str());
    }
    exclusions[particle1].insert(particle2);
    exclusions[particle2].insert(particle1);
}

With 4.8 million exclusions, that takes a few seconds. I don't see any practical way to parallelize it, but if you had only a single CustomNonbondedForce, it would only need to do it once. I also could probably speed it up a little by only recording each exclusion in one set. Let me try that.
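Something along these lines (just a sketch of the one-set idea, written as a standalone function with an invented signature rather than the real OpenMM internals): key each pair by its smaller particle index, so validation does one set lookup and one insertion per exclusion instead of two.

#include <algorithm>
#include <set>
#include <sstream>
#include <stdexcept>
#include <utility>
#include <vector>

void validateExclusions(const std::vector<std::pair<int, int>>& pairs, int numParticles) {
    std::vector<std::set<int>> exclusions(numParticles);
    for (const auto& [a, b] : pairs) {
        if (a < 0 || a >= numParticles || b < 0 || b >= numParticles)
            throw std::runtime_error("Illegal particle index for an exclusion");
        // Store the pair only under the smaller index; insert() reports duplicates.
        int lo = std::min(a, b), hi = std::max(a, b);
        if (!exclusions[lo].insert(hi).second) {
            std::stringstream msg;
            msg << "Multiple exclusions are specified for particles " << a << " and " << b;
            throw std::runtime_error(msg.str());
        }
    }
}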

peastman commented on June 2, 2024

Which is a good illustration of how restructuring algorithms to be parallelizable is hard! :)

peastman commented on June 2, 2024

Can you profile it to see what is taking up that time? There's a lot that happens while constructing a Context. Which part is causing the slowdown?

philipturner commented on June 2, 2024

[Two profiler screenshots attached, captured 2024-02-22.]

peastman commented on June 2, 2024

The numbers don't add up, probably because it was compiled with optimizations? That shifts the boundaries between functions and makes it hard to interpret. Can you send me a serialized System, Integrator, and State? Then I can reproduce it and figure out what's going on.

I see you're using the Metal platform. Does Context creation take a similar amount of time with OpenCL?

philipturner commented on June 2, 2024

The system was too large to attach in a comment (1 GB uncompressed), so I made a Google Drive link instead.

https://drive.google.com/drive/folders/171okSdyQUJcMimwTTItqgTVnOZOdsjmf?usp=sharing

When you read the system file, the atomic masses may look strange. I'm using SI units and they're in yoctograms. HMR = 2.0.

I see you're using the Metal platform. Does Context creation take a similar amount of time with OpenCL?

For some reason, when the Metal plugin isn't loaded, the only platform that appears is Reference. I don't know why it appears that way in my workflow. Probably some detail of the build environment.

I am using code that's out of sync with the main branch by several months. Perhaps that's a reason it's slower than expected.

peastman commented on June 2, 2024

Your system has over 700,000 particles. 8 seconds to create a context for such a large system is pretty fast. Can you describe your workflow, and why the context creation time is a problem?

philipturner commented on June 2, 2024

I need instant feedback loops while engineering something. For example, I regularly recompile a nano-part when a single atom is wrong. That means the entire source code file gets recompiled, then re-executed. The latency to execute the code is typically ~100 ms for systems of this size.

When that jumps to 10,000 ms, you break out of “the flow”. A GPU cluster will not beat Amdahl’s law and fix that 10 seconds. The purpose of using that cluster is to get several-picosecond simulations down from minutes to seconds per simulation, with low-latency data transfer to my renderer. I thought overcoming a network latency issue was the biggest problem, and I had solved it.

A few weeks ago, I made a 20-million atom system that compiled in 1.1 seconds. There was no simulation involved though. You can easily see how 20 million atoms with the current OpenMM::Context becomes unworkable, taking 4 minutes just to get the most basic feedback.

philipturner commented on June 2, 2024

I estimate this project is going to be ~5 million atoms for the housing and ALU combined. Extrapolating the current setup time from 700,000 atoms to that size, ~64 seconds of latency is barely workable.

peastman commented on June 2, 2024

Can you give a fuller description of what you're doing? What source code do you recompile and what does it do? What is the role of OpenMM in the process? There's no way you'll get context creation down to anything like 100 ms for a system of that size, so you need to find a design that doesn't require creating a new context every time you change one atom.

philipturner commented on June 2, 2024

Can you give a fuller description of what you're doing?

Nanomechanical ALU with ~720 logic switches (mechanical transistors), each ~4000 atoms. Driven by a flywheel contained in an isolated system, clock speed 600 MHz.

What source code do you recompile and what does it do?

https://www.youtube.com/watch?v=AC34BQt2ODM

What is the role of OpenMM in the process?

https://www.youtube.com/watch?v=7ugZSuBasDg

There's no way you'll get context creation down to anything like 100 ms

No, but 10 seconds might be reasonable. And the compute cluster has an AMD Threadripper with ~50 cores. Theoretically, with the setup parallelized across all of those cores, 1.2 seconds would be possible for compute jobs there. But 10 seconds would be good enough for 5M atoms.

philipturner commented on June 2, 2024

I think that's mostly building and compiling the kernels.

The kernels by themselves take ~100 ms to compile with OpenCL, provided it hits the Metal shader cache. I think there's a hidden $O(n)$ bottleneck in the integrator initialization, independent of OpenCL compilation.

About 1/3 of the time is spent creating the integrators. The good news about that is that it shouldn't depend much on system size. I think that's mostly building and compiling the kernels. But you also could probably make it faster. You have four integrators, but the last two are nearly identical to the first two. The only difference is that a couple of constants are divided by two. If you create a global parameter that you set to either 0.5 or 1.0, you should be able to eliminate two of the integrators.

I provide two options in my framework. The first is OpenMM's built-in velocity Verlet with 2.5 fs timestep. The second is multiple timestep with 2.15 / 4.3 fs. The first one has a higher $O(n)$ prefactor, while the second has higher $O(1)$ latency. The setup time is the same with either integrator. Here is with .verlet:

MM4System.init
0 to 1: 7.500173524022102e-06
1 to 2: 0.01960912486538291
2 to 3: 0.8759976667352021 <-- 0.88 seconds
3 to 4: 2.874992787837982e-06

MM4ForceField.init
0 to 1: 0.895652791718021
1 to 2: 8.065219749929383 <-- 8 seconds
2 to 3: 4.998873919248581e-07
  // MM4Context, L19-39
  switch descriptor.integrator {
    case .multipleTimeStep:
      self.compoundIntegrator = OpenMM_CompoundIntegrator()
      self.integrator = compoundIntegrator!
      
      for start in [false, true] {
        for end in [false, true] {
          var descriptor = MM4CustomIntegratorDescriptor()
          descriptor.start = start
          descriptor.end = end
          
          let integrator = MM4CustomIntegrator(descriptor: descriptor)
          integrator.integrator.transfer()
          let index = compoundIntegrator!.addIntegrator(integrator.integrator)
          customIntegrators[descriptor] = index
        }
      }
    case .verlet:
      self.verletIntegrator = OpenMM_VerletIntegrator(stepSize: 0)
      self.integrator = verletIntegrator!
    }

Most of the rest of the time is spent initializing the two CustomNonbondedForces. Would it be possible to merge them into a single force?

That would definitely be possible. I split them into separate forces because of the switching function: I want to apply it to the vdW force, but leave it out of the electrostatic force. I could probably write an energy expression that applies the switching function explicitly, allowing the two forces to fuse into one (see the sketch after the source code below). Previously I didn't see a need to invest that effort.

Source Code
    // MM4NonbondedForce, L61-89
    let force = OpenMM_CustomNonbondedForce(energy: """
      epsilon * (
        -2.25 * (min(2, radius / safe_r))^6 +
        1.84e5 * exp(-12.00 * (safe_r / radius))
      );
      safe_r = max(0.001, r);
      epsilon = select(isHydrogenBond, heteroatomEpsilon, hydrogenEpsilon);
      radius = select(isHydrogenBond, heteroatomRadius, hydrogenRadius);
      
      isHydrogenBond = step(hydrogenEpsilon1 * hydrogenEpsilon2);
      heteroatomEpsilon = sqrt(epsilon1 * epsilon2);
      hydrogenEpsilon = max(hydrogenEpsilon1, hydrogenEpsilon2);
      heteroatomRadius = radius1 + radius2;
      hydrogenRadius = max(hydrogenRadius1, hydrogenRadius2);
      """)
    force.addPerParticleParameter(name: "epsilon")
    force.addPerParticleParameter(name: "hydrogenEpsilon")
    force.addPerParticleParameter(name: "radius")
    force.addPerParticleParameter(name: "hydrogenRadius")
    
    if let cutoffDistance = descriptor.cutoffDistance {
      force.nonbondedMethod = .cutoffNonPeriodic
      force.cutoffDistance = Double(cutoffDistance)
      
      force.useSwitchingFunction = true
      force.switchingDistance = Double(cutoffDistance * pow(1.0 / 3, 1.0 / 6))
    } else {
      force.nonbondedMethod = .noCutoff
    }
    // MM4ElectrostaticForce, L87-115
    let prefactor = MM4ElectrostaticForce.prefactor
    var K, C: Float
    if let cutoffDistance = descriptor.cutoffDistance {
      (K, C) = MM4ElectrostaticForce.reactionFieldConstants(
        cutoffDistance: cutoffDistance,
        dielectricConstant: descriptor.dielectricConstant)
    } else {
      (K, C) = (.zero, .zero)
    }
    
    let force = OpenMM_CustomNonbondedForce(energy: """
      \(prefactor) * charge1 * charge2 * (
        1 / r + \(K) * r^2 - \(C)
      );
      """)
    force.addPerParticleParameter(name: "charge")
    
    if let cutoffDistance = descriptor.cutoffDistance {
      // Force is discontinuous at the cutoff
      // (https://www.desmos.com/calculator/a9my66qauu). However, adding a
      // switching function like with the vdW force makes it worse. It
      // introduces a jump in force as the energy drops faster than before.
      // There's not much wrong with discontinuous forces, but there's
      // definitely something wrong with discontinuous energies.
      force.nonbondedMethod = .cutoffNonPeriodic
      force.cutoffDistance = Double(cutoffDistance)
    } else {
      force.nonbondedMethod = .noCutoff
    }
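A rough sketch of what the fused force could look like through the C++ API (I'd adapt it to the Swift bindings): it assumes both terms share the same cutoff, the helper name and arguments are invented, and the switching polynomial is written out explicitly so it multiplies only the vdW term. It should reproduce OpenMM's built-in switching function.

#include "OpenMM.h"
#include <string>
using namespace OpenMM;

// rs = switching distance, rc = cutoff distance (assumes a cutoff is in use, rs < rc),
// K/C = reaction-field constants, prefactor = Coulomb prefactor, all computed as above.
CustomNonbondedForce* createMergedForce(double rs, double rc, double K, double C,
                                        double prefactor) {
    std::string energy =
        std::string("sw * vdw + elec;") +
        "vdw = epsilon * (-2.25*(min(2, radius/safe_r))^6 + 1.84e5*exp(-12.00*(safe_r/radius)));" +
        "elec = " + std::to_string(prefactor) + " * charge1*charge2 * (1/r + " +
            std::to_string(K) + "*r^2 - " + std::to_string(C) + ");" +
        // Explicit switching: 1 below rs, then the quintic polynomial between rs and rc.
        // It only multiplies the vdW term; the electrostatic term stays unswitched.
        "sw = 1 - step(r - " + std::to_string(rs) + ")*(10*x^3 - 15*x^4 + 6*x^5);" +
        "x = (r - " + std::to_string(rs) + ") / " + std::to_string(rc - rs) + ";" +
        "safe_r = max(0.001, r);" +
        "epsilon = select(isHydrogenBond, heteroatomEpsilon, hydrogenEpsilon);" +
        "radius = select(isHydrogenBond, heteroatomRadius, hydrogenRadius);" +
        "isHydrogenBond = step(hydrogenEpsilon1 * hydrogenEpsilon2);" +
        "heteroatomEpsilon = sqrt(epsilon1 * epsilon2);" +
        "hydrogenEpsilon = max(hydrogenEpsilon1, hydrogenEpsilon2);" +
        "heteroatomRadius = radius1 + radius2;" +
        "hydrogenRadius = max(hydrogenRadius1, hydrogenRadius2);";
    CustomNonbondedForce* force = new CustomNonbondedForce(energy);
    for (const char* name : {"epsilon", "hydrogenEpsilon", "radius", "hydrogenRadius", "charge"})
        force->addPerParticleParameter(name);
    force->setNonbondedMethod(CustomNonbondedForce::CutoffNonPeriodic);
    force->setCutoffDistance(rc);
    return force;
}

With a single force, the exclusion validation and the rest of the per-force setup would only run once.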

I don't see any practical way to parallelize it

I would revisit that conclusion. There's always a way to make an algorithm more parallel. I've simplified and parallelized vastly more complex things than this. Perhaps I could reproduce the C++ code in Swift and try modifying it.

philipturner commented on June 2, 2024

I could try a “nonbonded counter-force”. Include the 1,3 and maybe 1,2 pairs in the nonbonded force. Then add another bonded force between each pair with the inverse effect. It could be an option to enable only when setup time vastly outweighs the increase in $O(n)$ cost.
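A minimal sketch of that idea, reusing the vdW energy expression from earlier; createCounterForce is a hypothetical helper, the per-pair epsilon/radius would come from the same combining rules the exclusion would have used, and the electrostatic term could be cancelled the same way.

#include "OpenMM.h"
using namespace OpenMM;

CustomBondForce* createCounterForce() {
    // Same vdW functional form as the nonbonded force, but negated, so adding it
    // between a 1,3 pair cancels the interaction left in the nonbonded force.
    CustomBondForce* counter = new CustomBondForce(
        "-epsilon * (-2.25*(min(2, radius/safe_r))^6 + 1.84e5*exp(-12.00*(safe_r/radius)));"
        "safe_r = max(0.001, r)");
    counter->addPerBondParameter("epsilon");
    counter->addPerBondParameter("radius");
    return counter;
}

// For each 1,3 pair (i, j) that would otherwise be an exclusion:
//   counter->addBond(i, j, {pairEpsilon, pairRadius});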

philipturner commented on June 2, 2024

Is it possible to skip the detection of molecule groups entirely? I don't need this functionality. Atoms are sorted in Morton order beforehand. There are no solvent or identical water molecules; this is ultra-high vacuum at 0.4 Kelvin.

peastman commented on June 2, 2024

Reordering isn't just for water. It applies to any set of identical molecules. In your video, it looks like there are a lot of sets like that. You can presort the atoms, but can you guarantee they won't move significantly during the simulation?

Trying to disable molecule detection sounds dangerous to me. The potential benefit is small. The potential risk of messing something up is large.

bdenhollander commented on June 2, 2024

Are parallel std::for_each loops from C++17 an option here?
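For example (a sketch, assuming the toolchain ships the parallel algorithms, e.g. libstdc++ with TBB):

#include <algorithm>
#include <execution>
#include <vector>

int main() {
    std::vector<double> data(1'000'000, 1.0);
    // Independent per-element work, so the parallel policy is safe; the runtime
    // load-balances chunks of the range across the available cores.
    std::for_each(std::execution::par, data.begin(), data.end(),
                  [](double& x) { x = x * x + 1.0; });
}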

philipturner commented on June 2, 2024

Trying to disable molecule detection sounds dangerous to me. The potential benefit is small. The potential risk of messing something up is large.

These are rigid bodies. They deform slightly, but not enough to significantly alter which atoms are close to which. The vast majority of atoms are on the interior of an object. They're also positionally constrained, so they never move much relative to other molecules. For example, the logic switches in an ALU can't swap places without leaving machine phase.

For this ALU, I think 90% of the molecules will be unique and non-identical. For example, each logic rod has a knob placed at a slightly different location. That could change it from a regular gate input to an inverted (NOT) input. Omitting a knob means the rod doesn't interact with the other rod at that junction. The logic structure is highly asymmetric and inhomogeneous.

philipturner commented on June 2, 2024

Are parallel std::for_each loops from C++17 an option here?

Is that like DispatchQueue.concurrentPerform? That's a convenient Swift function that performs a load-balanced execution of several function calls across all the CPU cores. Provided you remove data dependencies and divide the work into homogeneous tasks of ~20 microseconds each, this fully utilizes the CPU for almost any time-consuming operation. I don't know whether multithreading is that easy in C++.

peastman commented on June 2, 2024

Multithreading is easy. Restructuring the algorithm to be parallelizable is hard! For example, the code shown above compares every exclusion to every other to make sure they all involve distinct particles. You could structure that as an O(n^2) loop over all pairs of exclusions. That would be easy to parallelize, but also really inefficient. Instead it's structured as a single pass over exclusions, storing the particle indices into sets that can be checked easily. That's much more efficient, but also requires the exclusions to be processed strictly in order.

I'm making good progress just with micro-optimizations. In the case of comparing exclusions, I was able to cut the number of operations on sets in half with just a minor change to the code. For findMoleculeGroups(), it turned out a lot of the time was spent constructing vectors that were used once and then discarded. Moving the declaration out of the loop eliminated most of that time.
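The pattern, generically (not the actual findMoleculeGroups() code):

#include <vector>

void processAll(int n) {
    std::vector<int> scratch;        // constructed once, outside the loop
    for (int i = 0; i < n; i++) {
        scratch.clear();             // keeps the capacity, so no reallocation per iteration
        // ... fill and use scratch for item i ...
    }
}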

philipturner commented on June 2, 2024

I do find, even in some compute-intensive code, parts that fail to parallelize. For example, a reduction operation across an array was always slower on multicore. Always make the code as fast as possible on a single core. Only once there's zero room for improvement should you reach for multithreading.

You could structure that as an O(n^2) loop over all pairs of exclusions. That would be easy to parallelize, but also really inefficient.

You could pre-allocate memory regions for any set that will be accessed pseudo-sequentially, then parallelize the writing of data into different places. Atomics have quite low overhead for such purposes. I estimate no more than ~10-20 exclusions will ever exist per atom for my use case. Even a conservative pre-allocation is not a bottleneck: 20 x 4 bytes x 1 million atoms / 100 GB/s = 0.8 milliseconds. A sketch of the idea is at the end of this comment.

As an example, this code multithreaded a bottleneck in parameter assignment. The subsequent pass collected up the results of the MT part on single-core. I didn't parallelize the latter b/c it wasn't the bottleneck.
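A minimal sketch of the pre-allocation idea above (standalone code, not OpenMM's internals): fixed-capacity slots per particle, with one atomic counter each, so multiple threads can record exclusion partners concurrently.

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

constexpr int kMaxExclusionsPerParticle = 20;  // conservative bound from the estimate above

struct ExclusionTable {
    std::vector<std::atomic<int>> counts;  // partners recorded so far, per particle
    std::vector<int32_t> partners;         // kMaxExclusionsPerParticle slots per particle

    explicit ExclusionTable(int numParticles)
        : counts(numParticles),
          partners(static_cast<std::size_t>(numParticles) * kMaxExclusionsPerParticle) {}

    // Safe to call from many threads at once, as long as each exclusion is added once.
    void add(int particle, int partner) {
        int slot = counts[particle].fetch_add(1, std::memory_order_relaxed);
        if (slot >= kMaxExclusionsPerParticle)
            throw std::runtime_error("Exceeded pre-allocated exclusion slots");
        partners[static_cast<std::size_t>(particle) * kMaxExclusionsPerParticle + slot] = partner;
    }
};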
