Comments (29)
Related document: https://eigen.tuxfamily.org/dox/TopicMultiThreading.html. The key is to call Eigen::initParallel();
before creating threads. I am investigating this.
from open3d.
Another trick you may want to consider:
Give each thread its own independent AtA and Atb,
then sum them together at the end.
In this case there is no #pragma omp critical
inside the loop.
from open3d.
I tried
// Option #5
// Directly Building JtJ and Jtr, trick for parallelization
Eigen::Matrix6d ATA;
Eigen::Vector6d ATb;
tstart = time(0);
std::vector<Eigen::Matrix6d> ATA_array;
std::vector<Eigen::Vector6d> ATb_array;
ATA_array.resize(NCORR);
ATb_array.resize(NCORR);
#ifdef _OPENMP
#pragma omp parallel for schedule(static)
#endif
for (int i = 0; i < NCORR; i++) {
const Eigen::Vector3d &vs = data[i];
const Eigen::Vector3d &vt = data[i];
const Eigen::Vector3d &nt = data[i];
double r = (vs - vt).dot(nt);
Eigen::Vector6d A_r;
A_r.block<3, 1>(0, 0) = vs.cross(nt);
A_r.block<3, 1>(3, 0) = nt;
ATA_array[i] = A_r * A_r.transpose();
ATb_array[i] = A_r * r;
}
ATA.setZero();
ATb.setZero();
for (int i = 0; i < NCORR; i++) {
ATA += ATA_array[i];
ATb += ATb_array[i];
}
tend = time(0);
std::cout << "Option5 took " << difftime(tend, tstart) << " second(s)." << std::endl;
and it gets
Option5 took 20 second(s).
In addition, Eigen::initParallel();
does not change the execution time. This function is not needed if the compiler provides native C++11 support.
from open3d.
No, I was suggesting two options:
- Use an OpenMP reduction clause:
#pragma omp parallel for schedule(static) reduction(+:ATA, ATb) ...
- Create exactly omp_get_max_threads()
copies of ATA_array
and ATb_array
:
#ifdef _OPENMP
int max_thread_num = omp_get_max_threads();
std::vector<Eigen::Matrix6d> ATA_array(max_thread_num);
std::vector<Eigen::Vector6d> ATb_array(max_thread_num);
#else
std::vector<Eigen::Matrix6d> ATA_array(1);
std::vector<Eigen::Vector6d> ATb_array(1);
#endif
for (auto &m : ATA_array) m.setZero(); // Eigen does not zero-initialize
for (auto &v : ATb_array) v.setZero();
int thread_num = 1;
#ifdef _OPENMP
#pragma omp parallel for schedule(static)
#endif
for (int i = 0; i < NCORR; i++) {
#ifdef _OPENMP
if (i == 0) {
thread_num = omp_get_num_threads();
}
int thread_id = omp_get_thread_num();
#else
int thread_id = 0;
#endif
const Eigen::Vector3d &vs = data[i];
const Eigen::Vector3d &vt = data[i];
const Eigen::Vector3d &nt = data[i];
double r = (vs - vt).dot(nt);
Eigen::Vector6d A_r;
A_r.block<3, 1>(0, 0) = vs.cross(nt);
A_r.block<3, 1>(3, 0) = nt;
ATA_array[thread_id] += A_r * A_r.transpose(); // accumulate, do not overwrite
ATb_array[thread_id] += A_r * r;
}
ATA.setZero();
ATb.setZero();
for (int i = 0; i < thread_num; i++) {
ATA += ATA_array[i];
ATb += ATb_array[i];
}
tend = time(0);
std::cout << "Option5 took " << difftime(tend, tstart) << " second(s)." << std::endl;
from open3d.
Another comment:
You can use three::ScopeTimer
for benchmarking. It is very handy 👍
from open3d.
Option 1:
Severity Code Description Project File Line Suppression State
Error C3031 'ATb': variable in 'reduction' clause must have scalar arithmetic type OdometryRGBD C:\git\Open3D\src\Experimental\OdometryRGBD\OdometryRGBD.cpp 273
I guess Eigen::MatrixXd
cannot be used with reduction?
Option 2:
took 1245.81 ms.
Awesome! But I would also like to see Option 1 working, as it is much more compact.
from open3d.
See this answer:
https://stackoverflow.com/questions/30496365/parallelize-the-addition-of-a-vector-of-matrices-in-openmp
You can also beautify the code of option 2 a bit. Say, put it all in one big omp parallel block and get rid of omp_get_max_threads(). See this for example:
https://stackoverflow.com/questions/30943452/openmp-c-parallel-for-loop-with-reduction-afterwards-best-practice
from open3d.
Yeah. The reference helps a lot!
// Option #7
Eigen::Matrix6d ATA;
Eigen::Vector6d ATb;
ATA.setZero();
ATb.setZero();
three::ScopeTimer timer("Option 7");
{
#ifdef _OPENMP
#pragma omp parallel
{
#endif
Eigen::Matrix6d ATA_private;
Eigen::Vector6d ATb_private;
ATA_private.setZero();
ATb_private.setZero();
#ifdef _OPENMP
#pragma omp for nowait
#endif
for (int i = 0; i < NCORR; i++) {
const Eigen::Vector3d &vs = data[i];
const Eigen::Vector3d &vt = data[i];
const Eigen::Vector3d &nt = data[i];
double r = (vs - vt).dot(nt);
Eigen::Vector6d A_r;
A_r.block<3, 1>(0, 0) = vs.cross(nt);
A_r.block<3, 1>(3, 0) = nt;
ATA_private += A_r * A_r.transpose();
ATb_private += A_r * r;
}
#ifdef _OPENMP
#pragma omp critical
{
#endif
ATA += ATA_private;
ATb += ATb_private;
#ifdef _OPENMP
}
}
#endif
}
The output is
Option 7 took 677.79 ms.
Without OpenMP
Option 7 took 6406.74 ms.
from open3d.
Wow, this is awesome.
Can you add nowait
to the other options and test the performance?
We literally have gained 20x speedup today, lol.
from open3d.
#pragma omp for schedule(static)
-> 671.04 ms
#pragma omp for schedule(dynamic)
-> 9282.47 ms.
It is a bit weird that dynamic
takes much longer. Any other suggestions you would like to test?
from open3d.
dynamic means the scheduler hands out iteration chunks to threads at runtime, as each thread finishes its current chunk. static means all iterations are pre-assigned to threads (in blocks).
dynamic is usually used for very slow, load-imbalanced loops.
Can you test the option nowait
? For the other options, just add nowait to the omp for clause.
from open3d.
#pragma omp for schedule(static) nowait
-> 680.97 ms
#pragma omp for schedule(dynamic) nowait
-> 9056.62 ms.
I think nowait
does not make a big difference.
from open3d.
Very good. Let's go with our fastest option.
from open3d.
Yes. I will modify the return value of ComputeJacobianAndResidual
in #41, because we can just return the compact JTJ
and JTr
instead of the heavy J
and r
from open3d.
I am thinking out loud.
I actually think this optimization is really handy and useful. And we do have many functions that operate in this way, e.g., point-to-plane ICP, odometry, anywhere that needs to compute a Jacobian and residual.
Instead of copying this code around, can we just make it a universal function in Core/Eigen.h? We pass a function that computes A_r and r as a parameter to this universal function, and the universal function takes care of the rest.
This needs a bit of a hack, in particular std::bind
. E.g.,
template<typename MatType, typename VecType>
std::tuple<MatType, VecType> ComputeJacobianAndResidual(std::function<void (int, VecType &, double &)> f, int iteration_num)
{
MatType ATA;
VecType ATb;
ATA.setZero();
ATb.setZero();
#ifdef _OPENMP
#pragma omp parallel
{
#endif
MatType ATA_private;
VecType ATb_private;
ATA_private.setZero();
ATb_private.setZero();
#ifdef _OPENMP
#pragma omp for nowait
#endif
for (int i = 0; i < iteration_num; i++) {
double r;
VecType A_r;
f(i, A_r, r);
ATA_private += A_r * A_r.transpose();
ATb_private += A_r * r;
}
#ifdef _OPENMP
#pragma omp critical
{
#endif
ATA += ATA_private;
ATb += ATb_private;
#ifdef _OPENMP
}
}
#endif
return std::make_tuple(std::move(ATA), std::move(ATb));
}
Then when we need to call this, we use std::bind
:
void ComputeSomething(int i, Eigen::Vector6d &A_r, double &r, const std::vector<double> &data)
{
const Eigen::Vector3d &vs = data[i];
const Eigen::Vector3d &vt = data[i];
const Eigen::Vector3d &nt = data[i];
r = (vs - vt).dot(nt);
A_r.block<3, 1>(0, 0) = vs.cross(nt);
A_r.block<3, 1>(3, 0) = nt;
}
...
auto f = std::bind(ComputeSomething, std::placeholders::_1, std::placeholders::_2, std::placeholders::_3, some_data);
ComputeJacobianAndResidual<Eigen::Matrix6d, Eigen::Vector6d>(f, (int)some_data.size()); // template args cannot be deduced from the bind expression
...
from open3d.
Nice idea! But I found out it gets slower.
Our best option took 654.12 ms.
Function binding based implementation took 2594.94 ms.
I checked that the two implementations output the same result.
I am thinking. Probably it is because of function-call overhead?
from open3d.
Umm, can you:
1. Check the performance of the single-threaded version?
2. Turn on/off different code blocks and see what causes the overhead. For example, you can make f() an empty function (does nothing) and see how the performance changes.
I hope 2 can find out the reason. Or we can spend a bit more time profiling it.
from open3d.
Sure. Here are benchmark results.
Our best option (single thread) took 6432.67 ms.
Function binding (single thread) took 8079.06 ms.
Function binding (single thread, turn off f) took 5854.88 ms.
The time gap between Our best option
and Function binding
gets smaller.
I am profiling the executable.
Data generation : 33.2%
Our best option (single thread) : 29.3%
Function binding (single thread) : 37.6%
In Function binding (single thread)
the two lines takes:
3.9% auto f = std::bind(ComputeSomething, std::placeholders::_1,
std::placeholders::_2, std::placeholders::_3, data);
33.7% std::tie(JTJ, JTb) = ComputeJacobianAndResidualTest<Eigen::Matrix6d,
Eigen::Vector6d>(f, (int)data.size());
In ComputeJacobianAndResidualTest
0.1% for (int i = 0; i < iteration_num; i++) {
double r;
VecType J_r;
13.2% f(i, J_r, r);
13.5% JTJ_private += J_r * J_r.transpose();
3.1% JTb_private += J_r * r;
}
Still, it is not clear what causes the lag in ComputeJacobianAndResidualTest
. I am investigating.
from open3d.
I think the size of the data to be bound significantly affects the overhead.
For example, if I generate 10,000,000 data entries, the binding is two times slower than our best option. If I generate 100,000,000 data entries, the binding is four times slower than our best option.
from open3d.
An idea: make the ComputeSomething function inline. In this case, the compiler should be able to inline the call so that we get almost equivalent performance.
The idea comes from this post:
https://coderwall.com/p/_3vacg/use-std-function-for-all-your-function-passing-needs
from open3d.
Yep, I actually tried inline
already and forgot to report the numbers.
when I use
inline void ComputeSomething(int i, Eigen::Vector6d &A_r, double &r,
const std::vector<Eigen::Vector3d> &data)
it gets
Function binding took 1646.83 ms.
Still three times slower.
from open3d.
This does not solve the issue we have, but here is a tip for Eigen
:
JTJ_private.noalias() += J_r * J_r.transpose();
JTb_private.noalias() += J_r * r;
takes 546.86ms
whereas our previous best setting took 680ms
Reference:
(http://eigen.tuxfamily.org/dox/TopicWritingEfficientProductExpression.html)
from open3d.
I think I made a tiny OpenMP syntax mistake that changes a lot. Check out TestOpenMP.exe --test_bind
, available from #54. There are three ways of making the accumulated JTJ matrix. It basically runs the same for loop, but 1) calls a std::bind
function, 2) calls an inline function, or 3) does the job in the for loop directly.
The timings are
Calling binding function took 463.31 ms.
Calling inline function took 384.19 ms.
Direct operation took 489.42 ms.
They are all better than the previous version, which took 546.86ms
. I am actually surprised by the power of the inline function, as it saves about 100ms
over direct operation. #53 also benefits from the modified OpenMP code.
from open3d.
Wow, this is amazing.
I am also surprised that the inline version is faster than the direct operation version.
Can you double check this? E.g., 10x the number of loop iterations.
And I think it is a good sign that we can use the std::bind version to wrap the OpenMP code in Eigen.h, right?
from open3d.
10x iterations:
Calling binding function took 4290.90 ms.
Calling function directly took 3867.41 ms.
Direct operation took 4835.18 ms.
It shows similar tendency.
And I think it is a good sign that we can use the std::bind version to wrap the OpenMP code in Eigen.h, right?
I agree. Will do another PR.
from open3d.
Well, it is hard to conclude that inline
is always better than direct operation
.
My Linux workstation says:
Calling binding function took 1840.02 ms.
Calling function directly took 1689.70 ms.
Direct operation took 1678.39 ms.
inline
= direct operation
, as we expected. And it is good to know my Linux workstation is much faster than my Windows workstation :)
Another issue is that if we use std::bind
, it adds a constant overhead. The code below
{
three::ScopeTimer timer("Calling std::bind");
auto f = std::bind(ComputeSomething, std::placeholders::_1,
std::placeholders::_2, std::placeholders::_3, data);
}
Says
Calling std::bind took 474.39 ms.
This is not negligible, and gets more obvious if data
gets bigger. Do you think we need to go with std::bind
anyway?
from open3d.
I added another option: use lambda functions. See #56
And yes, it has exactly the same interface as std::bind
but is faster: almost as fast as inline function calls.
I think we are ready to move on.
Let's go with lambda functions.
from open3d.
That's a pretty good suggestion. Working on it.
from open3d.
The function has been implemented and used to accelerate the computation.
See #56 #57 #59
It's great.
Future implementations should follow the examples:
DoSingleIteration()
function in
https://github.com/IntelVCL/Open3D/blob/master/src/Core/Odometry/Odometry.cpp
TransformationEstimationPointToPlane::ComputeTransformation()
function in
https://github.com/IntelVCL/Open3D/blob/master/src/Core/Registration/TransformationEstimation.cpp
(Waiting for #59 to be merged into master)
We tested on three platforms:
- Windows + Visual Studio 2015
- Ubuntu + gcc 5.x
- OSX + gcc 6
The same speedup was achieved on all of them. Our final solution is also the fastest on all platforms.
We can close this issue now.
from open3d.