
Comments (29)

syncle commented on May 3, 2024

Related document: https://eigen.tuxfamily.org/dox/TopicMultiThreading.html. The key is to call Eigen::initParallel(); before creating threads. I am investigating this.

from open3d.

qianyizh commented on May 3, 2024

Another trick you may want to consider:

Give each thread its own independent AtA and Atb.
Then at the end we sum them together.

In this case there is no #pragma omp critical inside the loop.
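The trick above can be sketched with plain std::thread and scalar accumulators (the names and the scalar stand-in are illustrative, not Open3D code; Eigen's AtA/Atb work the same way):

```cpp
#include <numeric>
#include <thread>
#include <vector>

// Each thread accumulates into its own private slot; no critical section
// is needed inside the loop. The final merge happens once, after join().
double ParallelSum(const std::vector<double> &data, int nthreads) {
	std::vector<double> partial(nthreads, 0.0);
	std::vector<std::thread> pool;
	int chunk = static_cast<int>(data.size()) / nthreads;
	for (int t = 0; t < nthreads; ++t) {
		pool.emplace_back([&, t] {
			int begin = t * chunk;
			int end = (t == nthreads - 1) ? static_cast<int>(data.size())
			                              : begin + chunk;
			for (int i = begin; i < end; ++i) {
				partial[t] += data[i];  // private accumulator, no lock
			}
		});
	}
	for (auto &th : pool) th.join();
	return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```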

syncle commented on May 3, 2024

I tried

	// Option #5
	// Directly Building JtJ and Jtr, trick for parallelization
	Eigen::Matrix6d ATA;
	Eigen::Vector6d ATb;
	tstart = time(0);
	std::vector<Eigen::Matrix6d> ATA_array;
	std::vector<Eigen::Vector6d> ATb_array;
	ATA_array.resize(NCORR);
	ATb_array.resize(NCORR);
#ifdef _OPENMP
#pragma omp parallel for schedule(static)
#endif
	for (int i = 0; i < NCORR; i++) {
		// (vs, vt, and nt all read the same array here; this is synthetic benchmark data)
		const Eigen::Vector3d &vs = data[i];
		const Eigen::Vector3d &vt = data[i];
		const Eigen::Vector3d &nt = data[i];
		double r = (vs - vt).dot(nt);
		Eigen::Vector6d A_r;
		A_r.block<3, 1>(0, 0) = vs.cross(nt);
		A_r.block<3, 1>(3, 0) = nt;
		ATA_array[i] = A_r * A_r.transpose();
		ATb_array[i] = A_r * r;
	}
	ATA.setZero();
	ATb.setZero();
	for (int i = 0; i < NCORR; i++) {
		ATA += ATA_array[i];
		ATb += ATb_array[i];
	}
	tend = time(0);
	std::cout << "Option5 took " << difftime(tend, tstart) << " second(s)." << std::endl;

and it gets

Option5 took 20 second(s).

In addition, Eigen::initParallel(); does not change the execution time. Calling this function is not needed if the compiler provides native C++11 support.

qianyizh commented on May 3, 2024

No, I was suggesting two options:

  1. Use #pragma omp reduction:
#pragma omp parallel for schedule(static) reduction(+:ATA, ATb) ...
  2. Create exactly omp_get_max_threads() copies of ATA_array and ATb_array:
#ifdef _OPENMP
	int max_thread_num = omp_get_max_threads();
	std::vector<Eigen::Matrix6d> ATA_array(max_thread_num);
	std::vector<Eigen::Vector6d> ATb_array(max_thread_num);
#else
	std::vector<Eigen::Matrix6d> ATA_array(1);
	std::vector<Eigen::Vector6d> ATb_array(1);
#endif
	for (auto &m : ATA_array) m.setZero();
	for (auto &v : ATb_array) v.setZero();
	int thread_num = 1;
#pragma omp parallel for schedule(static)
	for (int i = 0; i < NCORR; i++) {
#ifdef _OPENMP
		if (i == 0) {
			thread_num = omp_get_num_threads();
		}
		int thread_id = omp_get_thread_num();
#else
		int thread_id = 0;
#endif
		const Eigen::Vector3d &vs = data[i];
		const Eigen::Vector3d &vt = data[i];
		const Eigen::Vector3d &nt = data[i];
		double r = (vs - vt).dot(nt);
		Eigen::Vector6d A_r;
		A_r.block<3, 1>(0, 0) = vs.cross(nt);
		A_r.block<3, 1>(3, 0) = nt;
		ATA_array[thread_id] += A_r * A_r.transpose();	// accumulate, not overwrite
		ATb_array[thread_id] += A_r * r;
	}
	ATA.setZero();
	ATb.setZero();
	for (int i = 0; i < thread_num; i++) {
		ATA += ATA_array[i];
		ATb += ATb_array[i];
	}
	tend = time(0);
	std::cout << "Option5 took " << difftime(tend, tstart) << " second(s)." << std::endl;

qianyizh commented on May 3, 2024

Another comment:

You can use three::ScopeTimer for benchmarking. It is very handy. 👍

syncle commented on May 3, 2024

Option 1:

Severity	Code	Description	Project	File	Line	Suppression State
Error	C3031	'ATb': variable in 'reduction' clause must have scalar arithmetic type	OdometryRGBD	C:\git\Open3D\src\Experimental\OdometryRGBD\OdometryRGBD.cpp	273	

I guess Eigen::MatrixXd cannot be used with reduction?

Option 2:

took 1245.81 ms.

Awesome! But I would also like to see Option 1 working, as it is much more compact.

qianyizh commented on May 3, 2024

See this answer:
https://stackoverflow.com/questions/30496365/parallelize-the-addition-of-a-vector-of-matrices-in-openmp
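For reference, the technique in that answer is OpenMP 4.0's user-declared reduction. A self-contained sketch (Mat6 is a plain stand-in for Eigen::Matrix6d; without OpenMP the pragmas are ignored and the loop runs serially with the same result):

```cpp
// Mat6 is a plain stand-in for Eigen::Matrix6d, just to keep this sketch
// self-contained; the same declare-reduction line works for the Eigen type.
struct Mat6 {
	double v[36] = {0};
	Mat6 &operator+=(const Mat6 &o) {
		for (int k = 0; k < 36; ++k) v[k] += o.v[k];
		return *this;
	}
};

// OpenMP 4.0+ user-declared reduction: tell the runtime how to combine two
// partial results and how to initialize each thread's private copy.
#pragma omp declare reduction(matsum : Mat6 : omp_out += omp_in) \
	initializer(omp_priv = Mat6())

Mat6 AccumulateOuter(int n) {
	Mat6 ATA;
#pragma omp parallel for reduction(matsum : ATA)
	for (int i = 0; i < n; ++i) {
		Mat6 term;
		for (int k = 0; k < 36; ++k) term.v[k] = 1.0;  // stand-in for A_r * A_r^T
		ATA += term;
	}
	return ATA;
}
```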

You can also beautify the code of option 2 a bit. Say, put it inside one big omp parallel block, and get rid of omp_get_max_threads(). See this for example:
https://stackoverflow.com/questions/30943452/openmp-c-parallel-for-loop-with-reduction-afterwards-best-practice

syncle commented on May 3, 2024

Yeah. The reference helps a lot!

	// Option #7
	Eigen::Matrix6d ATA;
	Eigen::Vector6d ATb;
	ATA.setZero();
	ATb.setZero();
	three::ScopeTimer timer("Option 7");
	{
#ifdef _OPENMP
#pragma omp parallel
{
#endif
	Eigen::Matrix6d ATA_private;
	Eigen::Vector6d ATb_private;
	ATA_private.setZero();
	ATb_private.setZero();
#ifdef _OPENMP
	#pragma omp for nowait
#endif
	for (int i = 0; i < NCORR; i++) {
		const Eigen::Vector3d &vs = data[i];
		const Eigen::Vector3d &vt = data[i];
		const Eigen::Vector3d &nt = data[i];
		double r = (vs - vt).dot(nt);
		Eigen::Vector6d A_r;
		A_r.block<3, 1>(0, 0) = vs.cross(nt);
		A_r.block<3, 1>(3, 0) = nt;
		ATA_private += A_r * A_r.transpose();
		ATb_private += A_r * r;
	}
#ifdef _OPENMP
#pragma omp critical
	{
#endif
	ATA += ATA_private;
	ATb += ATb_private;
#ifdef _OPENMP
	}
}
#endif
	}

The output is

Option 7 took 677.79 ms.

Without OpenMP

Option 7 took 6406.74 ms.

qianyizh commented on May 3, 2024

Wow, this is awesome.

Can you add nowait to the other options and test the performance?

We literally have gained 20x speedup today, lol.

syncle commented on May 3, 2024

#pragma omp for schedule(static) -> 671.04 ms
#pragma omp for schedule(dynamic) -> 9282.47 ms.
It is a bit weird that dynamic takes much longer. Any other suggestions you would like to test?

qianyizh commented on May 3, 2024

dynamic means that iterations are handed out to threads at runtime, chunk by chunk, as each thread finishes its previous chunk. static means that all iterations are pre-assigned to threads (in blocks).

dynamic is usually used for very slow and unbalanced loops.

Can you test the option nowait? For the other options, just add nowait to the omp for clause.
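One hedged guess worth testing: dynamic's default chunk size is 1 iteration, so per-chunk scheduling overhead may dominate this cheap loop body; a larger chunk size might recover most of static's speed. A sketch (SumSquares is an illustrative name, not Open3D code):

```cpp
#include <vector>

// schedule(dynamic) defaults to a chunk size of 1, so every iteration pays
// a scheduling round-trip. A larger chunk amortizes that cost while still
// allowing some load balancing.
double SumSquares(const std::vector<double> &data) {
	double total = 0.0;
#ifdef _OPENMP
#pragma omp parallel for schedule(dynamic, 4096) reduction(+ : total)
#endif
	for (int i = 0; i < static_cast<int>(data.size()); ++i) {
		total += data[i] * data[i];
	}
	return total;
}
```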

syncle commented on May 3, 2024

#pragma omp for schedule(static) nowait -> 680.97 ms
#pragma omp for schedule(dynamic) nowait -> 9056.62 ms.
I think nowait does not make a big difference.

qianyizh commented on May 3, 2024

Very good. Let's go with our fastest option.

syncle commented on May 3, 2024

Yes. I will modify the return value of ComputeJacobianAndResidual in #41, because we can just return the compact JTJ and JTr instead of the heavy J and r.

qianyizh commented on May 3, 2024

I am thinking out loud.

I actually think this optimization is really handy and useful. And we do have many functions that operate in this way. E.g., point to plane ICP, odometry, anywhere that needs to compute Jacobian and residual.

Instead of copying this code around, can we just make it a universal function in Core/Eigen.h? We pass a function that computes A_r and r to this universal function, and the universal function takes care of the rest.

This needs a bit of a hack, in particular std::bind. E.g.,

template<typename MatType, typename VecType>
std::tuple<MatType, VecType> ComputeJacobianAndResidual(std::function<void (int, VecType &, double &)> f, int iteration_num)
{
	MatType ATA;
	VecType ATb;
	ATA.setZero();
	ATb.setZero();
#ifdef _OPENMP
#pragma omp parallel
{
#endif
	MatType ATA_private;
	VecType ATb_private;
	ATA_private.setZero();
	ATb_private.setZero();
#ifdef _OPENMP
	#pragma omp for nowait
#endif
	for (int i = 0; i < iteration_num; i++) {
		double r;
		VecType A_r;
		f(i, A_r, r);
		ATA_private += A_r * A_r.transpose();
		ATb_private += A_r * r;
	}
#ifdef _OPENMP
#pragma omp critical
	{
#endif
	ATA += ATA_private;
	ATb += ATb_private;
#ifdef _OPENMP
	}
}
#endif
	return std::make_tuple(std::move(ATA), std::move(ATb));
}

Then when we need to call this, we use std::bind:

void ComputeSomething(int i, Eigen::Vector6d &A_r, double &r, const std::vector<Eigen::Vector3d> &data)
{
	const Eigen::Vector3d &vs = data[i];
	const Eigen::Vector3d &vt = data[i];
	const Eigen::Vector3d &nt = data[i];
	r = (vs - vt).dot(nt);
	A_r.block<3, 1>(0, 0) = vs.cross(nt);
	A_r.block<3, 1>(3, 0) = nt;
}

...
auto f = std::bind(ComputeSomething, std::placeholders::_1, std::placeholders::_2, std::placeholders::_3, some_data);
ComputeJacobianAndResidual<Eigen::Matrix6d, Eigen::Vector6d>(f, (int)some_data.size());
...

syncle commented on May 3, 2024

Nice idea! But I found out it gets slower.

Our best option took 654.12 ms.
Function binding based implementation took 2594.94 ms.

I checked that the two implementations output the same result.
I am still thinking; probably it is function call overhead?

qianyizh commented on May 3, 2024

Umm, can you:

  1. Check the performance of the single-threaded version?

  2. Turn on/off different code blocks and see what causes the overhead.
    For example, you can make f() an empty function (that does nothing) and see what the performance is.

  3. I hope 2 can find out the reason. Otherwise, we can spend a bit more time profiling it.

syncle commented on May 3, 2024

Sure. Here are benchmark results.

Our best option (single thread) took 6432.67 ms.
Function binding (single thread) took 8079.06 ms.
Function binding (single thread, turn off f) took 5854.88 ms.

The time gap between Our best option and Function binding gets smaller.

I am profiling the executable.

Data generation : 33.2%
Our best option (single thread) : 29.3% 
Function binding (single thread) : 37.6%

In Function binding (single thread), the two lines take:

3.9%		auto f = std::bind(ComputeSomething, std::placeholders::_1,
				std::placeholders::_2, std::placeholders::_3, data);
33.7%		std::tie(JTJ, JTb) = ComputeJacobianAndResidualTest<Eigen::Matrix6d, 
				Eigen::Vector6d>(f, (int)data.size());

In ComputeJacobianAndResidualTest

0.1%		for (int i = 0; i < iteration_num; i++) {
			double r;
			VecType J_r;
13.2%		f(i, J_r, r);
13.5%		JTJ_private += J_r * J_r.transpose();
3.1%		JTb_private += J_r * r;
		}

Still, it is not clear what causes the lag in ComputeJacobianAndResidualTest. I am investigating.

syncle commented on May 3, 2024

I think the size of the bound data significantly affects the overhead.
For example, if I generate 10,000,000 data entries, the binding is two times slower than our best option. If I generate 100,000,000 data entries, the binding is four times slower than our best option.
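That would be consistent with std::bind storing its bound arguments by value: the binder copies data when it is constructed. Wrapping the argument in std::cref binds a reference instead, which should remove the size-dependent cost. A sketch with a simplified bound function (SumAt and CallBound are illustrative names):

```cpp
#include <functional>
#include <vector>

// Simplified bound function; in the real code this is ComputeSomething.
double SumAt(int i, const std::vector<double> &data) { return data[i]; }

double CallBound(const std::vector<double> &data) {
	// std::bind stores bound arguments by value, so binding `data` directly
	// would copy the whole vector. std::cref stores a reference instead.
	auto f = std::bind(SumAt, std::placeholders::_1, std::cref(data));
	return f(0) + f(1);
}
```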

qianyizh commented on May 3, 2024

An idea: make the ComputeSomething function inline. In this case, the compiler should expand it in place, so we should get almost equivalent performance.

The idea comes from this post:
https://coderwall.com/p/_3vacg/use-std-function-for-all-your-function-passing-needs

syncle commented on May 3, 2024

Yep, I actually tried inline already and forgot to report the numbers.
When I use

inline void ComputeSomething(int i, Eigen::Vector6d &A_r, double &r, 
		const std::vector<Eigen::Vector3d> &data)

it gets

Function binding took 1646.83 ms.

Still three times slower.

syncle commented on May 3, 2024

This does not solve the issue we have, but here is a tip for Eigen:

JTJ_private.noalias() += J_r * J_r.transpose();
JTb_private.noalias() += J_r * r;

takes 546.86 ms, whereas our previous best setting took 680 ms.
Reference:
http://eigen.tuxfamily.org/dox/TopicWritingEfficientProductExpression.html

syncle commented on May 3, 2024

I think I made a tiny OpenMP syntax mistake that changed a lot. Check out TestOpenMP.exe --test_bind, available from #54. There are three ways of making the accumulated JTJ matrix. It basically runs the same for loop, but 1) calls a std::bind function, 2) calls an inline function, or 3) does the job in the for loop directly.
The timings are

Calling binding function took 463.31 ms.
Calling inline function took 384.19 ms.
Direct operation took 489.42 ms.

They are all better than the previous version that took 546.86 ms. I am actually surprised by the power of the inline function, as it saves 100 ms over direct operation. #53 also benefits from the modified OpenMP code.

qianyizh commented on May 3, 2024

Wow, this is amazing.
I am also surprised that the inline version is faster than the direct operation version.
Can you double-check this? E.g., 10x the number of loop iterations.

And I think it is a good sign that we can use the std::bind version to wrap the OpenMP code in Eigen.h, right?

syncle commented on May 3, 2024

10x iterations:

Calling binding function took 4290.90 ms.
Calling function directly took 3867.41 ms.
Direct operation took 4835.18 ms.

It shows similar tendency.

And I think it is a good sign that we can use the std::bind version to wrap the OpenMP code in Eigen.h, right?

I agree. Will do another PR.

syncle commented on May 3, 2024

Well, it is hard to conclude that inline is always better than direct operation.
My Linux workstation says:

Calling binding function took 1840.02 ms.
Calling function directly took 1689.70 ms.
Direct operation took 1678.39 ms.

inline = direct operation, as we expected. And it is good to know my Linux workstation is much faster than my Windows workstation :)

Another issue is that if we use std::bind, it adds a constant overhead. The code below

{
	three::ScopeTimer timer("Calling std::bind");
	auto f = std::bind(ComputeSomething, std::placeholders::_1,
			std::placeholders::_2, std::placeholders::_3, data);
}

says

Calling std::bind took 474.39 ms.

This is not negligible, and it gets more obvious as the data gets bigger. Do you think we should go with std::bind anyway?

qianyizh commented on May 3, 2024

I added another option: use lambda functions. See #56.
And yes, it has exactly the same interface as std::bind but is faster, almost as fast as inline function calls.

I think we are ready to move on.
Let's go with lambda functions.
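For the record, a minimal sketch of the lambda shape (Accumulate and Run are illustrative names, not the #56 code): capturing data by reference avoids the copy, and passing the callable as a template parameter lets the compiler inline it.

```cpp
#include <vector>

// Taking the callable as a template parameter (instead of std::function)
// lets the compiler see through the call and inline the lambda body.
template <typename F>
double Accumulate(F f, int n) {
	double total = 0.0;
	for (int i = 0; i < n; ++i) total += f(i);
	return total;
}

double Run(const std::vector<double> &data) {
	// By-reference capture: same call interface as the bind version, no copy.
	auto f = [&data](int i) { return data[i]; };
	return Accumulate(f, static_cast<int>(data.size()));
}
```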

syncle commented on May 3, 2024

That's a pretty good suggestion. Working on it.

qianyizh commented on May 3, 2024

The function has been implemented and used to accelerate the computation.
See #56 #57 #59
It's great.

Future implementations should follow the examples:
DoSingleIteration() function in
https://github.com/IntelVCL/Open3D/blob/master/src/Core/Odometry/Odometry.cpp
TransformationEstimationPointToPlane::ComputeTransformation() function in
https://github.com/IntelVCL/Open3D/blob/master/src/Core/Registration/TransformationEstimation.cpp
(Waiting for #59 to be merged into master)

We tested on three platforms:

  1. Windows + Visual Studio 2015
  2. Ubuntu + gcc 5.x
  3. OSX + gcc 6

The same speedup has been achieved on all three. Our final solution is also the fastest on all platforms.

We can close this issue now.
