
project4-cuda-icp's Introduction

CUDA Iterative Closest Point

University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 4

  • Author: Chhavi Sharma (LinkedIn)

  • Tested on: Windows 10, Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz, 16GB, NVIDIA Quadro P1000 4GB (MOORE100B-06)

  • Installation Instructions same as this project

  • Execution Instructions: run main.cpp with the arguments ./../data/bun045.ply ./../data/bun000.ply (there are unresolved issues in Debug mode; please build and run in Release mode in Visual Studio 2015/2017)


Introduction

In this project, we show different optimizations of the iterative closest point (ICP) algorithm, which can be used to align partially overlapping point clouds of different views of an object. The operations ICP performs on large point clouds are highly parallelizable, which makes it a good candidate for CUDA-based implementation and optimization.

The iterative closest point algorithm successively estimates and applies a rotation and translation between two point clouds of different views of an object to achieve the closest alignment. The algorithm iteratively revises the transformation needed to minimize the distance between corresponding points across the two point clouds. ICP depends on an initial guess of the rigid-body transformation (rotation and translation) to achieve good results when the views of the object differ drastically.

Algorithm

Given:

  • P : M Source points
  • X : N Target points

At each iteration:

  • For each point in the source P, find the closest corresponding point in the target X based on some metric. We use minimum Euclidean distance to assign the closest point.
  • Now, for the set of correspondences, we estimate the rotation and translation between them by solving the orthogonal Procrustes problem. The problem can be formulated as finding the rotation and translation that minimise the average distance between the source and the target points. This optimization can be solved as a least-squares problem, with the solution given by the SVD of the matrix product of the mean-centred source and target. More precisely:

  • We mean-centre the source and target correspondences, then compute the matrix W = transpose(Xmeancentred) * Pmeancentred. With the singular value decomposition SVD(W) = U S transpose(V), the rotation is R = U * transpose(V), and the translation is T = Xmean - R * Pmean. A sketch of this step is shown after this list.
  • Repeat until convergence, i.e. when the predicted rotation matrix is close to identity and the translation is close to zero.
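A minimal sketch of this estimation step is shown below. It assumes the Eigen library for the SVD and uses illustrative names (estimateRigidTransform, src, tgt are not identifiers from this project); the determinant check is the usual guard against obtaining a reflection instead of a rotation.

#include <Eigen/Dense>
#include <vector>

// Estimate R, T from matched pairs: src[i] (from P) corresponds to tgt[i] (from X).
void estimateRigidTransform(const std::vector<Eigen::Vector3d> &src,
                            const std::vector<Eigen::Vector3d> &tgt,
                            Eigen::Matrix3d &R, Eigen::Vector3d &T) {
    const double n = static_cast<double>(src.size());

    // Mean-centre both sets of correspondences.
    Eigen::Vector3d srcMean = Eigen::Vector3d::Zero(), tgtMean = Eigen::Vector3d::Zero();
    for (size_t i = 0; i < src.size(); ++i) { srcMean += src[i]; tgtMean += tgt[i]; }
    srcMean /= n;
    tgtMean /= n;

    // W = transpose(Xmeancentred) * Pmeancentred, accumulated pair by pair.
    Eigen::Matrix3d W = Eigen::Matrix3d::Zero();
    for (size_t i = 0; i < src.size(); ++i)
        W += (tgt[i] - tgtMean) * (src[i] - srcMean).transpose();

    // SVD(W) = U S V^T, R = U * V^T, T = Xmean - R * Pmean.
    Eigen::JacobiSVD<Eigen::Matrix3d> svd(W, Eigen::ComputeFullU | Eigen::ComputeFullV);
    Eigen::Matrix3d V = svd.matrixV();
    if ((svd.matrixU() * V.transpose()).determinant() < 0.0)
        V.col(2) *= -1.0;  // guard against a reflection instead of a rotation
    R = svd.matrixU() * V.transpose();
    T = tgtMean - R * srcMean;
}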

Implementation Details

Three variations of ICP have been implemented:

  • CPU Iterative Closest Point
  • GPU Iterative Closest Point with Naive Search
  • GPU Iterative Closest Point with KDTree Search

CPU | GPU Naive | GPU KDTree

CPU Iterative Closest Point

The CPU implementation uses the steps of the algorithm above to iteratively apply a rotation and translation to the source data. The correspondence computation requires an O(N) search over the target for each element in the source point cloud. This is done naively on the CPU, where each source point looks through the entire target set to find its closest correspondence. A brute-force sketch of this search is shown below.
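For illustration, a brute-force correspondence search of this kind might look like the following sketch; the function and variable names are illustrative only, while glm::vec3 buffers match how points are stored elsewhere in this repository.

#include <glm/glm.hpp>
#include <vector>
#include <cfloat>

// For every source point, scan the whole target set and record the index
// of its closest target point: O(M*N) work in total.
void findCorrespondencesCPU(const std::vector<glm::vec3> &source,
                            const std::vector<glm::vec3> &target,
                            std::vector<int> &matches) {
    matches.resize(source.size());
    for (size_t i = 0; i < source.size(); ++i) {
        float bestDist = FLT_MAX;
        int bestIdx = 0;
        for (size_t j = 0; j < target.size(); ++j) {
            float d = glm::distance(source[i], target[j]);
            if (d < bestDist) { bestDist = d; bestIdx = static_cast<int>(j); }
        }
        matches[i] = bestIdx;
    }
}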

GPU Iterative Closest Point with Naive Search

The CPU implementation is optimised by using a CUDA kernel to perform the correspondence search. Each element in the source point cloud now finds its correspondence in the target point cloud in parallel. Even though this approach is much faster than the CPU version, each point still scans the entire target data set to pick the closest point.
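A hedged sketch of such a kernel is given below: one thread per source point, each thread scanning the whole target buffer. The kernel name, launch configuration, and buffer names are illustrative, not the project's actual ones.

#include <glm/glm.hpp>
#include <cfloat>

// One thread per source point; each thread does a full O(N) scan of the target.
__global__ void kernFindCorrespondence(int M, int N,
                                       const glm::vec3 *source,
                                       const glm::vec3 *target,
                                       int *matches) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= M) return;

    float bestDist = FLT_MAX;
    int bestIdx = 0;
    for (int j = 0; j < N; ++j) {
        glm::vec3 d = source[i] - target[j];
        float dist = d.x * d.x + d.y * d.y + d.z * d.z;  // squared distance is enough for the argmin
        if (dist < bestDist) { bestDist = dist; bestIdx = j; }
    }
    matches[i] = bestIdx;
}

// Example launch (illustrative block size):
// kernFindCorrespondence<<<(M + 127) / 128, 128>>>(M, N, dev_source, dev_target, dev_matches);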

GPU Iterative Closest Point with KDTree Search

We further optimize each iteration by speeding up the search per source point. We implement a KD-tree structure to search the target 3D point cloud data. A KD-tree is a binary tree in which every node is a k-dimensional point. Every non-leaf node can be thought of as implicitly generating a splitting hyperplane that divides the space into two parts, known as half-spaces. Points to the left of this hyperplane are represented by the left subtree of that node, and points to the right of the hyperplane are represented by the right subtree.

The average search time on a KD tree for target data of size n is O(log(n)).

The KD-tree is constructed on the CPU and stored in a contiguous, linear, level-order-traversal format. It is then transferred to the GPU, where the search traversal is iterative rather than recursive. CUDA does not support very deep recursion, so an iterative traversal technique is used to perform nearest-neighbour search on the KD-tree. To facilitate iterative traversal and backtracking over the tree, a book-keeping array is also maintained. A sketch of this iterative nearest-neighbour search is given below.
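The following is a hedged, CPU-side sketch of such an iterative nearest-neighbour search over a KD-tree stored in level order (children of node i at 2*i+1 and 2*i+2), using a small explicit stack in place of recursion. The node layout (a glm::vec4 whose w component marks a valid node) and all names are assumptions made for illustration; the project's GPU version instead keeps its backtracking state in the book-keeping array mentioned above.

#include <glm/glm.hpp>
#include <cfloat>
#include <cmath>

// Iterative nearest-neighbour query over a level-order KD-tree.
// tree[i].xyz is the point at node i, tree[i].w > 0 marks a valid node,
// and the children of node i live at 2*i + 1 and 2*i + 2.
int nearestNeighbourKD(const glm::vec4 *tree, int treeSize, glm::vec3 query) {
    struct Frame { int node; int depth; };
    Frame stack[64];          // explicit stack instead of recursion
    int top = 0;
    stack[top++] = { 0, 0 };  // start at the root

    float bestDist = FLT_MAX;
    int bestIdx = -1;

    while (top > 0) {
        Frame f = stack[--top];
        if (f.node >= treeSize || tree[f.node].w <= 0.0f) continue;  // empty slot

        glm::vec3 p(tree[f.node]);
        float d = glm::distance(query, p);
        if (d < bestDist) { bestDist = d; bestIdx = f.node; }

        int axis = f.depth % 3;                 // splitting axis cycles x, y, z with depth
        float diff = query[axis] - p[axis];
        int nearChild = 2 * f.node + (diff < 0.0f ? 1 : 2);
        int farChild  = 2 * f.node + (diff < 0.0f ? 2 : 1);

        // Visit the far half-space only if the splitting plane could still hide a
        // closer point; the near half-space is pushed last, so it is popped first.
        if (std::fabs(diff) < bestDist && top < 64) stack[top++] = { farChild, f.depth + 1 };
        if (top < 64)                               stack[top++] = { nearChild, f.depth + 1 };
    }
    return bestIdx;
}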

Search Improvement (Runtime)

CPU -> GPU Naive -> GPU KDTree

O(M*N) total -> O(N) per source point, in parallel over M -> O(log(N)) average per source point, in parallel over M

Iterative Rendering

The point cloud data is also rendered iteratively to show the changes made by the application of each rotation and translation predicted by the algorithm. This shows the source object slowly moving towards the target as a rigid body.

Analysis

The time taken per iteration for the above three cases, on point cloud data of partially overlapping views of the Stanford Bunny, is plotted below:

The data does not have perfect correspondences, and therefore the aligned point clouds never completely overlap. In most real applications of ICP the data has only partial overlap, so this is a representative example set.

  • The GPU Naive version performs better than the CPU version because each element in the source looks for its nearest neighbour in the target in parallel on CUDA.

  • In KD-tree search, the time taken per iteration decreases as the point cloud aligns better with the target. In the initial iterations the KD-tree version is slower, since the source and target points are highly misaligned and each search call traverses most of the tree (close to the worst case) to find the best neighbour. As the source becomes better aligned with the target, correspondences are found earlier in the tree traversal and the runtime drops.

  • However, the best time for GPU Naive is better than for GPU KD-Tree on the current dataset. Even though the GPU KD-tree should in theory be faster (O(log N) per query), the memory overheads of traversing the tree in the current implementation dominate the runtime when the number of points is not very large. The tree traversal also involves non-contiguous memory accesses, which cause more fetches from global memory than the naive implementation, where the search runs over contiguous memory and is hence faster. Therefore, on this dataset the KD-tree traversal converges at a higher runtime than the naive approach. The naive search touches more data, but contiguously, whereas the KD-tree search jumps between nodes, reading non-contiguous data, and thus takes more time even though it performs fewer comparisons.

Bloopers

In the left image, an error in the rotation computation deformed the point cloud and ICP never converged. In the right image, a drastic difference between the point clouds causes ICP to misalign the data.

Resources and References

icp.pdf
wiki/K-d_tree
k-d-trees
cmu.kdtrees.pdf
cmu.kdrangenn
gfg.k-dimensional-tree
GTC-2010/pdfs/2140_GTC2010.pdf
s10766-018-0571-0
nghiaho.com


project4-cuda-icp's Issues

More detailed comments please?

Hi, could you please provide more detailed comments, especially for the core code?
That would be a great help for the implementation, I believe...

ubuntu

/**
* @file      main.cpp
* @brief     CUDA ICP entry point (adapted from the CIS 565 Points flocking simulation starter code)
* @authors   Liam Boone, Kai Ninomiya, Kangning (Gary) Li
* @date      2013-2017
* @copyright University of Pennsylvania
*/

#include "main.hpp"
#include <time.h>  

// ================
// Configuration
// ================

//toggles 
#define GPUNAIVE 1
#define GPUKDTREE 0
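// GPUKDTREE takes precedence over GPUNAIVE (see the #if/#elif chains in init() and runCUDA());
// with both toggles set to 0 the CPU implementation is used.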

#define dims 3

// LOOK-1.2 - change this to adjust particle count in the simulation
int N_FOR_VIS = 5000; // number of points // reset after reading file
float DT = 0.2f;

std::vector<glm::vec3> Ybuffer;
std::vector<glm::vec3> Xbuffer;

#if GPUKDTREE
glm::vec4 *YbufferTree;
glm::ivec3 *track;
int treesize;
int tracksize;
# endif


// Data read function 
void read_data(std::vector<glm::vec3> &buffer, std::string filename, float offset, bool flag) {

	std::ifstream filein(filename);
	int count = 0;
	int vertices_cnt;
	std::vector<std::string> header_info;

	// the vertex count is read from a fixed header line of these .ply files (count == 17 below)
	for (std::string line; std::getline(filein, line); )
	{	// get number of vertices
		if (count == 17) {
			std::istringstream iss(line);
			for (std::string s; iss >> s; )
				header_info.push_back(s);
			vertices_cnt = std::stoi(header_info[2]);
			break;
		}
		count++;
	}
	std::cout << "vertex count :" << vertices_cnt << std::endl;
	filein.clear();
	filein.seekg(0, std::ios::beg);
	count = 0;

	//vertices are stored from line 24 onwards until vcount
	for (std::string line; std::getline(filein, line); )
	{  //24
		if (count >= 24 + vertices_cnt) break;  // stop after the last vertex line

		if (count >= 24) {
			std::istringstream is(line);

			float x = 0.0f, y = 0.0f, z = 0.0f;
			is >> x >> y >> z;

			if (flag) {
				//glm::vec3 pt(x+10 , y+5, z+5 );
				//glm::vec3 moved_point = pt;//* glm::mat3( 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0);
				glm::vec3 tmp(x + offset, y + offset, z + offset);
				//std::cout<<x<<" "<<y<<" "<<z<<std::endl;
				buffer.push_back(tmp);
				//buffer.push_back(moved_point);
			}
			else {

				glm::vec3 tmp(x, y, z);
				buffer.push_back(tmp);
			}


			//int idx = (count - 24)*dims;
			//is >> buffer[idx] >> buffer[idx + 1] >> buffer[idx + 2];
			//cout << buffer[idx] << " " << buffer[idx + 1] << " " << buffer[idx + 2] << endl;
		}
		count++;
	}
	std::cout << "data load completed :" << (count - 24 - 1) << std::endl;
	return;
}

/*
*C main function.
*/
int main(int argc, char* argv[]) {
	const char *projectName;
	projectName = "Project4 PICP";

	// Load data into cpu buffers
	printf("** Read Point Cloud Data **\n");

	std::cout << "Data File Y(target): " << argv[1] << std::endl;
	read_data(Ybuffer, argv[1], 0.0f, false);

	std::cout << "Data File X(source): " << argv[2] << std::endl;
	read_data(Xbuffer, argv[2], 0.0f, false);

	// Initialize drawing state
	N_FOR_VIS = Ybuffer.size() + Xbuffer.size();

	std::cout << Ybuffer[0].x << " " << Ybuffer[0].y << " " << Ybuffer[0].z << std::endl;
	std::cout << Xbuffer[0].x << " " << Xbuffer[0].y << " " << Xbuffer[0].z << std::endl;
	std::cout << "total points = " << N_FOR_VIS << std::endl;

#if GPUKDTREE
	std::cout << "Building KDsearch tree for Y" << std::endl;

	//std::vector<glm::vec3> ybuff ={ // test data
	//						glm::vec3(1,7,5),
	//						glm::vec3(2,6,6),
	//						glm::vec3(3,5,7),
	//						glm::vec3(4,4,1),
	//						glm::vec3(5,3,2),
	//						glm::vec3(6,2,3),
	//						glm::vec3(7,1,4)
	//						};

	int size = KDTree::nextPowerOf2(2 * (Ybuffer.size() + 1)); // to store nulls for leaf nodes
	std::vector<glm::vec4> YbuffTree(size, glm::vec4(0.0f));

	// Init mystack
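	// One glm::ivec3 slot per tree level (sz = log2(size)) for each of the X source points:
	// this is the book-keeping array used for backtracking during iterative KD-tree traversal on the GPU.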
	int sz = (int)log2(size);
	int X = (int)Xbuffer.size();
	std::vector<glm::ivec3> tk(X*sz, glm::ivec3(0, 0, 0));
	track = &tk[0];

	KDTree::initCpuKDTree(Ybuffer, YbuffTree);
	YbufferTree = &YbuffTree[0];

	tracksize = X * sz;
	treesize = size;

# endif

	if (init(argc, argv)) {
		mainLoop();
		Points::endSimulation();
		return 0;
	}
	else {
		return 1;
	}


	return 0;
}

//-------------------------------
//---------RUNTIME STUFF---------
//-------------------------------

std::string deviceName;

/**
* Initialization of CUDA and GLFW.
*/
bool init(int argc, char **argv) {

	// Set window title to "Student Name: [SM 2.0] GPU Name"

	cudaDeviceProp deviceProp;
	int gpuDevice = 1;
	int device_count = 0;

	cudaGetDeviceCount(&device_count);

	if (gpuDevice >= device_count) {  // device indices are 0-based
		std::cout
			<< "Error: GPU device number is greater than the number of devices!"
			<< " Perhaps a CUDA-capable GPU is not installed?"
			<< std::endl;
		return false;
	}
	cudaGetDeviceProperties(&deviceProp, gpuDevice);
	//int major = deviceProp.major;
	//int minor = deviceProp.minor;

	//std::ostringstream ss;
	//ss << projectName << " [SM " << major << "." << minor << " " << deviceProp.name << "]";
	//deviceName = ss.str();

	Points::initCpuICP(Ybuffer, Xbuffer);
#if GPUKDTREE
	Points::initGPUKD(Ybuffer, Xbuffer, YbufferTree, track, treesize, tracksize);
#elif GPUNAIVE
	Points::initGPU(Ybuffer, Xbuffer);
#endif
	return true;
}

//====================================
// Main loop
//====================================

void runCUDA() {

#if GPUKDTREE
	Points::stepSimulationGPUKD(Ybuffer, Xbuffer, treesize, tracksize, DT);
#elif GPUNAIVE
	Points::stepSimulationGPUNaive(Ybuffer, Xbuffer, DT);
#else
	Points::stepSimulationICPNaive(Ybuffer, Xbuffer, DT);
#endif

}

void mainLoop() {
	Points::unitTest(); // LOOK-1.2 We run some basic example code to make sure
						// your CUDA development setup is ready to go.

	double start, end, cost;
	start = clock();
	for (size_t i = 0; i < 50; i++)
	{
		runCUDA();
	}
	end = clock();
	cost = end - start;
	std::cout << "s:" << cost/ CLOCKS_PER_SEC << std::endl;
}
