The similarity-rating from ekinocal

Which language did we choose to implement the project?

The programming language we have chosen is C++17.

How does the program work?

Defining stop words.

    std::set<std::string> stop_words{
        "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your",
        "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her",
        "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs",
        "themselves", "what", "which", "who", "whom", "this", // ..
    };

    std::string get_destination_path(const std::string& output_text, file_compare func)
    {
        std::string path;
        // Keep asking until the path exists and passes the supplied check.
        while (true) {
            std::cout << output_text;
            std::getline(std::cin, path);
            if (std::filesystem::exists(path) && func(path))
                break;
            std::cout << "Entered a wrong path, please enter the path again!\n";
        }
        return path;
    }

The get_destination_path function asks for the path of the documents to be compared for content similarity, and repeats the prompt until the path exists and passes the supplied check.

    const std::string plagiarism_file_path{
        get_destination_path("Enter path of plagiarism file : ", std::filesystem::is_regular_file) };
    std::ifstream plagiarism_file{ plagiarism_file_path };
    if (!plagiarism_file.is_open()) {
        std::cout << "Please enter a valid plagiarism file!\n";
        std::exit(1);
    }

The plagiarism file is opened, and the program exits if the file cannot be opened.

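    std::vector<std::string> target_lines;
    std::string line;
    // Read the file one sentence at a time, using '.' as the delimiter.
    while (std::getline(plagiarism_file, line, '.')) {
        std::string result;
        std::string transform_result;
        // Convert every character to lowercase.
        std::transform(line.begin(), line.end(), std::back_inserter(transform_result),
                       [](unsigned char val) { return static_cast<char>(std::tolower(val)); });
        // Collapse runs of consecutive whitespace into a single character.
        std::unique_copy(transform_result.begin(), transform_result.end(), std::back_inserter(result),
                         [](unsigned char val1, unsigned char val2) {
                             return std::isspace(val1) && std::isspace(val2);
                         });
        // Remove punctuation.
        result.erase(std::remove_if(result.begin(), result.end(),
                                    [](unsigned char val) { return std::ispunct(val) != 0; }),
                     result.end());
        target_lines.emplace_back(result);
    }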

Each character of the files we are comparing is inspected: the text is converted to lowercase, runs of consecutive whitespace characters are collapsed into one, and punctuation is deleted.
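The normalization steps can be packaged as a stand-alone function. This is a minimal sketch, not the project's code: `normalize_sentence` is a hypothetical name, and the lambdas take `unsigned char` on the assumption that the input may contain bytes outside the ASCII range.

```cpp
#include <algorithm>
#include <cctype>
#include <iterator>
#include <string>

// Hypothetical helper (not part of the original project): lowercases a
// sentence, collapses runs of whitespace, then strips punctuation.
std::string normalize_sentence(const std::string& line)
{
    std::string lowered;
    std::transform(line.begin(), line.end(), std::back_inserter(lowered),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    // Two adjacent whitespace characters count as duplicates; only the first is kept.
    std::string result;
    std::unique_copy(lowered.begin(), lowered.end(), std::back_inserter(result),
                     [](unsigned char a, unsigned char b) {
                         return std::isspace(a) && std::isspace(b);
                     });

    // Erase-remove idiom: drop every punctuation character.
    result.erase(std::remove_if(result.begin(), result.end(),
                                [](unsigned char c) { return std::ispunct(c) != 0; }),
                 result.end());
    return result;
}
```

For example, `normalize_sentence("Hello,   World!")` yields `hello world`. Because punctuation is removed after whitespace is collapsed, an input such as `a , b` still ends up with two adjacent spaces; swapping the last two steps would avoid that.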

    // Returns the percentage of words shared by the two sets,
    // relative to the size of the larger set.
    double accuracy(const std::set<std::string>& cmp1, const std::set<std::string>& cmp2)
    {
        int accuracy = 0;
        for (const auto& item1 : cmp1) {
            for (const auto& item2 : cmp2) {
                if (item1 == item2) {
                    ++accuracy;
                    break;
                }
            }
        }
        return (static_cast<double>(accuracy) / std::max(cmp1.size(), cmp2.size())) * 100.;
    }
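To make the rate concrete, the sketch below builds two word sets and applies the same formula: shared words divided by the size of the larger set, times 100. `tokenize_sketch` and `similarity_percent` are hypothetical names. The project's own tokenizer is not shown, so the whitespace splitting and the stop-word filtering here are assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <sstream>
#include <string>

// Hypothetical tokenizer: splits a normalized sentence on whitespace
// and drops every word found in stop_words.
std::set<std::string> tokenize_sketch(const std::string& sentence,
                                      const std::set<std::string>& stop_words)
{
    std::set<std::string> words;
    std::istringstream stream{ sentence };
    std::string word;
    while (stream >> word) {
        if (stop_words.count(word) == 0)
            words.insert(word);
    }
    return words;
}

// Shared words / size of the larger set * 100.
double similarity_percent(const std::set<std::string>& a, const std::set<std::string>& b)
{
    if (a.empty() && b.empty())
        return 0.0; // assumption: two empty sets are rated 0% instead of dividing by zero
    std::size_t shared = 0;
    for (const auto& word : a)
        shared += b.count(word); // count() is 0 or 1 for std::set
    return static_cast<double>(shared) / std::max(a.size(), b.size()) * 100.0;
}
```

With the stop words `the`, `are` and `and`, the sentence `the documents are legal and drafted` becomes the set {documents, drafted, legal}, and `legal writing documents courses` becomes a set of four words. Two words are shared and the larger set has four, so the rate is 50%.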

The documents in the document set are transformed in the same way.

    std::vector<std::vector<std::string>> source_documents;
    for (auto& item : document_files) {
        // One vector of normalized sentences per document.
        source_documents.emplace_back(std::vector<std::string>{});
        while (std::getline(item, line, '.')) {
            std::string result;
            std::string transform_result;
            // Convert every character to lowercase.
            std::transform(line.begin(), line.end(), std::back_inserter(transform_result),
                           [](unsigned char val) { return static_cast<char>(std::tolower(val)); });
            // Collapse runs of consecutive whitespace into a single character.
            std::unique_copy(transform_result.begin(), transform_result.end(), std::back_inserter(result),
                             [](unsigned char val1, unsigned char val2) {
                                 return std::isspace(val1) && std::isspace(val2);
                             });
            // Remove punctuation.
            result.erase(std::remove_if(result.begin(), result.end(),
                                        [](unsigned char val) { return std::ispunct(val) != 0; }),
                         result.end());
            source_documents.back().emplace_back(result);
        }
    }

This part of the code prints the similarity rate of each document after the files have been compared, followed by the most similar sentences. Standard library algorithms are used to sort and access the items.

    int cnt{};
    for (const auto& item : most_similar_sentences_in_documents) {
        if (cnt == 0) {
            std::cout << "--------------------------------------------------";
        }
        std::cout << "\n";
        std::cout << "Document #" << cnt++ << " Similarity Rate => " << item.first << "%\n";
        // Copy the sentence/rate pairs into a vector so they can be sorted by rate.
        std::vector<std::pair<std::string, double>> vec;
        for (const auto& map_item : item.second) {
            vec.emplace_back(map_item.first, map_item.second);
        }
        std::sort(vec.begin(), vec.end(),
                  [](const auto& val, const auto& val2) { return val.second > val2.second; });
        // Print at most the five highest-rated sentences.
        std::for_each_n(vec.begin(), std::min<std::size_t>(5, vec.size()), [](const auto& val) {
            std::cout << "-" << val.first << " => [RATE] : " << val.second << "%\n";
        });
        std::cout << "\n";
        std::cout << "--------------------------------------------------";
        std::cout << "\n";
    }
    } // closes the enclosing function, whose opening is not shown here
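A short document may contain fewer than five sentences, so the selection step benefits from clamping the count to the number of available items. The following is a minimal sketch of that idea using `std::partial_sort`, which only orders the leading elements. `top_rated` is a hypothetical helper, and the `std::map<std::string, double>` parameter is an assumption about how the sentence rates are stored.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns up to `limit` (sentence, rate) pairs,
// ordered from the highest rate to the lowest.
std::vector<std::pair<std::string, double>>
top_rated(const std::map<std::string, double>& rates, std::size_t limit)
{
    std::vector<std::pair<std::string, double>> vec(rates.begin(), rates.end());
    // Never ask for more elements than the vector holds.
    const std::size_t count = std::min(limit, vec.size());
    const auto middle = vec.begin() + static_cast<std::ptrdiff_t>(count);
    // Only the first `count` positions are brought into sorted order.
    std::partial_sort(vec.begin(), middle, vec.end(),
                      [](const auto& a, const auto& b) { return a.second > b.second; });
    vec.erase(middle, vec.end());
    return vec;
}
```

Calling `top_rated(rates, 5)` on a map with three entries returns all three in descending order of rate, without reading past the end of the container.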

This part measures the execution time of the program.

    auto stop_timer = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(stop_timer - start_timer);
    std::cout << std::endl;
    std::cout << "Finished within " << duration.count() << " milliseconds " << std::endl;

Which libraries did we use?

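The project includes twelve standard library headers, one of them marked //C++17; the header names are missing from this listing. The code excerpts above need at least the following:

    #include <algorithm>   // std::transform, std::unique_copy, std::remove_if, std::sort, std::for_each_n, std::max
    #include <cctype>      // std::tolower, std::isspace, std::ispunct
    #include <chrono>      // std::chrono::high_resolution_clock
    #include <cstdlib>     // std::exit
    #include <filesystem>  // C++17
    #include <fstream>     // std::ifstream
    #include <iostream>    // std::cout, std::cin
    #include <iterator>    // std::back_inserter
    #include <set>         // std::set
    #include <string>      // std::string, std::getline
    #include <utility>     // std::pair
    #include <vector>      // std::vector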

What is the Big-Oh Complexity?

Overall it is O(n^2). In detail, the Big-Oh of each function is:

get_an_exist_path: O(1)
tokenize: O(n)
accuracy: O(n^2)
main: O(n^2 * m)
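The quadratic cost of accuracy comes from comparing every word of one set against every word of the other. Since `std::set` iterates in sorted order, the shared words can also be counted in a single merge-style pass, so the number of word comparisons grows linearly with the combined size of the two sets. This is an alternative sketch rather than the project's implementation; `accuracy_linear` is a hypothetical name, and returning 0 for an empty set is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical alternative to accuracy(): walks both sorted sets in
// lockstep and counts the words that appear in both.
double accuracy_linear(const std::set<std::string>& cmp1, const std::set<std::string>& cmp2)
{
    if (cmp1.empty() || cmp2.empty())
        return 0.0; // assumption: nothing to compare means 0%

    std::size_t matches = 0;
    auto it1 = cmp1.begin();
    auto it2 = cmp2.begin();
    while (it1 != cmp1.end() && it2 != cmp2.end()) {
        if (*it1 < *it2) {
            ++it1;          // word only in cmp1 so far
        } else if (*it2 < *it1) {
            ++it2;          // word only in cmp2 so far
        } else {
            ++matches;      // word present in both sets
            ++it1;
            ++it2;
        }
    }
    return static_cast<double>(matches) / std::max(cmp1.size(), cmp2.size()) * 100.0;
}
```

For the sets {a, b, c} and {b, c, d, e} the function counts two shared words out of a larger set of four and returns 50, the same value the nested-loop version produces.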

What is the average execution time for the project?

50 milliseconds.

Which methods did we use?

We used the logarithmic, binary-search-style lookup provided by std::map, the std::vector container, and exact string matching to compare words.

Output:

C:\Users\axsd\CLionProjects\untitled1\cmake-build-debug\untitled1.exe
Enter a set of documents path :C:\Users\axsd\CLionProjects\untitled1\document-set
Enter plagiarism file path :C:\Users\axsd\CLionProjects\untitled1\english_doc.txt
Finished within 50 milliseconds
--------------------------------------------------

Document #0 Similarity Rate => 100%

at the same time business correspondence syntactical pattern style is drawn to reach agreement between them => [RATE] : 100%
considering that rights and duties must be validated by means of documents => [RATE] : 100%
english documents are characterized from both legal and linguistic points of view => [RATE] : 100%
in view of this judicial documents should be drafted correctly => [RATE] : 100%
now that we organized special courses teaching legal writing in this work the authors pay attention to business documents such as contracts legal correspondence claims etc => [RATE] : 100%

Document #1 Similarity Rate => 3.25203%

considering that rights and duties must be validated by means of documents => [RATE] : 18.1818%
english documents are characterized from both legal and linguistic points of view => [RATE] : 18.1818%
in view of this judicial documents should be drafted correctly => [RATE] : 18.1818%
the work is aimed at analyzing linguistic peculiarities of drafting legal documents => [RATE] : 18.1818%
we truly believe that teaching documents writing now at university is a very important issue if we want to bring up comprehendible generation of lawyers => [RATE] : 14.2857%

Document #2 Similarity Rate => 4.06504%

in view of this judicial documents should be drafted correctly => [RATE] : 28.5714%
considering that rights and duties must be validated by means of documents => [RATE] : 25%
english documents are characterized from both legal and linguistic points of view => [RATE] : 25%
the work is aimed at analyzing linguistic peculiarities of drafting legal documents => [RATE] : 22.2222%
some years ago we were not teaching legal writing => [RATE] : 16.6667%
--------------------------------------------------
Process finished with exit code 0
