The covid19articles from jyanqa

covid19articles's Introduction

Advanced Clustering Analysis

An application of different clustering methods to cluster 1000 academic articles

Keywords

GMM, K-Means, Model-based Method, NLP, TF–IDF

Abstract

The main objective of this project is to cluster academic articles into distinct groups. The data set is CORD-19, a resource of over 300,000 scholarly articles about COVID-19 from the White House and a coalition of research groups. A subset of 1000 articles is split from the original data for use in this project.

Each article in the data set is parsed, cleaned and vectorized. Principal Component Analysis (PCA) is then applied to the vectorized data to reduce its dimensionality. Next, K-Means and a Gaussian Mixture Model (GMM) are applied to cluster the projected data set. Finally, we find the representative keywords of each cluster and visualize them in a word-cloud plot.
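The abstract does not include code, so the following is only a minimal, dependency-free sketch of three of the steps above: TF-IDF vectorization, K-Means clustering, and keyword extraction per cluster. It is not the project's implementation. The toy documents, the stopword list, and all function names are hypothetical. PCA and the Gaussian mixture step are omitted for brevity, and the centroids are seeded from the first k documents so the run is deterministic.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list; a real pipeline would use a much larger one.
STOPWORDS = {"and", "are", "for", "the", "with"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens of 3+ letters, drop stopwords."""
    return [t for t in re.findall(r"[a-z]{3,}", text.lower()) if t not in STOPWORDS]


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised TF-IDF vectors) for a list of strings."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, n_iter=20):
    """Lloyd's algorithm; the first k vectors seed the centroids (an assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def top_keywords(vocab, centroid, n=3):
    """Terms with the largest weights in a cluster centroid."""
    ranked = sorted(zip(centroid, vocab), reverse=True)
    return [term for _, term in ranked[:n]]


# Toy stand-ins for article texts (not taken from CORD-19).
docs = [
    "Fever and cough are common symptoms; antiviral treatment reduces fever.",
    "School closures moved students to online learning during lockdown.",
    "Patients with cough and fever received antiviral treatment.",
    "Online learning replaced school classes for students.",
]

vocab, vectors = tfidf_vectors(docs)
labels, centroids = kmeans(vectors, k=2)
for c, centroid in enumerate(centroids):
    print(c, top_keywords(vocab, centroid))
```

In practice, scikit-learn provides `TfidfVectorizer`, `PCA`, `KMeans` and `GaussianMixture` for these steps. Whether the project used them is not stated here.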

The results suggest approximately 17 clusters with K-Means and 6 clusters with the Gaussian Mixture Model. Surprisingly, the clusters found by K-Means performed better than those found by the Gaussian Mixture Model. Each cluster represents a distinct COVID-19-related concern, such as international law and public health policy, coronavirus and schooling issues, or symptoms and treatment.

Report

Full article
Presentation

Author

Nguyen QA

jyanqa
