An application of different clustering methods to cluster 1000 academic articles
GMM, K-Means, Model-based Method, NLP, TF–IDF
The main objective of this project is to cluster academic articles into groups. The data set is CORD-19, a resource of over 300,000 scholarly articles about COVID-19 provided by the White House and a coalition of research groups. A subset of 1000 articles is split off from the original data for use in this project.
Each article in the data set is parsed, cleaned and vectorized. Principal Component Analysis (PCA) is then applied to the vectorized data to reduce its dimensionality. Next, K-Means and the Gaussian Mixture Method are applied to cluster the projected data. Finally, we find the representative keywords of each cluster and visualize them in word-cloud plots.
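As a rough illustration of the vectorize, cluster and keyword steps, the following is a minimal sketch in plain Python and is not the project's actual implementation. The toy documents, the stop-word list, the smoothed IDF variant, the farthest-point initialization and all function names are assumptions made for this example. PCA and the Gaussian Mixture step are omitted to keep the sketch short; in practice a library such as scikit-learn (TfidfVectorizer, PCA, KMeans, GaussianMixture) would likely be used for the full pipeline.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; a real pipeline would use a much larger one.
STOP_WORDS = {"and", "the", "with", "during"}

def tokenize(text):
    """Crude cleaning: lowercase, keep alphabetic tokens of 3+ letters, drop stop words."""
    return [t for t in re.findall(r"[a-z]{3,}", text.lower()) if t not in STOP_WORDS]

def tfidf_matrix(docs):
    """Return (vocab, rows); each row is an L2-normalized TF-IDF vector over vocab."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed IDF: one common variant, assumed here for illustration.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        rows.append([x / norm for x in row])
    return vocab, rows

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, iters=20):
    """Lloyd's algorithm with a deterministic farthest-point initialization."""
    centers = [rows[0]]
    while len(centers) < k:
        centers.append(max(rows, key=lambda r: min(sq_dist(r, c) for c in centers)))
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assignment step: nearest center for each document.
        labels = [min(range(k), key=lambda j: sq_dist(r, centers[j])) for r in rows]
        # Update step: each center becomes the mean of its members.
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def top_keywords(vocab, rows, labels, cluster, n=3):
    """Rank terms by mean TF-IDF weight within one cluster."""
    members = [r for r, lab in zip(rows, labels) if lab == cluster]
    mean = [sum(col) / len(members) for col in zip(*members)]
    ranked = sorted(zip(vocab, mean), key=lambda p: (-p[1], p[0]))
    return [word for word, _ in ranked[:n]]

# Toy corpus standing in for parsed article text (made up for this sketch).
docs = [
    "School closures and online schooling during the pandemic",
    "Schooling disruption: school closures affect students",
    "Fever and cough symptoms and antiviral treatment",
    "Treatment of patients with fever symptoms",
]
vocab, rows = tfidf_matrix(docs)
labels = kmeans(rows, k=2)
for c in sorted(set(labels)):
    print(c, top_keywords(vocab, rows, labels, c))
```

The keyword ranking here uses the mean TF-IDF weight per cluster, which is only one plausible way to pick representative terms for a word cloud.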
The results show approximately 17 clusters with the K-Means method and 6 clusters with the Gaussian Mixture Method. Surprisingly, the clusters determined by K-Means performed better than those determined by the Gaussian Mixture Method. Each cluster represents a distinct COVID-19-related concern, such as International Law and Public Health Policy, Coronavirus and Schooling Issues, or Symptoms and Treatment.
Nguyen QA