psz-crawler

This project was a school homework assignment.
The task is to scrape information about all music albums released in Yugoslavia and Serbia from https://www.discogs.com/ and run some data analysis on it.

Installation

pip install -r requirements.txt

Scraping

The first part is to scrape all the data. I used the Scrapy framework for this.

Task 1

The spider can be started with:

cd discogs
scrapy crawl discogs

Be aware that this process can take many hours. All downloaded data is stored in a MySQL relational database (approx. 50k releases, 6k master albums, and many more tracks).
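
As a rough illustration of how scraped items could end up in a database, here is a minimal sketch of a Scrapy-style item pipeline. Scrapy pipelines are plain Python classes with a process_item method, so no Scrapy import is needed here. The class name, table name, and item fields are hypothetical and not taken from this repository, and sqlite3 stands in for MySQL so the sketch runs without a database server.

```python
import sqlite3


class ReleasePipeline:
    """Hypothetical item pipeline; sqlite3 is a stand-in for MySQL."""

    def __init__(self, db_path=":memory:"):
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: connect and create the table.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS releases ("
            "id INTEGER PRIMARY KEY, title TEXT, country TEXT, year INTEGER)"
        )

    def process_item(self, item, spider):
        # INSERT OR REPLACE keeps a restarted crawl from duplicating rows.
        self.conn.execute(
            "INSERT OR REPLACE INTO releases (id, title, country, year) "
            "VALUES (?, ?, ?, ?)",
            (item["id"], item["title"], item["country"], item.get("year")),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.conn.close()
```

Keying rows on the release id makes the pipeline idempotent, which matters for a crawl that runs for hours and may need to be restarted.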

Data analysis

Task 2

Task 2 runs a couple of SQL queries on the data, such as which albums had the most releases, the top 50 persons by average rating in credits, and the top 50 persons with the most credited appearances on vocals.

python analysis/task2/task2.py
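
To give an idea of what a query like "which albums had the most releases" might look like, here is a minimal sketch. The table and column names (masters, releases, master_id) are assumptions, not the project's actual schema, and an in-memory sqlite3 database with a few made-up rows stands in for the MySQL database.

```python
import sqlite3


def top_masters_by_release_count(conn, limit=50):
    """Return (title, release_count) pairs, most-released albums first."""
    query = """
        SELECT m.title, COUNT(r.id) AS release_count
        FROM masters AS m
        JOIN releases AS r ON r.master_id = m.id
        GROUP BY m.id, m.title
        ORDER BY release_count DESC, m.title ASC
        LIMIT ?
    """
    return conn.execute(query, (limit,)).fetchall()


def build_demo_db():
    """Create a tiny in-memory database with made-up rows (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE masters (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE releases (id INTEGER PRIMARY KEY, master_id INTEGER, title TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO masters VALUES (?, ?)",
        [(1, "Album A"), (2, "Album B")],
    )
    conn.executemany(
        "INSERT INTO releases VALUES (?, ?, ?)",
        [(10, 1, "Album A (LP)"), (11, 1, "Album A (Cassette)"), (12, 2, "Album B (LP)")],
    )
    conn.commit()
    return conn
```

The same GROUP BY / ORDER BY / LIMIT shape would apply to the "top 50" credit queries, with the join going through whatever credits table the real schema uses.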

Task 3

Task 3 runs SQL queries and visualizes the data using plotly. All produced results can be found in ~/charts.

python analysis/task3/task3.py

Task 4

Task 4 runs the K-Means algorithm (from scikit-learn) on the data, based on features such as genre, style, format, year of release, number of tracks, and rating. It then applies PCA and plots the K clusters in 2D.

python analysis/task4/kmeans.py
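
The clustering step can be sketched roughly as follows. This is not the code in kmeans.py: the function name and the synthetic data are made up for illustration, only numeric features (year, track count, rating) are used, and categorical features such as genre, style, and format would first need an encoding such as one-hot. It assumes numpy and scikit-learn are installed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def cluster_and_project(features, k=3, seed=0):
    """Cluster rows with K-Means and project them to 2D with PCA."""
    # Scale first so year (in the thousands) does not dominate rating (1 to 5).
    scaled = StandardScaler().fit_transform(features)
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(scaled)
    points_2d = PCA(n_components=2, random_state=seed).fit_transform(scaled)
    return labels, points_2d


# Synthetic demo data (made up): columns are year, track count, rating.
rng = np.random.default_rng(0)
demo = np.vstack([
    rng.normal([1975, 8, 3.5], [2, 1, 0.2], size=(20, 3)),
    rng.normal([1990, 12, 4.2], [2, 1, 0.2], size=(20, 3)),
    rng.normal([2010, 16, 4.8], [2, 1, 0.2], size=(20, 3)),
])

labels, points_2d = cluster_and_project(demo, k=3)
```

Each row of points_2d can then be drawn as a scatter point colored by its entry in labels to get the 2D cluster plot.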
