novel-dataset's Introduction

Graph illustration of the novels


Creates a dataset from novelupdates (https://www.novelupdates.com) containing information about translated novels. The dataset covers novels translated into English from eight original languages (Chinese, Japanese, Korean, Malaysian, Filipino, Indonesian, Khmer, and Thai). It currently contains 21,831 novels.

The dataset includes both per-novel statistics, such as the number of chapters and rankings, and relations to other novels.

Current Version: 0.1.4
Updated on 2024-07-10

Dataset columns:

  • General Information
    • Novel ID
    • Name
    • Associated Names
    • Original Language
    • Author / Authors
    • Genres
    • Tags
  • Publishing Information
    • Start Year
    • Licensed
    • Original Publisher
    • English Publisher
  • Chapter Information
    • Number of Chapters (original language)
    • Completed (original language)
    • Number of Chapters (translation)
    • Completed (translation)
  • Release Information (translation)
    • Release Frequency
    • Activity Weekly Rank
    • Activity Monthly Rank
    • Activity All-time Rank
  • Community Information (translation)
    • On Number of Reading Lists
    • Reading List Monthly Rank
    • Reading List All-time Rank
    • Rating
    • Rating Votes
  • Related Series Information
    • Related Series IDs
    • Recommended Series IDs
    • Recommendation List IDs

novel-dataset's People

Contributors

knguy22, shaido987


novel-dataset's Issues

Improve code structure

  • Create a class containing all website scraping methods (private or public as needed), keeping only very simple code inside the notebook.
  • Add this to a .py file and create a "code" folder to put both files inside.

Update related/recommended series

Currently "related_series_ids" and "recommended_series_ids" are lists of strings while the "recommendation_list_ids" column contains a list of ints.

Update "related_series_ids" and "recommended_series_ids" to lists of ints for consistency.
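
A minimal sketch of the conversion, assuming each cell already holds a Python list of numeric strings (if the columns are stored as stringified lists, they would need parsing first). The helper name `ids_to_ints` and the sample row are hypothetical; only the three column names come from the dataset.

```python
def ids_to_ints(ids):
    """Convert a list of ID strings such as ["12", "345"] to a list of ints."""
    return [int(i) for i in ids]


# Hypothetical sample row; real rows come from the scraped dataset.
rows = [
    {
        "related_series_ids": ["12", "345"],
        "recommended_series_ids": [],
        "recommendation_list_ids": [7],  # already a list of ints
    },
]

for row in rows:
    for column in ("related_series_ids", "recommended_series_ids"):
        row[column] = ids_to_ints(row[column])
```

After this step all three ID columns hold lists of ints, so they can be compared or joined without casting.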

Add comments

Add proper docstring comments to all functions.

Suggestions on enhancements

Any suggestions on dataset or code enhancements are welcome.

In particular, if you identify any issues with the dataset, its structure, or missing information, please point them out.

Visualization example

Add an image with a graph to the readme.

  • Nodes are novels
  • Node colors represent the main genre
  • Edges are ???

Some suggestions for edges:

  • same author (too few edges)
  • Related / recommended series (directed graph)
  • On same reading list
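
The related / recommended series option could be sketched as follows. This is an illustration only: the `"id"` key and the sample records are hypothetical stand-ins for the Novel ID and the ID-list columns, and the ID lists are assumed to already be ints.

```python
def directed_edges(novels, column):
    """Return (source_id, target_id) pairs, one per ID listed in `column`."""
    known = {novel["id"] for novel in novels}
    return [
        (novel["id"], target)
        for novel in novels
        for target in novel[column]
        if target in known  # skip links to novels outside the dataset
    ]


# Hypothetical sample records.
novels = [
    {"id": 1, "recommended_series_ids": [2, 3]},
    {"id": 2, "recommended_series_ids": [1]},
    {"id": 3, "recommended_series_ids": [99]},
]

edges = directed_edges(novels, "recommended_series_ids")
```

The edges are directed because a recommendation from one novel to another is not necessarily mirrored. The resulting pairs can be passed to any graph or plotting library.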

Add progress bar

Add a progress bar when running to give an estimate on how long the run will take.
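
One way this might look using only the standard library is sketched below. The function name `with_progress` and the list of novel IDs are hypothetical, and the remaining time is a rough estimate based on the average time per item so far. A third-party package such as tqdm offers a ready-made alternative.

```python
import sys
import time


def with_progress(items, total, out=sys.stderr):
    """Yield items while printing progress and a rough remaining-time estimate."""
    start = time.monotonic()
    for done, item in enumerate(items, start=1):
        yield item
        # Runs after the caller has finished processing `item`.
        elapsed = time.monotonic() - start
        remaining = elapsed / done * (total - done)
        out.write(f"\r{done}/{total} novels, about {remaining:.0f}s left")
        out.flush()
    out.write("\n")


# Hypothetical IDs; a real run would scrape each novel inside the loop.
novel_ids = [101, 102, 103]
scraped = [novel_id for novel_id in with_progress(novel_ids, len(novel_ids))]
```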

Add the new "Recommendation Lists" information

Novelupdates has added a new field called "Recommendation Lists". Add the ids of these lists to the collected data. This could be useful for connecting series that appear on the same lists.
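
As a rough sketch of that idea, series can be grouped by list ID and every pair within a group connected. The `"id"` key and the sample records are hypothetical; `recommendation_list_ids` is assumed to hold a list of ints.

```python
from collections import defaultdict
from itertools import combinations


def shared_list_edges(novels):
    """Return sorted, de-duplicated pairs of novel IDs that share a list."""
    members = defaultdict(list)
    for novel in novels:
        for list_id in novel["recommendation_list_ids"]:
            members[list_id].append(novel["id"])

    edges = set()
    for ids in members.values():
        # Sorting keeps each pair in (smaller, larger) order.
        edges.update(combinations(sorted(ids), 2))
    return sorted(edges)


# Hypothetical sample records.
novels = [
    {"id": 1, "recommendation_list_ids": [10, 11]},
    {"id": 2, "recommendation_list_ids": [10]},
    {"id": 3, "recommendation_list_ids": [11, 10]},
]

pairs = shared_list_edges(novels)
```

Large lists produce a number of pairs that grows quadratically with list size, so a cap on list size may be worth considering.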

Restructure code

Move all dataset creation code to scraper.py (maybe rename it to web_scraper.py or scrape_dataset.py?). Put the code from the notebook into an if __name__ == "__main__": section.

Values that could be exposed as command-line arguments:

  • Debug mode
  • Save file or version number
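
A minimal sketch of such an entry point, using `argparse` from the standard library. The flag names `--debug` and `--dataset-version` and the function names are assumptions for illustration, not the project's actual interface; the scraping itself is left as a placeholder.

```python
import argparse


def parse_args(argv=None):
    """Parse the command-line options for a scraping run."""
    parser = argparse.ArgumentParser(description="Scrape novel data into a dataset file.")
    parser.add_argument("--debug", action="store_true",
                        help="run in debug mode (hypothetical flag)")
    parser.add_argument("--dataset-version", default="0.1.4",
                        help="version label for the saved file (hypothetical flag)")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Placeholder: the dataset creation code moved out of the notebook goes here.
    print(f"debug={args.debug} version={args.dataset_version}")


if __name__ == "__main__":
    main()
```

Keeping the parsing in its own function lets the notebook call the same code with an explicit argument list instead of the real command line.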

Novelupdates returns 403 error

As of now, any request to novelupdates returns a 403 error, even when using cfparser. The cfparser package appears to be dead, with no updates in recent years, so it would most likely have to be replaced with another package. An alternative would be cfscrape, but it also looks quite dead.

Some further investigation is needed here before the dataset can be updated again.

Novel names missing

The novelupdates website has changed slightly, which results in many novel names being missing.
