shaido987 / novel-dataset

Dataset with 10k+ novels.

Python 2.50% Jupyter Notebook 97.50%

novels novelupdates dataset dataset-generation translated-novels hacktoberfest

novel-dataset's Introduction

Creates a dataset from novelupdates (https://www.novelupdates.com) containing information about translated novels. The dataset contains translated English novels from eight original languages (Chinese, Japanese, Korean, Malaysian, Filipino, Indonesian, Khmer, and Thai). There is currently a total of 21,831 novels.

The dataset includes both per-novel statistics, such as the number of chapters and rankings, and relations to other novels.

Current Version: 0.1.4
Updated on 2024-07-10

Dataset columns:

General Information
- Novel ID
- Name
- Associated Names
- Original Language
- Author / Authors
- Genres
- Tags
Publishing Information
- Start Year
- Licensed
- Original Publisher
- English Publisher
Chapter Information
- Number of Chapters (original language)
- Completed (original language)
- Number of Chapters (translation)
- Completed (translation)
Release Information (translation)
- Release Frequency
- Activity Weekly Rank
- Activity Monthly Rank
- Activity All-time Rank
Community Information (translation)
- On Number of Reading Lists
- Reading List Monthly Rank
- Reading List All-time Rank
- Rating
- Rating Votes
Related Series Information
- Related Series IDs
- Recommended Series IDs
- Recommendation List IDs

novel-dataset's Issues

Unify None/NaN/[]

Always use the same value to show that no value exists.
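One possible way to do this is sketched below. It assumes `None` is chosen as the single "no value" marker (the issue does not say which value to use), and `normalize_missing` is a hypothetical helper, not a function from the repository.

```python
import math


def normalize_missing(value):
    """Map None, NaN, empty strings and empty containers to None.

    Assumption: None is the unified marker for "no value".
    Legitimate falsy values such as 0 and False are kept unchanged.
    """
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, (str, list, tuple, set, dict)) and len(value) == 0:
        return None
    return value


# Made-up row for illustration only.
row = {"name": "Example Novel", "rating": float("nan"), "tags": [], "licensed": False}
clean = {key: normalize_missing(val) for key, val in row.items()}
print(clean)
```

Applying the helper to every scraped field before saving would keep the CSV consistent regardless of which scraping function produced the value.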

Improve code structure

Create a class containing all website scraping methods (private or public as needed), and keep only very simple code inside the notebook.
Put the class in a .py file and create a "code" folder to hold both files.

List of recommended series ids is incomplete

The list with recommended series ids is missing a lot of entries.

For example, '32837' has 5 recommended series, but the 1.0.3 dataset contains only a single entry ('36912'), an id that is not itself in the dataset.

Update the final csv file

Run the complete 0.1 version of the program and upload the scraped csv file.

Update related/recommended series

Currently "related_series_ids" and "recommended_series_ids" are lists of strings while the "recommendation_list_ids" column contains a list of ints.

Update "related_series_ids" and "recommended_series_ids" to lists of ints for consistency.
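A minimal sketch of the conversion follows. The three column names come from the issue text; the `to_int_ids` helper and the sample values in `record` are hypothetical.

```python
def to_int_ids(ids):
    """Convert a list of id strings such as ['32837'] to a list of ints.

    None is passed through so that a missing list stays missing.
    """
    if ids is None:
        return None
    return [int(i) for i in ids]


# Made-up record; only the column names are taken from the dataset.
record = {
    "related_series_ids": ["32837"],
    "recommended_series_ids": ["36912"],
    "recommendation_list_ids": [101, 202],
}
for column in ("related_series_ids", "recommended_series_ids"):
    record[column] = to_int_ids(record[column])
print(record)
```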

Add support for retrieving metadata from www.wlnupdates.com

Can you add an option to retrieve metadata from www.wlnupdates.com? It is updated much more frequently, and it has some novels that are not included in the www.novelupdates.com dataset.

You can check the wlnupdates.com API here:

https://github.com/fake-name/wlnupdates/blob/master/app/templates/api-docs.md

Add comments

Add proper docstring comments to all functions.

Suggestions on enhancements

Any suggestions on dataset or code enhancements are welcome.

In particular, please point out any issues you find with the dataset, its structure, or missing information.

Visualization example

Add an image of a graph to the readme.

Nodes are novels
Node colors are the main genres
Edges are ???

Some suggestions for edges:

same author (too few edges)
Related / recommended series (directed graph)
On same reading list
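As an illustration of the related / recommended option, the sketch below builds a directed edge list with the standard library only. The `id` key and the sample records are assumptions; the column names follow the dataset. Targets that are not in the dataset are dropped so that every edge connects two known nodes.

```python
def build_edges(novels):
    """Return sorted directed (source_id, target_id) edges.

    Only targets present in the dataset are kept, and self-loops are skipped.
    """
    known = {novel["id"] for novel in novels}
    edges = set()
    for novel in novels:
        for key in ("related_series_ids", "recommended_series_ids"):
            for target in novel.get(key) or []:
                if target in known and target != novel["id"]:
                    edges.add((novel["id"], target))
    return sorted(edges)


# Made-up records; id 99 does not exist in this sample and is dropped.
novels = [
    {"id": 1, "related_series_ids": [2], "recommended_series_ids": [3, 99]},
    {"id": 2, "related_series_ids": [1], "recommended_series_ids": None},
    {"id": 3, "related_series_ids": [], "recommended_series_ids": [1]},
]
edges = build_edges(novels)
print(edges)
```

The resulting edge list can be passed to whichever graph or plotting library ends up being used for the image.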

Add progress bar

Add a progress bar when running to give an estimate on how long the run will take.
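A rough sketch of such a progress bar using only the standard library is shown below. The `progress` helper and the placeholder ids are hypothetical; the third-party tqdm package offers a ready-made alternative if an extra dependency is acceptable.

```python
import sys
import time


def progress(items, total, width=30, out=sys.stderr):
    """Yield items while printing a text progress bar with a rough ETA.

    The estimate assumes every item takes about as long as the average so far.
    """
    start = time.monotonic()
    for done, item in enumerate(items, start=1):
        yield item
        elapsed = time.monotonic() - start
        remaining = elapsed / done * (total - done)
        filled = width * done // total
        bar = "#" * filled + "-" * (width - filled)
        out.write(f"\r[{bar}] {done}/{total} ETA {remaining:.0f}s")
        out.flush()
    out.write("\n")


novel_ids = range(1, 6)  # placeholder ids, not real novel ids
scraped = [novel_id for novel_id in progress(novel_ids, total=5)]
```

The bar is updated after each item has been processed by the caller, so the time spent scraping a novel is included in the estimate.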

More representative graph image

Create a more representative graph image for the readme (i.e., more nodes and better looking).

Latest chapter in original language incorrect

The "Status in COO" field is currently scraped incorrectly: the value is mostly None or a single character.

Remove chapter_info dependency on soup

Remove the soup input from the chapter_info function. It should be unnecessary.

Add the new "Recommendation Lists" information

Novelupdates has added a new field called "Recommendation Lists". Add the ids of these lists to the collected data. They could be useful for connecting series that appear on the same lists.
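To illustrate how the list ids could connect series, the sketch below counts how many recommendation lists each pair of novels shares. The input mapping (novel id to list ids) and all sample values are made up for the example.

```python
from collections import defaultdict
from itertools import combinations


def series_sharing_lists(recommendation_lists):
    """Map each pair of novel ids to the number of recommendation lists they share."""
    members = defaultdict(set)
    for novel_id, list_ids in recommendation_lists.items():
        for list_id in list_ids or []:  # treat None as "no lists"
            members[list_id].add(novel_id)

    shared = defaultdict(int)
    for novel_ids in members.values():
        for a, b in combinations(sorted(novel_ids), 2):
            shared[(a, b)] += 1
    return dict(shared)


# Hypothetical data: novel id -> recommendation list ids.
sample = {1: [10, 11], 2: [10], 3: [10, 11], 4: None}
pairs = series_sharing_lists(sample)
print(pairs)
```

The counts could serve as edge weights if the pairs are later drawn as a graph.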

Related series return incorrect

When a novel has no related series, the returned value is currently NaN, but it should be None.

Recommended series is nearly always empty

The list of recommended series is not correctly scraped and will always return an empty list.

Restructure code

Move all dataset creation code to scraper.py (maybe rename it to web_scraper.py or scrape_dataset.py?). Put the code from the notebook into an if __name__ == "__main__": section.

Parameters that could potentially be used as input parameters (i.e., that can be used as args input):

Debug mode
Save file or version number
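A possible shape for that entry point is sketched below with argparse. The flag names (`--debug`, `--output`, `--dataset-version`) and the default file name are assumptions, and the actual scraping call is left out.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options for the scraper (flag names are hypothetical)."""
    parser = argparse.ArgumentParser(description="Scrape the novel dataset.")
    parser.add_argument("--debug", action="store_true",
                        help="run in debug mode")
    parser.add_argument("--output", default="novels.csv",
                        help="file to save the dataset to (placeholder name)")
    parser.add_argument("--dataset-version", default="0.1.4",
                        help="version number to tag the dataset with")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # The real scraping code would be called here; this sketch only reports settings.
    print(f"debug={args.debug} output={args.output} version={args.dataset_version}")
    return args


if __name__ == "__main__":
    main()
```

Keeping the parsing in its own function makes it possible to call `parse_args([...])` from the notebook or from tests without touching the real command line.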

Novelupdates return 403 error

As of now, all requests to novelupdates return a 403 error, even when using cfparser. The cfparser package looks to be dead, with no updates in recent years, so it would most likely have to be replaced with another package. An alternative would be cfscrape; however, it also looks to be quite dead...

Some further investigation is needed here before the dataset can be updated again.

Novel names missing

The novelupdates website has changed slightly, which results in a lot of novel names being missing.