balaka-18 / rake_new2

A Python library that enables smooth keyword extraction from any text using the RAKE (Rapid Automatic Keyword Extraction) algorithm.

License: MIT License

Python 100.00%

text text-data keyword-extraction keywords keyword-search nlp python-library

rake_new2's Introduction

ABOUT THIS PROJECT

rake_new2

rake_new2 is a Python library that enables simple and fast keyword extraction from any text. It helps beginners, or anyone struggling to find keywords, understand which keywords are more important.

HOW IS THIS DIFFERENT FROM ANY OTHER ALGORITHM? : This library returns a weight/score along with each keyword/key-phrase. This helps you pick out the right key-phrases: just choose the ones with the higher weights.

New in version 1.0.5

Handles repeated keywords/key-phrases.
Handles consecutive punctuation marks.
Handles HTML tags in text : The user can choose whether to keep HTML tags as keywords too.

Installation

Use the package manager pip to install rake_new2.

pip install rake_new2

Quick Start

from rake_new2 import Rake

text = "Red apples are good in taste."
text2 = "<h1> Hello world !</h1>"
rk = Rake()
rk_new1 = Rake(keep_html_tags=True)
rk_new2 = Rake(keep_html_tags=False)

# Case 1 : Basic keyword extraction
# Extract keywords from the raw text
rk.get_keywords_from_raw_text(text)
kw_s = rk.get_keywords_with_scores()
# Returns keywords with degree scores : {(1.0, 'taste'), (1.0, 'good'), (4.0, 'red apples')}
kw = rk.get_ranked_keywords()
# Returns keywords only : ['red apples', 'taste', 'good']
f = rk.get_word_freq()
# Returns word frequencies as a Counter object : {'red': 1, 'apples': 1, 'good': 1, 'taste': 1}
deg = rk.get_kw_degree()
# Returns word degrees as defaultdict object : {'red': 2.0, 'apples': 2.0, 'good': 1.0, 'taste': 1.0}

# Case 2 : Sample case for testing the 'keep_html_tags' parameter. Default = False
print("\nORIGINAL TEXT : {}".format(text2))
# Sub Case 1 : Keeping the HTML tags
rk_new1.get_keywords_from_raw_text(text2)
kw_s1 = rk_new1.get_keywords_with_scores()
kw1 = rk_new1.get_ranked_keywords()
print("Keeping the tags : ",kw1)

# Sub Case 2 : Eliminating the HTML tags
rk_new2.get_keywords_from_raw_text(text2)
kw_s2 = rk_new2.get_keywords_with_scores()
kw2 = rk_new2.get_ranked_keywords()
print("Eliminating the tags : ",kw2)

'''OUTPUT >>
ORIGINAL TEXT : <h1> Hello world !</h1>
Keeping the tags :  {'h1', 'hello'}
Eliminating the tags :  {'hello world'}
'''
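The scores in Case 1 can be reproduced by hand. The following is a minimal sketch, assuming (as the sample output suggests) that a word's degree is the total length of the phrases it appears in and that a phrase's score is the sum of its words' degrees. The helper name score_phrases is hypothetical and is not part of rake_new2; the library's own implementation may differ.

```python
from collections import Counter, defaultdict


def score_phrases(phrases):
    """Compute word frequency, word degree and phrase scores (illustrative only)."""
    freq = Counter()
    degree = defaultdict(float)
    for phrase in phrases:
        words = phrase.split()
        for word in words:
            freq[word] += 1
            # Each occurrence adds the length of the phrase the word sits in.
            degree[word] += len(words)
    # Assumption: a phrase's score is the sum of its words' degrees.
    scores = {p: sum(degree[w] for w in p.split()) for p in phrases}
    return freq, degree, scores


freq, degree, scores = score_phrases(["red apples", "good", "taste"])
print(scores)  # {'red apples': 4.0, 'good': 1.0, 'taste': 1.0}
```

Under these assumptions, 'red' and 'apples' each get degree 2.0 because they share a two-word phrase, which is why 'red apples' scores 4.0 while the single-word phrases score 1.0.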

Debugging

You might come across a stopwords error.

It implies that you do not have the stopwords corpus downloaded from NLTK.

To download it, use the command below.

python -c "import nltk; nltk.download('stopwords')"
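The same check can be made from inside a script, so the corpus is downloaded only when it is missing. This is a minimal sketch that relies on NLTK's nltk.data.find (which raises LookupError for a missing resource) and nltk.download. The helper name ensure_stopwords and its find/download parameters are hypothetical, added only so the logic can be exercised without NLTK installed.

```python
def ensure_stopwords(find=None, download=None):
    """Download the NLTK stopwords corpus only if it is not already present.

    `find` and `download` default to nltk.data.find and nltk.download.
    Returns True if a download was triggered, False otherwise.
    """
    if find is None or download is None:
        import nltk  # imported lazily; only needed when the defaults are used

        find = find or nltk.data.find
        download = download or nltk.download
    try:
        find("corpora/stopwords")
    except LookupError:
        download("stopwords")
        return True
    return False
```

Calling ensure_stopwords() once before creating a Rake object should avoid the stopwords error on a fresh machine, provided network access is available.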

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

License

MIT

Contributors

Student Name	GitHub ID	Merged PR No.	Open source programme name	If DWOC, level of PR
Sabarish Rajamohan	sabarish98	#16	Hacktoberfest	--
Soham Kar	2bit-hack	#20	Hacktoberfest	--
Jawen Voon	jawsvk	#26	Hacktoberfest	--
Ananthakrishnan Nair RS	akrish4	#47	DWOC	Level-1
Tushar Nankani	tusharnankani	#43	DWOC	Level-3

rake_new2's Issues

Stopwords filter

Description

Edge Case : Since keywords are mainly formed by removing stopwords, in some cases the extracted keywords do not capture the exact meaning of the text.

For example : If the text is "I like sweet apples but I don't like sour apples", the extracted keywords will be 'I', 'like', 'sweet', 'sour', 'apples', with 'apples' shown as the highest-priority keyword. The negation ("don't") is lost, so a summary built from these keywords changes the meaning completely.

This issue asks you to find a way around this problem, or brainstorm with me and other interested contributors.
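One possible direction, sketched under heavy assumptions: split the text on stopwords as RAKE-style extractors do, but remember whether a negation word was dropped just before each phrase. The tiny STOPWORDS and NEGATIONS sets and the candidate_phrases helper are hypothetical, chosen only for this example; they are not the NLTK stopword list and not rake_new2's implementation.

```python
import re

# Illustrative word lists only; every negation must also be a stopword here.
STOPWORDS = {"i", "but", "don't", "not", "never"}
NEGATIONS = {"don't", "not", "never"}


def candidate_phrases(text, stopwords=STOPWORDS, negations=NEGATIONS):
    """Split text on stopwords and flag phrases that follow a negation."""
    words = re.findall(r"[a-z']+", text.lower())
    phrases, current, negated = [], [], False
    for word in words:
        if word in stopwords:
            if current:
                phrases.append((" ".join(current), negated))
                current, negated = [], False
            if word in negations:
                negated = True  # applies to the next phrase that starts
        else:
            current.append(word)
    if current:
        phrases.append((" ".join(current), negated))
    return phrases


result = candidate_phrases("I like sweet apples but I don't like sour apples")
print(result)  # [('like sweet apples', False), ('like sour apples', True)]
```

The flag could then be shown next to the keyword or used to adjust its score; which of those is appropriate is left open by the issue.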

Read : How to use rake_new2

NOTE : This may be a multi-assignee issue

Folder Structure, Function details

Create a folder algorithm_addons in the root directory and write a Python function to work around this problem. If I approve, I will create a function directly in the rake_new2.py main file, with the contributors' name on top of the function.

Example naming convention : algorithm_addons/stopwords_debug.py

Acceptance Criteria

The .py file must be properly formatted.
All instructions provided in the Description must be strictly followed.

Definition of Done

All of the required items are completed.
Approval by at least 1 mentor.

Time Estimation

Recurring

Literature study on different keyword extraction algorithms

Description

Create an extensive, in-depth literature study of keyword extraction algorithms other than RAKE and TF-IDF. Every algorithm must be accompanied by a brief logical / mathematical explanation and examples (in text or in the form of pictures / diagrams).

File structure

Create a Literature_Survey.md file in the root directory.

Acceptance Criteria

All instructions provided in the Description must be strictly followed.

Definition of Done

All of the required items are completed.
Approval by 1 mentor.

Time Estimation

5 days - 1 week.

Add Hacktoberfest logo to README.md

Description

Add the Hacktoberfest logo to the very start of the README file and add the link to the Contributing_Guidelines.md file and the Issues tab in the repository. The links should read :

'CLICK HERE TO START CONTRIBUTING' --> Link to the Issues tab.
'READ THE CONTRIBUTING GUIDELINES' --> Link to the Contributing_Guidelines.md file

Acceptance Criteria

README must be properly formatted.
All instructions provided in the Description must be strictly followed.

Definition of Done

All of the required items are completed.
Approval by 1 mentor.

Time Estimation

10 minutes.

cleaning the repository and adding a .gitignore

For now run and build artifacts are inside the repository:
- build
- dist
- __pycache__
- egg-info
They should be removed from the repository.
To prevent unwanted build and run artifacts from being committed in the future, we should add a .gitignore file.
I propose taking the Python project gitignore from https://www.toptal.com/developers/gitignore. It's the classic .gitignore file for Python projects.

If you are OK with these changes, I have a branch ready for a pull request!

Shields links are not correct

Shields for PyPI, issues, forks and stars should link to the corresponding pages on PyPI and GitHub. For now they just link to the images.

UPDATE AND IMPROVE README LOOK

Hi @BALaka-18

Whenever anyone visits a project repository, they look at the README file first, as it is the most integral part of the project.
A good README opens with the project heading, a description and badges. This project's README has no heading with the project name, the badges are not center-aligned, and only a few badges are used. Let me know if you would like me to improve its look.

I would be glad to work on it @BALaka-18

Kindly assign it to me.

I can add a heading and a description, add a few more badges, and then center-align them all.

I hope to work on it and earn a good level for it.

Create CODE_OF_CONDUCT

Integrate welcome bot

I can add a welcome bot config file with a proper message that will show up when a user opens an issue or pull request for the first time.
For reference, check out: https://github.com/apps/welcome

Please assign it to me.

Create a Pull request Template file.

Description

Create a PULL_REQUEST_TEMPLATE.md file that contains the skeleton of a PR description, so that the template is generated each time a PR is created.

If you don't know what this template is or how to create it, read this link on creating pull request templates.

Acceptance Criteria

PULL_REQUEST_TEMPLATE.md file must be properly created.
All instructions provided in the Description must be strictly followed.

Definition of Done

All of the required items are completed.
Approval by 1 mentor.

Time Estimation

20 minutes

Basic frontend for public web app

Description

Create the frontend of a web application that'll be used to make the library accessible to the users via the web.

The color scheme is up to you.

The web app must contain :

1. A text area for the users to type in the text.
2. A button that allows users to upload text files if they don't want to type.
3. Two radio buttons : 1. Keep HTML, 2. Don't keep HTML
4. A dropdown list with options : 1. Get keywords only, 2. Get keywords with scores, 3. Get top 5 keywords.
5. Two radio buttons : 1. Show top 5 most frequent words, 2. Don't show frequent words.
6. A button that links people to the official PyPI page of rake_new2.
7. A button that has the text : Click to extract keywords.
8. ALL SEVEN ITEMS MUST BE WRAPPED INSIDE ONE BOX THAT WILL BE CENTERED.
9. HEADING : WELCOME TO rake_new2 : A PYTHON LIBRARY THAT HELPS IN SMOOTH KEYWORD EXTRACTION.
10. ON THE TOP RIGHT CORNER, THERE SHOULD BE A DESIGN LIKE THIS (any color) :

    INSIDE THIS SHOULD BE A LOGO OF GITHUB, AND THIS TRIANGLE SHOULD LINK TO : https://github.com/BALaka-18/rake_new2/issues
11. Below the box (the box stated in point 8), there should be a bold, legible text that says : Want to contribute ? Have a better idea to enhance our library ? Click on the top right corner of this page.

File structure

Create the files according to convention.

--> All HTML files must be under : web_app/templates/
--> All CSS and JS(if any) files under : web_app/static/

PR INSTRUCTION :

ALL PRs MUST BE MADE TO THE web-app BRANCH ONLY, ELSE THEY WILL BE REJECTED.

Acceptance Criteria

All instructions provided in the Description must be strictly followed.
Must be neat and formal.
All criteria must be satisfied.
Must be functioning.
PR must follow PR instruction and PR template.

Definition of Done

All of the required items are completed.
Approval by 1 mentor.

Time Estimation

1 week.

Comparison against TF-IDF Vectorizer (using sklearn)

Description

TF-IDF is one of the most famous algorithms when it comes to keyword extraction from text. Your task is to create a function that will extract keywords from text using the TF-IDF algorithm and compare the results against this library. How similar / different are the results ?

For reference :

For your reference, you may read these :

Folder Structure, Function details

Create a folder tfidf_vectorizer in the root directory. The folder must contain a .py file that will contain the function for extracting the keywords from text using sklearn's TfidfVectorizer.

Structure : tfidf_vectorizer/extract_keywords_tfidf_sklearn.py
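A possible starting point for this file, offered as a sketch and not as the expected solution. It uses scikit-learn's TfidfVectorizer with its built-in English stop word list and returns the highest-scoring terms per document. The function name extract_keywords_tfidf, the top_n parameter and the choice of unigrams plus bigrams are assumptions made for illustration.

```python
def extract_keywords_tfidf(documents, top_n=5):
    """Return the top_n (term, score) pairs for each document using TF-IDF."""
    # Imported lazily so this module can be loaded without scikit-learn installed.
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Assumption: unigrams and bigrams, so multi-word phrases can be compared with RAKE.
    vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
    matrix = vectorizer.fit_transform(documents)
    terms = vectorizer.get_feature_names_out()

    results = []
    for row_index in range(matrix.shape[0]):
        scores = matrix[row_index].toarray().ravel()
        # Indices of the highest scores first; drop terms absent from the document.
        top = scores.argsort()[::-1][:top_n]
        results.append([(terms[i], float(scores[i])) for i in top if scores[i] > 0])
    return results
```

Note that TF-IDF needs several documents to be meaningful, since the inverse document frequency is computed across the whole collection, whereas rake_new2 scores a single text on its own. Any comparison should keep that difference in mind.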

Acceptance Criteria

Code must be properly formatted.
Code must be accompanied by appropriate comments.
File structure must be strictly maintained.
Test cases must be present at the end of the code.
Variables and functions must be properly named
IMPORTANT : Make sure requirements.txt file is updated if you are including any new library.
All instructions provided in the Description must be strictly followed.

Definition of Done

All of the required items are completed.
Approval by 1 mentor.

Time Estimation

1.5 hours

Adding frontend react app

Add a frontend folder that contains a React app.

Enhance Contribution Guidelines

I will add related images and improve the content of the contribution guidelines.

Website logo

Description

I have made a logo for the website. The preview of the same is provided below. Kindly assign me this issue so that I can make a PR and work on it under DWoC. @BALaka-18

For reference

Please review.

Modify README.md

Description

Add the mentioned link in the README file just above the 'Installation' heading. The link should be displayed as 'READ MORE ABOUT RAKE' and should point to the URL : https://monkeylearn.com/keyword-extraction/

Acceptance Criteria

README must be properly formatted.
All instructions provided in the Description must be strictly followed.

Definition of Done

All of the required items are completed.
Approval by 1 mentor.

Time Estimation

10 minutes.

Labels to have a description

Description

It would be useful to add descriptions to the labels so that people understand what they’re about and know when to use them.

Test the current algorithm of rake_new2 to look for edge cases

Description

No algorithm can escape edge cases. Your task is to check and test for probable edge cases where you think the algorithm might fail, by trial and error. Test the library on as many texts as you can.

Read : How to use rake_new2

For example : The previous version of this algorithm couldn't handle HTML tags in text. It was resolved in the current version that you see.

NOTE : This may be a multi-assignee issue

Folder Structure, Function details

Create a folder test_cases in the root directory. The folder must contain a .txt file that will contain all the edge cases that you found, with each edge case in a separate line.

Structure : test_cases/edge_cases_file.txt
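Once such a file exists, its lines can be replayed against the library in one go. Below is a minimal sketch, assuming one edge case per line as described above. The run_edge_cases helper and its extract parameter are hypothetical names; extract is any callable that takes a text and returns keywords.

```python
from pathlib import Path


def run_edge_cases(path, extract):
    """Run `extract` on every non-empty line of the file at `path`.

    Returns a list of (text, outcome) pairs, where outcome is either the
    value returned by `extract` or the exception it raised.
    """
    results = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        text = line.strip()
        if not text:
            continue
        try:
            results.append((text, extract(text)))
        except Exception as exc:  # record the failure and keep going
            results.append((text, exc))
    return results
```

With rake_new2, extract could be a small function that creates a Rake object, calls get_keywords_from_raw_text on the text and returns get_ranked_keywords(), as in the Quick Start.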

Acceptance Criteria

The .txt file must be properly formatted.
All instructions provided in the Description must be strictly followed.

Definition of Done

All of the required items are completed.
Approval by 1 mentor.

Time Estimation

Recurring

Make a UI design for website

Make a UI for the home page of the website using any UI design tool, such as Figma.
Some links for reference:
https://templatemo.com/tm-540-lava-landing-page

Add Contributors.md file

Hi,
I would love to add a Contributors.md file to your project, in the form of a table, and display a link to it in the README.md file.

Create a CONTRIBUTORS.md file

Description

Create a CONTRIBUTORS.md file that must contain the name of the contributors whose PRs get merged.
Format :

Contributor's GitHub profile picture as a thumbnail || Contributor Name(It must be a link to the contributor's GitHub profile) || Merged PR number.

Acceptance Criteria

CONTRIBUTORS.md file must be properly formatted.
All instructions provided in the Description must be strictly followed.

Definition of Done

All of the required items are completed.
Approval by 1 mentor.

Time Estimation

Recurring process

Comparison against TF-IDF Vectorizer (from scratch)

Description

NOTE : You have to build the Tf-idf algorithm for keyword extraction from scratch. You will then compare its performance against sklearn's TfidfVectorizer and rake_new2.

For reference :

For your reference, you may read this link

Folder Structure, Function details

Create a folder tfidf_vectorizer in the root directory. The folder must contain a .py file that will contain the function for extracting the keywords from text using the Tfidf algorithm written from scratch.

Structure : tfidf_vectorizer/extract_keywords_tfidf_scratch.py
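As a rough idea of what "from scratch" could look like, here is a sketch using only the standard library. It uses relative term frequency and the smoothed inverse document frequency ln((1 + N) / (1 + df)) + 1, which is also scikit-learn's default IDF formula. It skips stopword removal and L2 normalization, so its scores will not match TfidfVectorizer exactly. The names tokenize and tfidf_keywords are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(documents, top_n=3):
    """Return the top_n (term, score) pairs per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        scores = {
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        results.append(ranked[:top_n])
    return results


docs = ["red apples taste good", "green apples taste sour"]
top = tfidf_keywords(docs, top_n=2)
print([term for term, _ in top[0]])  # ['good', 'red']
```

Words shared by both documents ('apples', 'taste') get the minimum IDF of 1, so the words unique to each document rank first.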

Acceptance Criteria

Code must be properly formatted.
Code must be accompanied by appropriate comments.
File structure must be strictly maintained.
Test cases must be present at the end of the code.
Variables and functions must be properly named
IMPORTANT : Make sure requirements.txt file is updated if you are including any new library.
All instructions provided in the Description must be strictly followed.

Definition of Done

All of the required items are completed.
Approval by 1 mentor.

Time Estimation

2.5-3 hours (or more if needed)

Enable GitHub Actions for code coverage.

Description

Integrate the most appropriate GitHub Action for automatically generating a code coverage report on every code-related PR.

NOTE : Once assigned, please comment here as to which GitHub Action you're going to integrate before creating a PR. I will approve the Action, only then you may integrate it.

Acceptance Criteria

All instructions provided in the Description must be strictly followed.

Definition of Done

All of the required items are completed.
Approval by 1 mentor.

Time Estimation

20 minutes

CODE OF CONDUCT

A code of conduct is very important when many people contribute to a single project.
Here is a description of what the code of conduct will cover:

A pledge that everyone must follow, so that this community is a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, etc., while contributing. The pledge also covers decorum in conversations on issues.
Standards, such as:
- Examples of unacceptable behavior by participants
- Examples of behavior that contributes to creating a positive environment
Responsibilities
How to get in touch if anyone faces issues with other contributors
Scope of the code of conduct, e.g. that it applies both within project spaces and in public spaces
These are the things I will be adding, in a slightly more descriptive way.
Let me know if you can assign this to me @BALaka-18

Improve Contributor.md

Create Issue Template

Hi,
I would love to add issue templates for your repository, covering the following issue types: bug, documentation, feature, proposal and question.
