rake_new2's Introduction


ABOUT THIS PROJECT

rake_new2 is a Python library for simple, fast keyword extraction from any text. It helps beginners, and anyone unsure how to find keywords, understand which keywords are more important.

HOW IS THIS DIFFERENT FROM ANY OTHER ALGORITHM? This library returns a weight (score) along with each keyword or key-phrase, which helps you pick out the right key-phrases: choose the ones with the higher weights.
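
Choosing the highest-weighted key-phrases can be sketched in a few lines. This is a minimal illustration, assuming the scores arrive as (score, phrase) pairs like those shown in the Quick Start; `top_keyphrases` is a hypothetical helper and not part of rake_new2.

```python
def top_keyphrases(scored, n=2):
    """Return the n phrases with the highest scores.

    `scored` is any iterable of (score, phrase) pairs. Ties are broken
    alphabetically so the result is deterministic.
    """
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [phrase for _score, phrase in ranked[:n]]


# Sample pairs copied from the Quick Start output.
scored = [(1.0, "taste"), (1.0, "good"), (4.0, "red apples")]
print(top_keyphrases(scored, n=1))  # ['red apples']
```

Sorting on the negated score keeps the highest weights first without a separate reverse step.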

New in version 1.0.5

  1. Handles repetitive keywords/key-phrases

  2. Handles consecutive punctuation marks.

  3. Handles HTML tags in text: the user can choose whether to keep HTML tags as keywords too.

Installation

Use the package manager pip to install rake_new2.

pip install rake_new2

Quick Start

from rake_new2 import Rake

text = "Red apples are good in taste."
text2 = "<h1> Hello world !</h1>"
rk, rk_new1, rk_new2 = Rake(), Rake(keep_html_tags=True), Rake(keep_html_tags=False)

# Case 1
# Extract keywords from the raw text
rk.get_keywords_from_raw_text(text)
kw_s = rk.get_keywords_with_scores()
# Returns keywords with degree scores : {(1.0, 'taste'), (1.0, 'good'), (4.0, 'red apples')}
kw = rk.get_ranked_keywords()
# Returns keywords only : ['red apples', 'taste', 'good']
f = rk.get_word_freq()
# Returns word frequencies as a Counter object : {'red': 1, 'apples': 1, 'good': 1, 'taste': 1}
deg = rk.get_kw_degree()
# Returns word degrees as defaultdict object : {'red': 2.0, 'apples': 2.0, 'good': 1.0, 'taste': 1.0}

# Case 2 : Sample case for testing the 'keep_html_tags' parameter. Default = False
print("\nORIGINAL TEXT : {}".format(text2))
# Sub Case 1 : Keeping the HTML tags
rk_new1.get_keywords_from_raw_text(text2)
kw_s1 = rk_new1.get_keywords_with_scores()
kw1 = rk_new1.get_ranked_keywords()
print("Keeping the tags : ",kw1)

# Sub Case 2 : Eliminating the HTML tags
rk_new2.get_keywords_from_raw_text(text2)
kw_s2 = rk_new2.get_keywords_with_scores()
kw2 = rk_new2.get_ranked_keywords()
print("Eliminating the tags : ",kw2)

'''OUTPUT >>
ORIGINAL TEXT : <h1> Hello world !</h1>
Keeping the tags :  {'h1', 'hello'}
Eliminating the tags :  {'hello world'}
'''
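
The numbers in Case 1 are consistent with a degree-based score: each word's degree is the total length of the phrases it appears in, and a phrase's score is the sum of its words' degrees. The sketch below reproduces those numbers under that assumption. It is not taken from the library's source, and `score_phrases` is a hypothetical name.

```python
from collections import defaultdict


def score_phrases(phrases):
    """Return (word degrees, phrase scores) for a list of candidate phrases."""
    degree = defaultdict(float)
    for phrase in phrases:
        words = phrase.split()
        for word in words:
            # Each occurrence adds the length of the phrase the word sits in.
            degree[word] += len(words)
    scores = {
        phrase: sum(degree[word] for word in phrase.split())
        for phrase in phrases
    }
    return dict(degree), scores


degree, scores = score_phrases(["red apples", "good", "taste"])
print(degree)  # {'red': 2.0, 'apples': 2.0, 'good': 1.0, 'taste': 1.0}
print(scores)  # {'red apples': 4.0, 'good': 1.0, 'taste': 1.0}
```

Classic RAKE divides each word's degree by its frequency before summing. That gives the same result here because every word occurs once.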

Debugging

You might come across a stopwords error.

It means that the stopwords corpus has not been downloaded from NLTK.

To download it, run the command below.

python -c "import nltk; nltk.download('stopwords')"
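
The same check can be made from Python before using the library. This is a minimal sketch, assuming NLTK is installed; `ensure_stopwords` is a hypothetical helper and not part of rake_new2. It relies on `nltk.data.find` raising `LookupError` when a resource is missing, and its two parameters exist only so the logic can be exercised without network access.

```python
def ensure_stopwords(find=None, download=None):
    """Download the NLTK stopwords corpus only if it is missing.

    Returns True if a download was triggered, False if the corpus
    was already available.
    """
    if find is None or download is None:
        import nltk  # imported lazily so the helper can be tested without NLTK

        find = find or nltk.data.find
        download = download or nltk.download
    try:
        find("corpora/stopwords")
        return False
    except LookupError:
        download("stopwords")
        return True
```

Calling `ensure_stopwords()` once at program start avoids the error on a fresh machine.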

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

License

MIT

Contributors

| Student Name | GitHub ID | Merged PR No. | Open source programme name | If DWOC, level of PR |
| --- | --- | --- | --- | --- |
| Sabarish Rajamohan | sabarish98 | #16 | Hacktoberfest | -- |
| Soham Kar | 2bit-hack | #20 | Hacktoberfest | -- |
| Jawen Voon | jawsvk | #26 | Hacktoberfest | -- |
| Ananthakrishnan Nair RS | akrish4 | #47 | DWOC | Level-1 |
| Tushar Nankani | tusharnankani | #43 | DWOC | Level-3 |

rake_new2's Issues

Stopwords filter

Description

Edge case: Because keywords are formed mainly by splitting the text at stopwords, in some cases the extracted keywords do not capture the meaning of the text.

For example, if the text is "I like sweet apples but I don't like sour apples", the extracted keywords will be 'I', 'like', 'sweet', 'sour', 'apples', with 'apples' shown as the highest-priority keyword. But the meaning changes completely if we summarize the text from these keywords alone.

This issue asks you to find a way around this problem, or to brainstorm with me and other interested contributors.
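
The loss of meaning is easy to reproduce with a stripped-down version of stopword splitting. The sketch below is an illustration only: the three-word stopword list is a hypothetical stand-in for a real stopword corpus, `candidate_phrases` is not a rake_new2 function, and its output will not match the library's exactly.

```python
import re

# Tiny illustrative stopword list (an assumption, not the library's list).
STOPWORDS = {"i", "but", "don't"}


def candidate_phrases(text, stopwords=STOPWORDS):
    """Split text into candidate phrases at every stopword."""
    words = re.findall(r"[a-z']+", text.lower())
    phrases, current = [], []
    for word in words:
        if word in stopwords:
            # A stopword ends the current phrase and is itself discarded.
            if current:
                phrases.append(" ".join(current))
            current = []
        else:
            current.append(word)
    if current:
        phrases.append(" ".join(current))
    return phrases


print(candidate_phrases("I like sweet apples but I don't like sour apples"))
# ['like sweet apples', 'like sour apples']
```

Because "don't" is dropped as a stopword, both candidates read as positive statements, which is the problem this issue describes.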

Read : How to use rake_new2

NOTE : This may be a multi-assignee issue

Folder Structure, Function details

Create a folder algorithm_addons in the root directory and write a Python function that works around this problem. If I approve it, I will add the function directly to the rake_new2.py main file, with the contributor's name above the function.

Example naming convention : algorithm_addons/stopwords_debug.py

Acceptance Criteria

  • The .py file must be properly formatted.
  • All instructions provided in the Description must be strictly followed.

Definition of Done

  • All of the required items are completed.
  • Approval by at least 1 mentor.

Time Estimation

Recurring

Literature study on different keyword extraction algorithms

Description

Create an extensive, in-depth literature study of keyword extraction algorithms other than RAKE and Tf-Idf. Every algorithm must be accompanied by a brief logical / mathematical explanation and examples (in text or in the form of pictures / diagrams).

File structure

Create a Literature_Survey.md file in the root directory.

Acceptance Criteria

  • All instructions provided in the Description must be strictly followed.

Definition of Done

  • All of the required items are completed.
  • Approval by 1 mentor.

Time Estimation

5 days - 1 week.

Add Hacktoberfest logo to README.md

Description

Add the Hacktoberfest logo to the very start of the README file and add the link to the Contributing_Guidelines.md file and the Issues tab in the repository. The links should read :

  1. 'CLICK HERE TO START CONTRIBUTING' --> Link to the Issues tab.

  2. 'READ THE CONTRIBUTING GUIDELINES' --> Link to the Contributing_Guidelines.md file

Acceptance Criteria

  • README must be properly formatted.
  • All instructions provided in the Description must be strictly followed.

Definition of Done

  • All of the required items are completed.
  • Approval by 1 mentor.

Time Estimation

10 minutes.

cleaning the repository and adding a .gitignore

For now, run and build artifacts are inside the repository:
- build
- dist
- __pycache__
- egg-info
They should be removed from the repository.
To keep unwanted build and run artifacts from being committed in the future, we should add a .gitignore file.
I propose taking the Python project gitignore from https://www.toptal.com/developers/gitignore. It's the classic .gitignore file for Python projects.

If you are ok with those changes I have a branch ready for a pull request!

Shields links are not correct

The shields for PyPI, issues, forks and stars should link to the corresponding pages on PyPI and GitHub. For now they just link to the images.

UPDATE AND IMPROVE README LOOK

Hi @BALaka-18

Whenever someone visits a project repository, they look at the README file first, as it is the most integral part of the project.
The first lines of a README (the project heading, a description, and the badges) are what make it look great. This project's README has no heading with the project name, the badges are not centered, and only a few badges are used. Let me know if you would like me to improve it and give it a better look.

I would be glad to work on it @BALaka-18

Kindly assign it to me.

I can add a heading and a description, add a few more badges, and then center them all.

I hope to work on it and earn a good level for it.

Create a Pull request Template file.

Description

Create a PULL_REQUEST_TEMPLATE.md file that must contain the skeleton of a PR description and generate the template each time a PR is created.

If you don't know what this template is or how to create one, read this link on creating pull request templates.

Acceptance Criteria

  • PULL_REQUEST_TEMPLATE.md file must be properly created.
  • All instructions provided in the Description must be strictly followed.

Definition of Done

  • All of the required items are completed.
  • Approval by 1 mentor.

Time Estimation

20 minutes

Basic frontend for public web app

Description

Create the frontend of a web application that'll be used to make the library accessible to the users via the web.

The color scheme is up to you.

The web app must contain :

  1. A text area for the users to type in the text.
  2. A button that allows users to upload text files if they don't want to type.
  3. Two radio buttons : 1. Keep HTML, 2. Don't keep HTML
  4. A dropdown list with options : 1. Get keywords only, 2. Get keywords with scores, 3. Get top 5 keywords.
  5. Two radio buttons : 1. Show top 5 most frequent words, 2. Don't show frequent words.
  6. A button that links people to the official PyPi page of rake_new2.
  7. A button that has the text : Click to extract keywords.
  8. ALL SEVEN ITEMS MUST BE WRAPPED INSIDE ONE BOX THAT WILL BE CENTERED.
  9. HEADING : WELCOME TO rake_new2 : A PYTHON LIBRARY THAT HELPS IN SMOOTH KEYWORD EXTRACTION.
  10. ON THE TOP RIGHT CORNER, THERE SHOULD BE A DESIGN LIKE THIS(any color) :
    image
    INSIDE THIS SHOULD BE A LOGO OF GITHUB, AND THIS TRIANGLE SHOULD LINK TO : https://github.com/BALaka-18/rake_new2/issues
  11. Below the box(the box stated in point 8.), there should be a bold, legible text that says : Want to contribute ? Have a better idea to enhance our library ? Click on the top right corner of this page.

File structure

Create the files according to convention.

--> All HTML files must be under : web_app/templates/
--> All CSS and JS(if any) files under : web_app/static/

PR INSTRUCTION :

ALL PRs MUST BE MADE TO THE web-app BRANCH ONLY, ELSE THEY WILL BE REJECTED.

Acceptance Criteria

  • All instructions provided in the Description must be strictly followed.
  • Must be neat and formal.
  • All criteria must be satisfied.
  • Must be functioning.
  • PR must follow PR instruction and PR template.

Definition of Done

  • All of the required items are completed.
  • Approval by 1 mentor.

Time Estimation

1 week.

Comparison against TF-IDF Vectorizer (using sklearn)

Description

TF-IDF is one of the most famous algorithms when it comes to keyword extraction from text. Your task is to create a function that will extract keywords from text using the TF-IDF algorithm and compare the results against this library. How similar / different are the results ?

For reference :

For your reference, you may read these :

  1. Keyword extraction
  2. TF-IDF Vectorizer - Sklearn docs

Folder Structure, Function details

Create a folder tfidf_vectorizer in the root directory. The folder must contain a .py file that will contain the function for extracting the keywords from text using sklearn's TfidfVectorizer.

Structure : tfidf_vectorizer/extract_keywords_tfidf_sklearn.py

Acceptance Criteria

  • Code must be properly formatted.
  • Code must be accompanied by appropriate comments.
  • File structure must be strictly maintained.
  • Test cases must be present at the end of the code.
  • Variables and functions must be properly named
  • IMPORTANT : Make sure requirements.txt file is updated if you are including any new library.
  • All instructions provided in the Description must be strictly followed.

Definition of Done

  • All of the required items are completed.
  • Approval by 1 mentor.

Time Estimation

1.5 hours

Website logo

Description

I have made a logo for the website. The preview of the same is provided below. Kindly assign me this issue so that I can make a PR and work on it under DWoC. @BALaka-18

For reference

Please review.

Modify README.md

Description

Add the mentioned link in the README file just above the 'Installation' heading. The link should be displayed as 'READ MORE ABOUT RAKE' and should point to the URL : https://monkeylearn.com/keyword-extraction/

Acceptance Criteria

  • README must be properly formatted.
  • All instructions provided in the Description must be strictly followed.

Definition of Done

  • All of the required items are completed.
  • Approval by 1 mentor.

Time Estimation

10 minutes.

Labels to have a description

Description

It would be useful to add descriptions to the labels so that people understand what they’re about and know when to use them.

Test the current algorithm of rake_new2 to look for edge cases

Description

No algorithm can escape edge cases. Your task is to check and test for probable edge cases where you think the algorithm might fail, by trial and error. Test the library on as many texts as you can.

Read : How to use rake_new2

For example : The previous version of this algorithm couldn't handle HTML tags in text. It was resolved in the current version that you see.

NOTE : This may be a multi-assignee issue

Folder Structure, Function details

Create a folder test_cases in the root directory. The folder must contain a .txt file listing all the edge cases that you found, with each edge case on a separate line.

Structure : test_cases/edge_cases_file.txt
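
A file in this one-case-per-line format can be replayed against any extractor with a small harness. This is a sketch under stated assumptions: `run_edge_cases` is a hypothetical helper, and `extract` stands for whatever callable wraps the library calls you want to test.

```python
def run_edge_cases(cases, extract):
    """Run extract() on each non-empty case and collect the failures.

    `cases` is an iterable of strings, for example the lines of
    test_cases/edge_cases_file.txt. Returns a list of (text, error) pairs.
    """
    failures = []
    for case in cases:
        text = case.strip()
        if not text:
            continue  # skip blank lines in the file
        try:
            extract(text)
        except Exception as exc:  # record the failure instead of stopping
            failures.append((text, repr(exc)))
    return failures
```

An empty result means no case raised an exception. It does not show that the extracted keywords are meaningful, so the output still needs to be inspected by hand.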

Acceptance Criteria

  • The .txt file must be properly formatted.
  • All instructions provided in the Description must be strictly followed.

Definition of Done

  • All of the required items are completed.
  • Approval by 1 mentor.

Time Estimation

Recurring

Add Contributors.md file

Hi,
I would love to add a Contributors.md file to your project in the form of a table, and display a link to it in the README.md file.

Create a CONTRIBUTORS.md file

Description

Create a CONTRIBUTORS.md file that contains the names of the contributors whose PRs get merged.
Format :

Contributor's GitHub profile picture as a thumbnail || Contributor Name(It must be a link to the contributor's GitHub profile) || Merged PR number.

Acceptance Criteria

  • CONTRIBUTORS.md file must be properly formatted.
  • All instructions provided in the Description must be strictly followed.

Definition of Done

  • All of the required items are completed.
  • Approval by 1 mentor.

Time Estimation

Recurring process

Comparison against TF-IDF Vectorizer (from scratch)

Description

TF-IDF is one of the most famous algorithms when it comes to keyword extraction from text. Your task is to create a function that will extract keywords from text using the TF-IDF algorithm and compare the results against this library. How similar / different are the results ?

NOTE : You have to build the Tf-idf algorithm for keyword extraction from scratch. You will then compare its performance against sklearn's TfidfVectorizer and rake_new2.

For reference :

For your reference, you may read this link

Folder Structure, Function details

Create a folder tfidf_vectorizer in the root directory. The folder must contain a .py file that will contain the function for extracting the keywords from text using the Tfidf algorithm written from scratch.

Structure : tfidf_vectorizer/extract_keywords_tfidf_scratch.py
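
As a starting point, a from-scratch TF-IDF ranking can be sketched with the standard library alone. This is a minimal sketch, assuming whitespace tokenization and the plain formulas tf = count / document length and idf = ln(N / document frequency); `tfidf_keywords` is a hypothetical name and not part of the repository.

```python
import math
from collections import Counter


def tfidf_keywords(documents, doc_index, top_n=3):
    """Rank the terms of one document by TF-IDF against the whole corpus."""
    tokenized = [doc.lower().split() for doc in documents]

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    n_docs = len(tokenized)
    tokens = tokenized[doc_index]
    scores = {}
    for term, count in Counter(tokens).items():
        tf = count / len(tokens)
        idf = math.log(n_docs / doc_freq[term])
        scores[term] = tf * idf

    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_n]


docs = ["red apples are good", "green apples are sour"]
print([term for term, _ in tfidf_keywords(docs, 0, top_n=2)])  # ['good', 'red']
```

Terms that occur in every document score zero. By default, sklearn's TfidfVectorizer smooths the IDF and L2-normalizes each row, so its scores will differ from these even when the ranking agrees.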

Acceptance Criteria

  • Code must be properly formatted.
  • Code must be accompanied by appropriate comments.
  • File structure must be strictly maintained.
  • Test cases must be present at the end of the code.
  • Variables and functions must be properly named
  • IMPORTANT : Make sure requirements.txt file is updated if you are including any new library.
  • All instructions provided in the Description must be strictly followed.

Definition of Done

  • All of the required items are completed.
  • Approval by 1 mentor.

Time Estimation

2.5-3 hours (or more if needed)

Enable GitHub Actions for code coverage.

Description

Integrate the most appropriate GitHub Action for automatic generation of code coverage report on every code related PR made.

NOTE : Once assigned, please comment here to say which GitHub Action you're going to integrate before creating a PR. You may integrate it only after I approve the Action.

Acceptance Criteria

  • All instructions provided in the Description must be strictly followed.

Definition of Done

  • All of the required items are completed.
  • Approval by 1 mentor.

Time Estimation

20 minutes

CODE OF CONDUCT

We know a code of conduct is very important when many people contribute to a single project.
Here is a description of what the code of conduct will cover:

  • A pledge that everyone must follow, so that the community is a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, etc. The pledge also covers decorum in conversations on issues.
  • Standards, such as:
    - Examples of unacceptable behavior by participants
    - Examples of behavior that contributes to creating a positive environment
  • Responsibilities
  • How to make contact if anyone faces issues with other contributors
  • The scope of the code of conduct, for example that it applies both within project spaces and in public spaces

These are the things I will be adding, in a little more detail.
Let me know if you can assign this to me @BALaka-18

Create Issue Template

Hi,
I would love to add issue templates to your repository. The templates would cover these issue types: bug, documentation, feature, proposal and question.
