patmejia / spacy-llm

🦙🪐 fusion of spacy's supervised learning or rule-based components; spacy-llms engaged: text processing, entity extraction & summaries

License: MIT License

Python 100.00%

pytest pytextrank spacy-llm spacy-nlp textrank

spacy-llm's Issues

Import "spacy" could not be resolved from source Pylance(reportMissingModuleSource)

Debug:

Terminal

Create a virtual environment:

python -m venv .env

Activate the virtual environment:

source .env/bin/activate  # Unix/Linux/Mac
.env\Scripts\activate.bat  # Windows

Or, with conda, activate the environment (if not already active):

conda activate spacy-llm

Install the spacy-llm package using conda:

conda install spacy-llm

Validate the installation (checks that installed pipeline packages are compatible with the installed spaCy version):

python -m spacy validate

In VS Code:

cmd + p
> Python: Select Interpreter + return

Select the interpreter from the environment created above, at workspace level, so that Pylance resolves spacy from that environment.
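The Pylance warning usually means the editor is pointed at a different interpreter than the one spacy was installed into. A minimal sketch for checking this from Python itself, using only the standard library; the helper name `describe_environment` is made up for illustration. Run it with the interpreter VS Code has selected:

```python
import importlib.util
import sys


def describe_environment(module_name="spacy"):
    """Report which interpreter is running and whether it can find module_name."""
    spec = importlib.util.find_spec(module_name)
    return {
        "interpreter": sys.executable,
        "module": module_name,
        "found": spec is not None,
        "location": spec.origin if spec is not None else None,
    }


if __name__ == "__main__":
    for key, value in describe_environment("spacy").items():
        print(f"{key}: {value}")
```

If `found` is False, or `interpreter` is not the path inside `.env` (or the conda environment), the editor and the terminal are likely using different environments.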

AttributeError: module 'pytextrank' has no attribute 'TextRank'

to reproduce the error, run:

def summarize_text_returns_expected_summary(nlp, text):
    doc = process_text(nlp, text)
    if 'textrank' not in nlp.pipe_names:
        tr = pytextrank.TextRank()
        nlp.add_pipe(tr.PipelineComponent, name="textrank", last=True)
    doc = nlp(text)
    return [str(sent) for sent in doc._.textrank.summary(limit_phrases=15, limit_sentences=5)]

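note on the if statement: it checks whether textrank is already in the pipeline. without it, calling the function a second time with the same nlp object tries to add a component named "textrank" again, and spacy raises a ValueError because that name already exists in the pipeline.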
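a minimal sketch of why the guard matters. it uses a blank english pipeline and spacy's built-in "sentencizer" as a stand-in for textrank, so it runs without pytextrank or a downloaded model; the helper name `ensure_pipe` is made up for illustration.

```python
import spacy


def ensure_pipe(nlp, name):
    """Add the named component only if the pipeline does not have it yet."""
    if name not in nlp.pipe_names:
        nlp.add_pipe(name)
    return nlp


nlp = spacy.blank("en")
ensure_pipe(nlp, "sentencizer")
ensure_pipe(nlp, "sentencizer")  # second call is a no-op thanks to the guard

# Without the guard, adding the same name twice raises ValueError.
try:
    nlp.add_pipe("sentencizer")
    duplicate_error = None
except ValueError as err:
    duplicate_error = err

print(nlp.pipe_names)
print("duplicate add raised:", type(duplicate_error).__name__)
```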

error:

AttributeError: module 'pytextrank' has no attribute 'TextRank'

fix:

step_1

check pytextrank installation

pip list | grep pytextrank
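grep is not available in the default windows shell. a portable sketch of the same check using the standard library (the helper name `installed_version` is made up for illustration):

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name in ("spacy", "pytextrank"):
        print(name, installed_version(name) or "not installed")
```

knowing the version helps here, because the error depends on which pytextrank release is installed.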

step_2

replace:

tr = pytextrank.TextRank()
nlp.add_pipe(tr.PipelineComponent, name="textrank", last=True)

with:

nlp.add_pipe("textrank")

updated code:

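import pytextrank  # importing registers the "textrank" factory with spacy

def summarize_text_returns_expected_summary(nlp, text):
    doc = process_text(nlp, text)
    # add the component only once; add_pipe raises ValueError on a duplicate name
    if 'textrank' not in nlp.pipe_names:
        nlp.add_pipe("textrank")
    doc = nlp(text)
    return [str(sent) for sent in doc._.textrank.summary(limit_phrases=15, limit_sentences=5)]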

why?

spacy pipeline: sequence of processing steps (tokenization, POS tagging, NER).

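the incorrect code instantiates pytextrank.TextRank() by hand, then passes its PipelineComponent to add_pipe. the AttributeError shows that the installed pytextrank no longer exposes a TextRank class, so this older style fails before anything reaches the pipeline.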

tr = pytextrank.TextRank()
nlp.add_pipe(tr.PipelineComponent, name="textrank", last=True)

correct code:

nlp.add_pipe("textrank")

importing pytextrank registers a "textrank" factory with spacy; nlp.add_pipe("textrank") looks the component up by that name, creates it, and adds it to the pipeline.

once textrank is in the pipeline, its results are available through the ._ extension attributes on documents (e.g., doc._.textrank.summary() and doc._.phrases).
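a minimal sketch of the same mechanism with a toy component, to show how a registered name and a ._ attribute fit together. this is not pytextrank's code: the component name "token_counter" and the attribute token_count are made up for illustration, and a blank english pipeline is used so no model download is needed.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Register a custom attribute so it becomes reachable as doc._.token_count.
if not Doc.has_extension("token_count"):
    Doc.set_extension("token_count", default=None)


# Register the function under a string name, which add_pipe can look up.
@Language.component("token_counter")
def token_counter(doc):
    doc._.token_count = len(doc)
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("token_counter")  # added by name, like nlp.add_pipe("textrank")

doc = nlp("PyTextRank ranks phrases.")
print(nlp.pipe_names)
print(doc._.token_count)
```

pytextrank does the registration step when it is imported, which is why `import pytextrank` must run before nlp.add_pipe("textrank").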

notes on `module 'pytextrank' has no attribute 'parse_doc'`

since:

the error msg indicates that the parse_doc function is not found in the installed pytextrank module, most likely because it was removed in a later release of the library.

a parser is often a necessary component in an NLP pipeline, and pytextrank works on text that the spacy pipeline has already analyzed.

do instead:

load a spacy model that includes a parser, and add pytextrank to its pipeline.

e.g. the spacy small english model en_core_web_sm tokenizes, tags, and parses the text before textrank runs.

example:

import spacy
import pytextrank  # importing registers the "textrank" factory with spacy

def get_top_ranked_phrases(text):
    # requires the model: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    nlp.add_pipe("textrank")
    doc = nlp(text)

    top_phrases = []

    for phrase in doc._.phrases:
        top_phrases.append({
            "text": phrase.text,
            "rank": phrase.rank,
            "count": phrase.count,
            "chunks": phrase.chunks
        })

    return top_phrases

sample_text = 'I Like Flipkart. He likes Amazone. she likes Snapdeal. Flipkart and amazone is on top of google search.'

top_phrases = get_top_ranked_phrases(sample_text)

for phrase in top_phrases:
    print(phrase["text"], phrase["rank"], phrase["count"], phrase["chunks"])

output:

code notes:

✔︎ load spacy small english model

✔︎ add pytextrank to pipeline

✔︎ store the top-ranked phrases

✔︎ examine the top-ranked phrases in the document

✔︎ print the top-ranked phrases
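if the phrases need to be saved or passed on as json, the chunks have to be converted first, because they hold spacy Span objects that json cannot encode. a minimal sketch under that assumption: the helper name `phrases_to_json` is made up for illustration, and the stand-in objects and their rank values are invented so the snippet runs without a model.

```python
import json
from types import SimpleNamespace


def phrases_to_json(phrases, limit=10):
    """Serialize the first `limit` phrase-like objects to a JSON string."""
    records = [
        {
            "text": phrase.text,
            "rank": phrase.rank,
            "count": phrase.count,
            # convert each chunk to plain text so json can encode it
            "chunks": [str(chunk) for chunk in phrase.chunks],
        }
        for phrase in list(phrases)[:limit]
    ]
    return json.dumps(records, indent=2)


# Invented stand-ins with the same attributes as a pytextrank phrase;
# with a real pipeline, pass doc._.phrases instead.
fake_phrases = [
    SimpleNamespace(text="google search", rank=0.12, count=1, chunks=["google search"]),
    SimpleNamespace(text="Flipkart", rank=0.1, count=2, chunks=["Flipkart", "Flipkart"]),
]

print(phrases_to_json(fake_phrases, limit=1))
```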

link to repo: https://github.com/patmejia/spacy-llm

thanks to:

-Paco Nathan
-DerwenAI
-Victoria Stuart
-spacy-pytextrank
-textrank: bringing order into text
-keywords and sentence extraction with textrank (pytextrank)
-module 'pytextrank' has no attribute 'parse_doc' (translated from Chinese)
-module-pytextrank-has-no-attribute-parse-doc
-scattertext/issues/92
