retextjs / retext-keywords
plugin to extract keywords and key-phrases
Home Page: https://retextjs.github.io/retext-keywords
License: MIT License
I think the problem is that the returned transformer function gets file as its second parameter, whereas at the head version of retext the transformer is called with options (null) instead.
Hi, great project btw!
Slight issue using same word in the title, eg:
"How to Get best More Likes on Your Facebook Page " -> "keywords": [ "facebook", "page" ].
But if you put page in the title twice:
"How to Get best More page Likes on Your Facebook Page" -> "keywords": [ "page,page" ].
// `headline` and `cb` come from the enclosing scope; `_` is assumed to be lodash or underscore.
var Retext = require('retext'),
    visit = require('retext-visit'),
    keywords = require('retext-keywords'),
    sentiment = require('retext-sentiment');
var rt = new Retext().use(visit).use(keywords).use(sentiment);
rt.parse(headline, function (err, tree) {
  if (err) return cb(err);
  if (tree.length == 0) {
    return cb(new Error('Error loading the data!'));
  }
  var s = [];
  _.forEach(tree.keywords({ 'minimum': 1 }), function (n) {
    s.push(n.nodes.toString());
  });
});
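A likely explanation for the "page,page" output, offered as a sketch rather than a confirmed diagnosis: each keyword seems to carry every matching node in its `nodes` array, and calling `toString()` on an array joins its entries with commas. The objects below are hypothetical stand-ins for the real nodes, and `fakeNode` is a made-up helper.

```javascript
// Hypothetical stand-in for a node: it only needs a toString() method here.
function fakeNode(value) {
  return {toString: function () { return value; }};
}

// One keyword ("page") that occurs twice in the headline.
const keyword = {nodes: [fakeNode('page'), fakeNode('page')]};

// Array#toString joins all entries with ",".
const joined = keyword.nodes.toString();
// Taking a single occurrence avoids the joined form.
const first = keyword.nodes[0].toString();

console.log(joined); // "page,page"
console.log(first);  // "page"
```

If that is what happens, pushing `n.nodes[0].toString()` instead of `n.nodes.toString()` would give one entry per keyword.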
Hello! Thanks for this project, it's very useful. I've been using it to extract keywords from app titles and descriptions at Google Play and iTunes, and I've found that some words that I think should be considered stopwords, but have apostrophes in them, aren't filtered by this plugin. Some examples are: it's, you'll, you're, we'll, can't, won't, etc.
For now I'm just adding a string replacement for every new case I find before sending it to the processor, but I was wondering if there's a more generic way to filter them out.
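One generic approach, sketched here as a pre-processing step rather than a feature of the plugin, is to normalize curly apostrophes and drop any token that ends in a common contraction suffix before the text reaches the processor. The helper name `stripContractions` and the suffix list are assumptions made for illustration.

```javascript
// Common English contraction suffixes (an illustrative, incomplete list).
const CONTRACTION = /^[a-z]+(?:'(?:s|ll|re|ve|d|m)|n't)$/i;

// Hypothetical helper: removes contracted tokens such as "it's" or "won't".
function stripContractions(text) {
  return text
    .replace(/[\u2018\u2019]/g, "'") // curly quotes -> straight apostrophe
    .split(/\s+/)
    .filter(function (token) {
      // Ignore trailing punctuation when testing the token.
      const bare = token.replace(/[.,!?;:]+$/, '');
      return !CONTRACTION.test(bare);
    })
    .join(' ');
}

console.log(stripContractions("It’s the best app you'll find, and it won't crash."));
// "the best app find, and it crash."
```

Note that the `'s` suffix also removes possessives such as "night's", and that a removed token takes its trailing punctuation with it, so the list may need tuning for a given corpus.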
retext-keyword
sqlite> select distinct(tag) from tags where tag LIKE 'it%';
it's
italian
item
items
itext
itoco
it’ll
it’s
it’s-leadership
Here are some URLs that produce "it's" as a keyword (I'm passing the HTML to deno-dom, then to moz-readability to get article.content and article.title):
it's|https://www.bbc.co.uk/news/science-environment-59268393
it's|https://www.cbsnews.com/news/china-grows-as-campaign-theme-during-coronavirus-pandemic/
it's|https://www.cbsnews.com/news/classic-cars-electric-vehicles-london-mechanic/
it's|https://www.cnbc.com/2021/11/18/inside-cornings-new-vaccine-vial-factory-in-north-carolina.html
it's|https://www.cnn.com/travel/article/beautiful-towns-europe/index.html
it's|https://www.cnn.com/travel/article/dead-sea-shrinks-as-jordan-turns-tide-on-tourism/index.html
it's|https://www.cnn.com/travel/article/uk-tourism-decline-restrictions-cmd/index.html
it's|https://www.cracked.com/article_31747_canadian-children-marched-to-protest-the-rising-price-of-candy.html
it's|https://www.euronews.com/green/2021/11/18/climate-misinformation-is-getting-more-sophisticated-and-experts-say-cop26-progress-could-
it's|https://www.firstshowing.net/2021/watch-remember-a-visual-poem-film-about-interconnectedness/
it's|https://www.freep.com/story/news/local/michigan/2021/11/18/ann-arbor-ordinance-tampons-pads-all-public-bathrooms/8652533002/
it's|https://www.globalcryptopress.com/2021/10/bitcoin-network-holds-over-1-trillion.html
it's|https://www.inc.com/anna-meyer/jennifer-fleiss-rent-the-runway-jetblack-volition-brands.html
it's|https://www.inc.com/joe-sanok/psychologist-joe-sanok-reveals-the-best-parts-of-his-new-book-thursday-is-the-new-friday.html
it's|https://www.inc.com/suzanne-lucas/osha-wont-enforce-covid-rules-pending-court-prepare-anyway.html
it's|https://www.neatorama.com/2021/11/18/Every-Picture-Tells-a-Story-This-One-is-a-Romantic-Comedy/
it's|https://www.npr.org/2021/11/16/1056263648/pfizer-says-it-will-share-the-rights-to-its-covid-19-pill
it's|https://www.npr.org/2021/11/17/1056646740/la-palma-volcano-brings-both-destruction-and-renewal-to-the-island
it's|https://www.rt.com/sport/540633-djokovic-covid-vaccine-status/
it's|https://www.slashfilm.com/664323/jennifer-coolidge-will-star-in-ryan-murphys-the-watcher-tv-series-for-netflix/
it's|https://www.wired.com/gallery/25-amazing-holiday-gift-ideas-under-25-2021/
it's|https://www.wired.com/story/best-black-friday-outdoors-deals-rei-2021/
it's|https://www.wired.com/story/best-buy-early-black-friday-deals-2021-2/
it's|https://www.wired.com/story/early-black-friday-deals-2021/
it's|https://wyrk.com/what-would-you-tear-down-in-buffalo-and-why/
const doc = new DOMParser().parseFromString(body, 'text/html');
article = new Readability(doc).parse();
const nodes = await retext()
  .use(retextPos)
  .use(retextKeywords)
  .process(`${article.title} - ${article.textContent}`);
seems like "it's" is a stop word
Getting "it's" as a keyword
Deno v1
Other (please specify in steps to reproduce)
Linux
import retextKeywords from 'https://cdn.skypack.dev/[email protected]?dts';
When using the default example given in the readme, adding an options object with a maximum value causes an error.
OSX Mojave 10.14.4
Package | Version |
---|---|
retext | ^7.0.1 |
retext-keywords | ^5.0.0 |
nlcst-to-string | latest |
retext-pos | ^2.0.2 |
to-vfile | ^6.0.0 |
Package | Version |
---|---|
node | 8.7.0 |
npm | 6.13.1 |
npm install retext
npm install retext-keywords
npm install nlcst-to-string
npm install retext-pos
npm install to-vfile
index.js
example.txt file in the same directory
options object to the keywords with a maximum: 8
node index
We should see a similar console log but with around 8 keywords and phrases.
This error is thrown:
/Users/Mario/Sites/retext-test/node_modules/unist-util-visit-parents/index.js:41
if (node.children && result[0] !== SKIP) {
^
TypeError: Cannot read property 'children' of undefined
at one (/Users/Mario/Sites/retext-test/node_modules/unist-util-visit-parents/index.js:41:14)
at visitParents (/Users/Mario/Sites/retext-test/node_modules/unist-util-visit-parents/index.js:26:3)
at visit (/Users/Mario/Sites/retext-test/node_modules/unist-util-visit/index.js:22:3)
at getImportantWords (/Users/Mario/Sites/retext-test/node_modules/retext-keywords/index.js:214:3)
at Function.transformer (/Users/Mario/Sites/retext-test/node_modules/retext-keywords/index.js:17:21)
at freeze (/Users/Mario/Sites/retext-test/node_modules/unified/index.js:118:28)
at Function.process (/Users/Mario/Sites/retext-test/node_modules/unified/index.js:352:5)
at Object.<anonymous> (/Users/Mario/Sites/retext-test/index.js:14:6)
at Module._compile (module.js:624:30)
at Object.Module._extensions..js (module.js:635:10)
Please see attached example project.
retext-test.zip
I have copied and pasted the content of this blog https://codeforgeek.com/2015/01/nodejs-mysql-tutorial/ and passed it as a string in the code below.
var retext = require('retext');
var keywords = require('retext-keywords');
var nlcstToString = require('nlcst-to-string');
var phraseData2 = "string of that web page";
retext()
  .use(keywords)
  .process(phraseData2, function (err, file) {
    if (err) throw err;
    console.log('Keywords:');
    file.data.keywords.forEach(function (keyword) {
      console.log(nlcstToString(keyword.matches[0].node));
    });
    console.log();
    console.log('Key-phrases:');
    file.data.keyphrases.forEach(function (phrase) {
      console.log(phrase.matches[0].nodes.map(nlcstToString).join(''));
    });
  });
It returns blank for keywords and phrases.
Thanks for building this! Do you think you can put the underlying algorithm (RAKE?) in README for easier estimation of big-O?
Hi There -
This is a pretty handy package. I've been using it on the server/local machine just fine.
I've been trying to get this working in AWS Lambda, but it keeps failing with no error (the function just times out when trying to require the module).
Looking through the source code, I can't see anything in there that would cause it to do that (eg: native libraries).
Do you have any insights into this?
Keywords that are identified if written in en-uk are not identified if written in en-us.
Example: favourite is identified as a keyword, but favorite is not.
I tried out a few workarounds but didn't get anywhere. Please let me know if a workaround already exists.
I have noticed a lot of single-word phrases are being returned in some documents. Considering "phrases" is plural I think it should return 2+ word phrases.
It would be nice to be able to select phrases by a number of words or greater than a specific number of words.
Return "phrases", not single words. Preferably with the ability to choose between 3-6 or 2-4 word phrases.
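Until such an option exists, multi-word phrases can be selected after the fact. The sketch below assumes the result shape used in the readme-style examples in these reports (`file.data.keyphrases`, each entry with `matches[0].nodes`) and that word nodes have the type `WordNode`. The sample data and the helper name `phrasesWithWordCount` are made up for illustration.

```javascript
// Hypothetical helper: keep phrases whose first match has between `min` and
// `max` words (inclusive). Whitespace nodes are not counted.
function phrasesWithWordCount(keyphrases, min, max) {
  return keyphrases.filter(function (phrase) {
    const words = phrase.matches[0].nodes.filter(function (node) {
      return node.type === 'WordNode';
    });
    return words.length >= min && words.length <= max;
  });
}

// Made-up sample that mimics the shape of file.data.keyphrases.
function word(value) {
  return {type: 'WordNode', children: [{type: 'TextNode', value: value}]};
}
const space = {type: 'WhiteSpaceNode', value: ' '};
const sample = [
  {matches: [{nodes: [word('Lizz'), space, word('Hobbs'), space, word('Group')]}]},
  {matches: [{nodes: [word('concerts')]}]}
];

console.log(phrasesWithWordCount(sample, 2, 4).length); // 1
```

Calling `phrasesWithWordCount(file.data.keyphrases, 3, 6)` would then keep only the 3-6 word phrases.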
Demo: https://github.com/bebraw/keywords-demo .
Crash
.../node_modules/retext-keywords/index.js:14
retext.use(pos);
^
TypeError: Cannot read property 'use' of undefined
Could it be possible the interface has changed somehow?
I'm getting the following error whenever I want to process a string that contains the word "constructor":
TypeError: Cannot read property 'push' of undefined
at /Users/facundo/dev/gp-keywords/node_modules/retext-keywords/index.js:100:36
at one (/Users/facundo/dev/gp-keywords/node_modules/unist-util-visit/index.js:72:22)
at all (/Users/facundo/dev/gp-keywords/node_modules/unist-util-visit/index.js:48:26)
at one (/Users/facundo/dev/gp-keywords/node_modules/unist-util-visit/index.js:76:20)
at all (/Users/facundo/dev/gp-keywords/node_modules/unist-util-visit/index.js:48:26)
at one (/Users/facundo/dev/gp-keywords/node_modules/unist-util-visit/index.js:76:20)
at all (/Users/facundo/dev/gp-keywords/node_modules/unist-util-visit/index.js:48:26)
at one (/Users/facundo/dev/gp-keywords/node_modules/unist-util-visit/index.js:76:20)
at visit (/Users/facundo/dev/gp-keywords/node_modules/unist-util-visit/index.js:82:5)
at getImportantWords (/Users/facundo/dev/gp-keywords/node_modules/retext-keywords/index.js:81:5)
You can see the code where I use retext-keywords here. I first found the issue when passing the string 'Happy Bike Race: sMAShy for WhEEls the WantED - Bridge Road: COnstrucTOR' to the process function, but I can reproduce it by just passing 'constructor' too.
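A plausible cause, not confirmed against the plugin's source, is a plain object used as a word-to-entry map: every plain object inherits a `constructor` property, so a lookup for that word finds a function instead of `undefined`, and the code then treats it as an existing entry. The sketch below reproduces that pattern with hypothetical names (`countUnsafe`, `countSafe`) and shows one common fix, an object without a prototype.

```javascript
// Buggy pattern: `seen` inherits from Object.prototype, so seen['constructor']
// is already a function and the "first time" branch is skipped.
function countUnsafe(words) {
  const seen = {};
  words.forEach(function (word) {
    if (!seen[word]) {
      seen[word] = {matches: []};
    }
    seen[word].matches.push(word); // throws a TypeError for "constructor"
  });
  return seen;
}

// Fix: an object created with a null prototype has no inherited keys.
function countSafe(words) {
  const seen = Object.create(null);
  words.forEach(function (word) {
    if (!seen[word]) {
      seen[word] = {matches: []};
    }
    seen[word].matches.push(word);
  });
  return seen;
}

let failed = false;
try {
  countUnsafe(['constructor']);
} catch (error) {
  failed = error instanceof TypeError;
}

console.log(failed); // true
console.log(countSafe(['constructor']).constructor.matches.length); // 1
```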
I'm trying to bundle retext-keywords using webpack, and apparently it's tripping on the block in retext-pos:
/*
 * Duo and component / npm and component.
 */
try {
  posjs = require('pos');
} catch (err) {
  /* istanbul ignore next - browser */
  posjs = require('pos-js');
}
It looks like webpack sees the require() statement inside catch and tries to import that file, which of course doesn't exist since pos-js is not an npm module. retext-pos v2.0.0 does not appear to have this kind of try..catch in there.
I found the retext-latin plugin, but it does not improve the results. Can you help?
"United" is always filtered out, it is a very common word used in Country or Organization names
Hi Guys,
Hoping you can help me with an issue I'm having. I've created an example using the example code provided so the bug can be reproduced.
The issue is that when analysing text which contains words such as "night’s", e.g.:
"Last night’s concert was the third in a series organised by the Lizz Hobbs Group, who have now produced concerts at Slessor Gardens two years in a row."
the keyphrase is given as "Last night2’2s concert". I'm wondering how I can work around or resolve this so that the 2s do not appear around the ’ character.
Please see the following example code:
var retext = require('retext');
var keywords = require('retext-keywords');
var nlcstToString = require('nlcst-to-string');
var text = "Last night’s concert was the third in a series organised by the Lizz Hobbs Group, who have now produced concerts at Slessor Gardens two years in a row.";
retext()
  .use(keywords)
  .process(text, function (err, file) {
    if (err) throw err;
    console.log('Keywords:');
    file.data.keywords.forEach(function (keyword) {
      console.log(nlcstToString(keyword.matches[0].node));
    });
    console.log();
    console.log('Key-phrases:');
    file.data.keyphrases.forEach(function (phrase) {
      console.log(phrase.matches[0].nodes.map(nlcstToString).join(''));
    });
  });
output is:
Keywords:
concert
Last
night’s
third
series
Lizz
Hobbs
Group
Slessor
Gardens
two
years
row
Key-phrases:
Last night2’2s concert
Slessor Gardens two years
Lizz Hobbs Group
concerts
see Last night2’2s concert
node v10.10.0
npm 6.4.1
I'd appreciate any pointers, as I'm not sure how to approach resolving the issue.
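The stray digits are consistent with `Array.prototype.map` passing the element index as a second argument. If the serializer accepts an optional separator as its second parameter (which, as far as I can tell, `nlcst-to-string` did at the time), then `nodes.map(nlcstToString)` joins the children of the node at index 2 with "2". This is an inference from the output above, not a confirmed diagnosis. The `toString` function below is a simplified stand-in written for this example, not the library's code.

```javascript
// Simplified stand-in for a toString(node, separator) serializer.
function toString(node, separator) {
  if (typeof node.value === 'string') return node.value;
  return node.children
    .map(function (child) { return toString(child, separator); })
    .join(separator || '');
}

// Made-up helper that builds a word node from its text and punctuation parts.
function word(parts) {
  return {
    type: 'WordNode',
    children: parts.map(function (value) {
      return {type: 'TextNode', value: value};
    })
  };
}
const space = {type: 'WhiteSpaceNode', value: ' '};
const nodes = [word(['Last']), space, word(['night', '’', 's']), space, word(['concert'])];

// map() calls toString(node, index, array), so the index becomes the separator.
const buggy = nodes.map(toString).join('');
// Wrapping the call passes only the node.
const fixed = nodes.map(function (node) { return toString(node); }).join('');

console.log(buggy); // "Last night2’2s concert"
console.log(fixed); // "Last night’s concert"
```

Under that assumption, replacing `.map(nlcstToString)` with `.map(function (node) { return nlcstToString(node); })` in the example code should remove the digits.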
Hi,
What's the algorithm/logic used? I am considering using it to get keywords from text in Spanish, would it work?
Thank you!
"retext": "^9.0.0", "retext-keywords": "^8.0.1", "retext-pos": "^5.0.0", "nlcst-to-string": "^4.0.0", "to-vfile": "^8.0.0",
No response
We should be able to see keywords and phrases, but instead, we encounter the following error:
./node_modules/.pnpm/[email protected]/node_modules/unist-util-visit-parents/lib/index.js
Attempted import error: 'color' is not exported from 'unist-util-visit-parents/do-not-use-color' (imported as 'color').
MacOS Sonoma 14.1.2
Next.js