yooper / php-text-analysis

PHP Text Analysis is a library for performing Information Retrieval (IR) and Natural Language Processing (NLP) tasks using the PHP language

Home Page: https://github.com/yooper/php-text-analysis/wiki

License: MIT License

PHP 99.93% Shell 0.07%

nlp php tokenization php-text-analysis php-language text-analysis

php-text-analysis's Issues

what methods do you recommend for a simple Q&A bot?

I have a set of questions and answers as a dataset, and I want to develop a simple question-answering bot using this library.
The scenario: when a user asks a new question, the bot finds the most similar question in the dataset and returns the corresponding answer.
I think I need to go through these steps:
1. Normalize the dataset questions by preprocessing them
2. Normalize the user's new question
3. Tokenize the user's new question
4. Apply a stemmer (I don't know how) to the tokens from step 3
5. Use cosine similarity

What would you recommend for a bot like this?
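
The steps listed above can be sketched in plain PHP as follows. This is only an illustration under stated assumptions: it does not call the library, the function names (qa_tokenize, qa_cosine, qa_best_answer) and the sample dataset are hypothetical, and the stemming step is left out (the library's stem() helper mentioned in another issue could be applied to the tokens before comparison).

```php
<?php
// Sketch only: plain PHP, no library calls. All names here are hypothetical.

/** Steps 1-3: lowercase the text and split it into word tokens. */
function qa_tokenize(string $text): array
{
    $text = mb_strtolower($text, 'UTF-8');
    // Split on any run of characters that are not letters or digits.
    return preg_split('/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

/** Step 5: cosine similarity between two token lists, using raw term counts. */
function qa_cosine(array $a, array $b): float
{
    $fa = array_count_values($a);
    $fb = array_count_values($b);
    $dot = 0;
    foreach ($fa as $term => $count) {
        $dot += $count * ($fb[$term] ?? 0);
    }
    $normA = sqrt(array_sum(array_map(fn($c) => $c * $c, $fa)));
    $normB = sqrt(array_sum(array_map(fn($c) => $c * $c, $fb)));
    return ($normA > 0 && $normB > 0) ? $dot / ($normA * $normB) : 0.0;
}

/** Return the answer whose stored question is most similar, or null if nothing overlaps. */
function qa_best_answer(string $question, array $dataset): ?string
{
    $tokens = qa_tokenize($question);
    $best = null;
    $bestScore = 0.0;
    foreach ($dataset as $storedQuestion => $answer) {
        $score = qa_cosine($tokens, qa_tokenize($storedQuestion));
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $answer;
        }
    }
    return $best;
}

// Made-up dataset: stored question => answer.
$dataset = [
    'What are your opening hours?' => 'We are open 9 to 5.',
    'Where is the shop located?'   => 'On Main Street.',
];
echo qa_best_answer('when are you open, what hours?', $dataset), PHP_EOL;
```

With raw counts, very common words dominate the score, so weighting terms (for example with TF-IDF) and removing stop words would probably improve the matching.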

question: can I find a signature from text by this code?

I have some texts from some authors. Each one has its own signature or link in the text.

For example author1:
text1:

sdsadsad daSDA DDASd asd aSD Sd dA SD ASD sadasdasds sadasd

@jhsad.sadas.com sdsdADSA sada

text2:

KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg df gfdhgf g
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf

text3:

jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl

How can I find @jhsad.sadas.com in the text?

EDIT:
@jhsad.sadas.com is just an example signature. I don't know what the authors' real signatures might be. They also have no fixed format: a signature could be @jhsad.sadas.com, or "visit my blog at fsfsd.sfsf.dfssd", or something else.
What I have is some texts from each author, and I know that each author's texts contain a signature unique to that author.

IDEA:
I think that by converting words to vectors and measuring the similarity between the texts, we could use cosine similarity to find the signatures. I think the solution must be something along these lines.

Can't install this package with Symfony 5

I would like to use the library with Symfony 5, but I can't install it because of the version constraints.
Is it possible to raise the version's constraints?

False IDF calculation

I think the IDF value in \TextAnalysis\Indexes\TfIdf::buildIndex is calculated incorrectly. With my example I get only zero values. As shown in this article, https://janav.wordpress.com/2013/10/27/tf-idf-and-cosine-similarity/, the calculation on line 50 should be:
$value = 1+log(($count)/($value));
(that is, add 1 to the log() result)
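
A small standalone sketch may make the reported effect concrete. It assumes the IDF in question has the form log(N / df); the helper idf_values() and the sample documents are hypothetical and are not the library's TfIdf class. A term that occurs in every document gets log(N / N) = 0 under the plain formula, which would explain all-zero values when every term appears in every document.

```php
<?php
// Hypothetical helper for illustration, not the library's implementation.

/**
 * Compute an IDF value per token.
 * $documents is a list of token lists; $addOne switches to the 1 + log(N / df) variant.
 */
function idf_values(array $documents, bool $addOne): array
{
    $docCount = count($documents);
    $docFreq = [];
    foreach ($documents as $tokens) {
        // Count each token at most once per document.
        foreach (array_unique($tokens) as $token) {
            $docFreq[$token] = ($docFreq[$token] ?? 0) + 1;
        }
    }
    $idf = [];
    foreach ($docFreq as $token => $df) {
        $idf[$token] = ($addOne ? 1 : 0) + log($docCount / $df);
    }
    return $idf;
}

$docs = [['the', 'cat'], ['the', 'dog']];
print_r(idf_values($docs, false)); // 'the' is in both documents, so log(2 / 2) = 0
print_r(idf_values($docs, true));  // with the added 1, 'the' gets 1 instead of 0
```

Both variants appear in the literature; the added 1 only keeps terms that occur in every document from being zeroed out.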

Does this repo support paraphrasing?

I am new to paraphrasing and rewriting. I have just started learning and came across this repo. Can you tell me whether it supports paraphrasing, that is, rewriting a paragraph? An existing example that I could look into would help me understand.

Support Wordnet

Implement support for the wordnet data set.

I use your examples but it does not work

Hi. I was looking for text-mining code in Python and came across this awesome PHP library.
I installed the required packages on Ubuntu 16.04.3:
sudo apt-get install libpspell-dev php7.0-pspell aspell-en php7.0-enchant
Then I ran composer install. After it finished, I went to the tests folder:

gn@me ~/c/p/tests> php TestBaseCase.php 
PHP Fatal error:  Class 'PHPUnit_Framework_TestCase' not found in /home/gn/code/php-text-analysis/tests/TestBaseCase.php on line 13

I also used the tokenizer as shown below:

<?php
use TextAnalysis\Tokenizers\GeneralTokenizer;


        $tokenizer = new GeneralTokenizer();
        $text1 = $tokenizer->tokenize('hi, how are you');
        $text2 = $tokenizer->tokenize('hello, thank you')  ;

and it returned:

gn@me ~/c/p/tests> php similarity.php 
PHP Fatal error:  Uncaught Error: Class 'TextAnalysis\Tokenizers\GeneralTokenizer' not found in /home/gn/code/php-text-analysis/tests/similarity.php:6
Stack trace:
#0 {main}
  thrown in /home/gn/code/php-text-analysis/tests/similarity.php on line 6

I also went to the src folder and created a similarity.php file:

<?php

        require_once 'Tokenizers/GeneralTokenizer.php'; 
        $tokenizer = new \Tokenizers\GeneralTokenizer();
        $text1 = $tokenizer->tokenize('hi, how are you');
        $text2 = $tokenizer->tokenize('hello, thank you')  ;

and it gave me this error:
PHP Fatal error: Class 'TextAnalysis\Tokenizers\TokenizerAbstract' not found in /home/gn/code/php-text-analysis/src/Tokenizers/GeneralTokenizer.php on line 11

What am I doing wrong, and how can I use the code correctly?
Thanks

Add Text Summarizer

Poor Vader Sentiment Accuracy. Lots of influential words missing from the vader_lexicon.txt

I tried running this implementation of the Vader algorithm on this dataset: https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences

All I do is: vader(normalize_tokens(tokenize('and . ' . $sample[0]))) (adding 'and . ' as a dummy first word to work around a bug in the library)

Here are the results:

[
"vader" => array:3 [
    "amazon_cells_labelled.txt" => array:9 [
      "positive" => 500
      "negative" => 500
      "matched-positive" => 367
      "failed-positive" => 133
      "matched-negative" => 223
      "failed-negative" => 277
      "matched-neutral" => 320
      "matched-%-positive" => 73.4
      "matched-%-negative" => 44.6
    ]
    "imdb_labelled.txt" => array:9 [
      "positive" => 500
      "negative" => 500
      "matched-positive" => 364
      "failed-positive" => 136
      "matched-negative" => 233
      "failed-negative" => 267
      "matched-neutral" => 261
      "matched-%-positive" => 72.8
      "matched-%-negative" => 46.6
    ]
    "yelp_labelled.txt" => array:9 [
      "positive" => 500
      "negative" => 500
      "matched-positive" => 358
      "failed-positive" => 142
      "matched-negative" => 178
      "failed-negative" => 322
      "matched-neutral" => 350
      "matched-%-positive" => 71.6
      "matched-%-negative" => 35.6
    ]
  ]
]

I read how the algorithm works and I liked its simplicity.

However, the accuracy in the example above seems extremely poor, mainly because of the sparse lexicon.

Are there fuller lexicons for the Vader algorithm? What else can I do to improve accuracy?
As you can see, the accuracy when classifying negative sentences is beyond tragic.

Best sentence of a long text

Hello:

This is really great work, with neat code and a good implementation.

I have one question:
Is there a way to get the best sentence of a given long text (to be used as a short description of the text)?

Ideally it would be limited by a maximum word count.

Thanks in advance

Notice & Warning on lines 216, 217, 219 WordnetCorpus.php

I am trying out your awesome library and found notices and warnings raised on lines 216, 217, and 219 of php-text-analysis/src/corpus/WordnetCorpus.php.

It happens when you call stem() with the MorphStemmer class and the WordNet corpus:
$stemmedTokens = stem($top_keywords, \TextAnalysis\Stemmers\MorphStemmer::class);

how can I use this code for finding text similarity?

Hi,
I am looking for a simple way to find the similarity between two comments. Each comment has 100-300 words.
How can I use this library for cosine similarity, or any other method, to find text similarity?
My texts are in Persian. Does that matter?

Thank you.

Division by zero in FreqDist.php

Warning: Division by zero in \vendor\php-text-analysis\php-text-analysis\src\Analysis\FreqDist.php on line 63

starting with documentation

Hey, do you have a readme or a walkthrough for getting started with helping to write documentation?

Find most similar

What algorithm should I use to find the closest match for a string in a set of strings?

Example of known inputs:

I would like a cheese pizza
I would like a cheese pizza with onions
I would like a cheese pizza without onions

Input I want to match against the known inputs to find the most similar one, if any of them is similar (in this example there are just spelling mistakes):

I would like a ceese pizza with out onnions.
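
One simple option for spelling-level differences like these is edit distance. The sketch below uses PHP's built-in levenshtein() function; the helper name closest_match() is hypothetical and the code does not use the library. It picks the known input with the smallest character-level distance to the new input.

```php
<?php
// Sketch only: closest_match() is a hypothetical helper built on PHP's levenshtein().

/** Return the candidate with the smallest edit distance to $input, or null if there are none. */
function closest_match(string $input, array $candidates): ?string
{
    $best = null;
    $bestDistance = PHP_INT_MAX;
    // Ignore case and surrounding whitespace or full stops.
    $needle = strtolower(trim($input, " .\t\n"));
    foreach ($candidates as $candidate) {
        $distance = levenshtein($needle, strtolower($candidate));
        if ($distance < $bestDistance) {
            $bestDistance = $distance;
            $best = $candidate;
        }
    }
    return $best;
}

$known = [
    'I would like a cheese pizza',
    'I would like a cheese pizza with onions',
    'I would like a cheese pizza without onions',
];
echo closest_match('I would like a ceese pizza with out onnions.', $known), PHP_EOL;
```

Edit distance works on bytes, so it suits short ASCII strings with typos. For longer texts, or for inputs that differ in wording rather than spelling, a token-based measure such as cosine similarity is probably a better fit, and a maximum allowed distance could be added to reject inputs that match nothing well.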

Using the tokenizer

Hi Dan
I'm wondering how to use the tokenizer from this toolset. You suggest:
$tokenizer = new GeneralTokenizer();
$tokens = $tokenizer->tokenize("some text");
but how do I load the toolset itself on the PHP page? I've tried various include/require lines, but none of them work.
Thanks.

UTF8 for normalize_tokens

I modified the helpers.php file with small changes so that it supports UTF-8. I use the project to analyse Norwegian text, so this was important for me.
Thanks for such a nice library ;)

function normalize_tokens( array $tokens, $normalizer = 'mb_strtolower' ): array {
    mb_internal_encoding('UTF-8');
    return array_map( $normalizer, $tokens );
}
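
A short standalone check of why the multibyte function matters for Norwegian text, assuming the mbstring extension is loaded. The token list is made up, and the code does not call the library's normalize_tokens().

```php
<?php
// Sketch only: lowercases UTF-8 tokens with mb_strtolower().
// The encoding is passed explicitly, so no global mb_internal_encoding() call is needed.
$tokens = ['Ærlig', 'ØL', 'Åse'];

$normalized = array_map(
    fn(string $token): string => mb_strtolower($token, 'UTF-8'),
    $tokens
);

print_r($normalized); // ['ærlig', 'øl', 'åse']
```

Passing the encoding per call avoids changing process-wide state, which may matter if other code relies on a different internal encoding.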

Type hint

Type hint the functions' arguments.

Composer problem OSX

Hi Yooper.
I tried to install the library on OSX but I got this error:

composer require yooper/php-text-analysis
Using version ^1.0 for yooper/php-text-analysis
./composer.json has been updated
Loading composer repositories with package information
Updating dependencies (including require-dev)
Your requirements could not be resolved to an installable set of packages.

Problem 1
- Installation request for yooper/php-text-analysis ^1.0 -> satisfiable by yooper/php-text-analysis[v1.0].
- yooper/php-text-analysis v1.0 requires yooper/stop-words dev-master -> satisfiable by yooper/stop-words[dev-master] but these conflict with your requirements or minimum-stability.

Installation failed, reverting ./composer.json to its original content.
--bd

Unable to install | Neither Using Shared Hosting Nor Local Host

Hello, thanks for the text analyzer, but I'm unable to install it.
I'm using shared hosting that supports PHP 5.

I tried copying the files to my server, but it does not work.

For example, I tried the email filter, but it could not find "ITokenTransformation" in:

class EmailFilter implements ITokenTransformation

The "use TextAnalysis\Interfaces\ITokenTransformation;" line returns an error too.

Thanks,
Franz

issue with laravel composer

I was checking whether I could use this in my own project, but when I tried to build a proof of concept, it would not install into a clean Laravel installation.

macbookpro$ composer require yooper/php-text-analysis
Using version ^1.3 for yooper/php-text-analysis
./composer.json has been updated
Loading composer repositories with package information
Updating dependencies (including require-dev)
Your requirements could not be resolved to an installable set of packages.

  Problem 1
    - Installation request for yooper/php-text-analysis ^1.3 -> satisfiable by yooper/php-text-analysis[1.3].
    - Conclusion: remove symfony/console v4.0.4
    - Conclusion: don't install symfony/console v4.0.4

Would it be possible to allow this version of symfony/console?

FreqDist::getKeyValuesByWeight

php-text-analysis/src/Analysis/FreqDist.php

Lines 119 to 124 in 9b96d25

$weightPerToken = $this->getWeightPerToken();
//make a copy of the array
$keyValuesByWeight = $this->keyValues;
array_walk($keyValuesByWeight, function(&$value, $key, $weightPerToken) {
    $value /= $weightPerToken;
}, $this->totalTokens);

Perhaps there is a mistake.

array_walk: If the optional third parameter is supplied, it will be passed as the third parameter to the callback funcname.

So $weightPerToken inside the callback is just $this->totalTokens, not the result of $this->getWeightPerToken().
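
The shadowing described here can be reproduced in isolation. The variable names and numbers below are made up for illustration and are unrelated to the library's data. The callback's third parameter receives whatever is passed as array_walk()'s third argument, regardless of an outer variable with the same name.

```php
<?php
// Sketch only: demonstrates how array_walk() fills the callback's third parameter.
$divisor = 2;  // the value we intend to divide by
$other   = 8;  // the value that is actually passed as the third argument

$values = ['cat' => 2, 'dog' => 6];
array_walk($values, function (&$value, $key, $divisor) {
    // Here $divisor is the callback parameter, i.e. 8, not the outer 2.
    $value /= $divisor;
}, $other);
print_r($values); // ['cat' => 0.25, 'dog' => 0.75]

// If the outer variable is the intended one, capture it with use() instead.
$fixed = ['cat' => 2, 'dog' => 6];
array_walk($fixed, function (&$value) use ($divisor) {
    $value /= $divisor;
});
print_r($fixed); // ['cat' => 1, 'dog' => 3]
```

Which of the two values the library meant to divide by is not settled by this sketch; it only shows that the local $weightPerToken is never read by the callback.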

Entity Extraction returns empty array

Hey,

I've started working with your wrapper for the "Stanford Named Entity Extraction", but all I get back is an empty array. There are no error messages either.

This is my code:

            use TextAnalysis\Taggers\StanfordNerTagger;
            use TextAnalysis\Tokenizers\WhitespaceTokenizer;

            $jarpath = "[HIDDEN]/stanford-yooper/stanford-ner.jar";
            $classifierPath = "[HIDDEN]/stanford-yooper/classifiers/english.all.3class.distsim.crf.ser.gz";
     
            $engText = "Marquette County is a county located in the Upper Peninsula of the US state of Michigan. As of the 2010 census, the population was 67,077.";
            
            $document = new TokensDocument((new WhitespaceTokenizer())->tokenize($engText));
            $tagger = new StanfordNerTagger($jarpath,$classifierPath);
            $output = $tagger->tag($document->getDocumentData());
            var_dump($output); //empty Array

How to run source

I don't know how to run the source. I don't see an index file or a sample file.

Please help me, thanks.

Available languages

I am searching for an NLP package that works with Hungarian (and with other European languages).
Does this package work with them?

Is there a way to get the output in JSON format?

Find most common sequences of words (sentences) in a 10,000-word body?

Hello,

Just wondering if you could help. I am working on a project where I need to go through a website's articles and figure out, for each article (2,000 to 10,000 words each), what the most common phrases are.

That way, I can improve the internal linking of this website.

Would this be something achievable using php-text-analysis?

Thank you,
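
One common way to approach this is to count word n-grams (sequences of n consecutive words). The sketch below is plain PHP and does not use the library; the helper name top_phrases() and the sample text are hypothetical.

```php
<?php
// Sketch only: counts word n-grams in a text. top_phrases() is a hypothetical helper.

/** Return the $limit most frequent $n-word phrases as phrase => count. */
function top_phrases(string $text, int $n = 3, int $limit = 5): array
{
    // Lowercase and split on anything that is not a letter or digit.
    // Sentence boundaries are ignored, so a phrase may span two sentences.
    $words = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($text, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY);

    $counts = [];
    for ($i = 0; $i + $n <= count($words); $i++) {
        $phrase = implode(' ', array_slice($words, $i, $n));
        $counts[$phrase] = ($counts[$phrase] ?? 0) + 1;
    }

    arsort($counts); // highest count first
    return array_slice($counts, 0, $limit, true);
}

$article = 'Search engine optimization is hard. Search engine optimization takes time. '
         . 'Good search engine optimization pays off.';
$top = top_phrases($article, 3, 3);
print_r($top); // the first entry is 'search engine optimization' with a count of 3
```

For real articles it would probably help to run this for several values of n and to drop phrases made up mostly of stop words.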

Use Case: chatbot

Can I integrate this library into a chatbot to process incoming messages?

How can I use the TF-IDF?

Hi, I was experimenting around and found that this library has a TFIDF implementation. Can someone show me an example to get this to work?

What should I put for the DocumentAbstract $document and the $token? And how can I see the result?

Thanks.

PHPNGrams library

Hello,

I just published a library, PHPNgrams, and a colleague of mine pointed me to this project, which I wasn't aware of.
It's very nice!

If you think PHPNgrams might be useful for you, let me know :)

Keep up the good work !

wamania/php-stemmer 1.2

Hello,

I saw that you use my library php-stemmer, thanks for that.
I have released version 1.2.
I only added Russian, so there should be no compatibility problem. I think it's better to use the latest version.

Regards

Trying to access array offset on value of type bool

PHP Version: 7.4.2
File: vendor/yooper/php-text-analysis/src/Sentiment/Vader.php, line 419
Code:

foreach($rows as $row)
{
    $this->lexicon[$row[0]] = $row[1];
}

Problem: $row is not always an array
Temp Fix:

foreach($rows as $row)
{
    if(is_array($row)) {
      $this->lexicon[$row[0]] = $row[1];
    }
}

Matching a phrase in text

Hello!

Being totally new to the complexities of text analysis, I've found it interesting getting to grips with your library.

Something I've found tricky is finding a middle ground between the individual-word output of getKeyValuesByWeight() and the phrase output of getKeywordScores().

If I have a keyword to check in a text, e.g. "search engine optimization", it doesn't appear as a key in the output of getKeywordScores() (although the phrase does appear multiple times in the text).

Is there a solution using this library that you can suggest?
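
Independently of the library, checking for a known phrase can be done by comparing token windows. The sketch below is plain PHP; the helpers words() and count_phrase() are hypothetical and are not part of php-text-analysis.

```php
<?php
// Sketch only: counts how often a multi-word phrase occurs in a text.

/** Lowercase a string and split it into word tokens. */
function words(string $text): array
{
    return preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($text, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY);
}

/** Count occurrences of $phrase in $text by sliding a window over the tokens. */
function count_phrase(string $text, string $phrase): int
{
    $tokens = words($text);
    $needle = words($phrase);
    $n = count($needle);
    $hits = 0;
    for ($i = 0; $n > 0 && $i + $n <= count($tokens); $i++) {
        // array_slice() reindexes from 0, so it can be compared to $needle directly.
        if (array_slice($tokens, $i, $n) === $needle) {
            $hits++;
        }
    }
    return $hits;
}

$text = 'Search engine optimization matters. Good search-engine optimization takes time.';
echo count_phrase($text, 'search engine optimization'), PHP_EOL;
```

Matching on tokens rather than on the raw string means differences in case, punctuation, and hyphenation do not prevent a match.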

PHP 7.4 compatibility

Is there any way that this package could be updated to require wamania/php-stemmer ~2 (instead of ~1) for php 7.4 compatibility? I can PR this if you'd like.

Add TextRank Algorithm

Add an implementation of the TextRank algorithm.

Add Naive Bayes Classifier

Sure would be nice if someone wrote or donated a Naive Bayes Classifier to this project.

Multinomial Naive Bayes

Hi,

I really appreciate your work.
I just want to ask: can it handle Multinomial Naive Bayes classification?

Thanks

Support PHP 8.x

Hi,

As a user, I would like to be able to use this package on PHP 8, since PHP 7 support is ending in less than 6 months.

Thanks

Entity Text Parser

I'm seeing this aspect of the package as being rather weak. Specifically, I'd like to be able to parse nouns/noun phrases and have a better categorization of them, similar to here:
https://github.com/web64/laravel-nlp#summarization

$entities = NLP::corenlp_entities($text);
/*
array:6 [
  "PERSON" => array:3 [
    0 => "John C. Breckinridge"
    1 => "James Buchanan"
  ]
  "STATE_OR_PROVINCE" => array:1 [
    0 => "Kentucky"
  ]
  "COUNTRY" => array:1 [
    0 => "United States"
  ]
  "ORGANIZATION" => array:1 [
    0 => "Confederate States of America"
  ]
  "DATE" => array:1 [
    0 => "1857"
  ]
  "TITLE" => array:1 [
    0 => "vice president"
  ]
]
*/

Unfortunately, that package requires a Linux/Ubuntu platform, and I'm working on Windows.
Is there a download or a different way of doing this here?

CharFilter not working?

My tokenDoc->toArray gives the output below after applying CharFilter. I was expecting the single-character elements to be removed, but they are still there.

array(15) {
  [0] =>
  string(1) "i"
  [1] =>
  string(1) "a"
  [2] =>
  string(7) "plumber"

Build Word Cloud

Saw this package and noticed on the wiki page it mentions building a word cloud, but the page is empty. https://github.com/yooper/php-text-analysis/wiki/PHP-Keyword-Phrases-Word-Cloud

How could I potentially go about building a word cloud with this package?

Thanks!

Store naive Bayes model

Hi, is it possible to store (and load) the trained model? I want to use php-text-analysis for document classification, and we can't train the model in every process.
Thank you

