yooper / php-text-analysis

PHP Text Analysis is a library for performing Information Retrieval (IR) and Natural Language Processing (NLP) tasks using the PHP language

Home Page: https://github.com/yooper/php-text-analysis/wiki

License: MIT License

PHP 99.93% Shell 0.07%
nlp php tokenization php-text-analysis php-language text-analysis

php-text-analysis's People

Contributors

ace411, carbon-cloud-deploy, cicnavi, elievischel, euak, evertharmeling, maxguru, neoblack, nielsriekert, novemb3r, repat, thiagogomesverissimo, yooper

php-text-analysis's Issues

What methods do you recommend for a simple Q&A bot?

I have a set of questions and answers as a dataset, and I want to build a simple question-answering bot with this library.
When a user asks a new question, the bot should find the most similar question in the dataset and return the corresponding answer.
I think I need to go through these steps:
1. Normalize the dataset questions by preprocessing them
2. Normalize the user's new question
3. Tokenize the user's new question
4. Run a stemmer on the tokens from step 3 (I don't know how)
5. Use cosine similarity

What would you recommend for a bot like this?
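
The steps listed above can be sketched in plain PHP. This is a minimal sketch, not the library's API: the helper names `qa_tokens`, `cosine_similarity`, and `best_answer` and the sample data are made up for illustration, and the stemming step is left out.

```php
<?php
declare(strict_types=1);

/** Lowercase the text and split it into word tokens (Unicode-aware). */
function qa_tokens(string $text): array
{
    $tokens = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return $tokens === false ? [] : $tokens;
}

/** Cosine similarity between the term-count vectors of two token lists. */
function cosine_similarity(array $a, array $b): float
{
    $fa = array_count_values($a);
    $fb = array_count_values($b);
    $dot = 0;
    foreach ($fa as $term => $count) {
        $dot += $count * ($fb[$term] ?? 0);
    }
    $norm = sqrt(array_sum(array_map(fn($c) => $c * $c, $fa)))
          * sqrt(array_sum(array_map(fn($c) => $c * $c, $fb)));
    return $norm > 0.0 ? $dot / $norm : 0.0;
}

/** Return the answer whose stored question is most similar to $question, or null. */
function best_answer(string $question, array $faq): ?string
{
    $best = null;
    $bestScore = 0.0;
    $queryTokens = qa_tokens($question);
    foreach ($faq as $known => $answer) {
        $score = cosine_similarity($queryTokens, qa_tokens($known));
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $answer;
        }
    }
    return $best;
}

// Hypothetical sample data.
$faq = [
    'How do I reset my password?'  => 'Use the "Forgot password" link.',
    'What are your opening hours?' => 'We are open 9 to 5 on weekdays.',
];
echo best_answer('how can I reset the password', $faq), PHP_EOL;
```

Raw term counts are used here for brevity; weighting terms (for example with TF-IDF) and removing stop words would likely give better matches on a real dataset.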

Question: can I find a signature in a text with this code?

I have some texts from some authors. Each one has its own signature or link in the text.

For example author1:
text1:

sdsadsad daSDA DDASd asd aSD Sd dA SD ASD sadasdasds sadasd

@jhsad.sadas.com sdsdADSA sada

text2:

KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg df gfdhgf g
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf

text3:

jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl

How can I find @jhsad.sadas.com in the text?

EDIT:
@jhsad.sadas.com is only an example signature. I don't know what the authors' real signatures might be! They also have no fixed format: it could be @jhsad.sadas.com, or "visit my blog at fsfsd.sfsf.dfssd", or...
What I have is some texts from each author, and I know that each author's texts contain a signature unique to that author.

IDEA:
I think that by converting words to vectors and measuring the similarity between the texts, we could use cosine similarity to find the signatures. I think the solution must be something along those lines.
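
One simple alternative to the vector idea is to look for tokens that occur in every text by the same author. The sketch below does that in plain PHP, assuming the signature is a single whitespace-delimited token; the function name `common_tokens` is made up, and the sample texts are shortened versions of the ones above. A multi-word signature would need the same intersection applied to word n-grams.

```php
<?php
declare(strict_types=1);

/** Return the whitespace-separated tokens that occur in every text. */
function common_tokens(array $texts): array
{
    $sets = array_map(
        fn(string $t): array => array_unique(preg_split('/\s+/', trim($t), -1, PREG_SPLIT_NO_EMPTY)),
        $texts
    );
    if (count($sets) === 0) {
        return [];
    }
    // Start from the first text's tokens and keep only those present in all others.
    $common = array_shift($sets);
    foreach ($sets as $set) {
        $common = array_intersect($common, $set);
    }
    return array_values($common);
}

$texts = [
    "sdsadsad daSDA DDASd asd\n@jhsad.sadas.com sdsdADSA sada",
    "KDJKLFFD GFDGFDHGF hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf",
    "jhjkfsdg fdgdf sfds\n@jhsad.sadas.com dsfjdshflkds kg",
];
print_r(common_tokens($texts));
```

On real prose, common words such as "the" would also appear in every text, so the candidates would need to be filtered against a stop-word list or against texts by other authors.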

Does this repo support paraphrasing?

I am new to the domain of paraphrasing and rewriting. I have just started learning and came across this repo. Can you tell me whether it supports paraphrasing? Can it rewrite a paragraph? If there is an existing example I could look at, it would help me understand.

I used your examples but they do not work

Hi. I was looking for text mining code in Python and came across this awesome PHP code.
I installed the packages on Ubuntu 16.04.3:
sudo apt-get install libpspell-dev php7.0-pspell aspell-en php7.0-enchant
Then I ran composer install. After it finished, I went to the tests folder:

gn@me ~/c/p/tests> php TestBaseCase.php 
PHP Fatal error:  Class 'PHPUnit_Framework_TestCase' not found in /home/gn/code/php-text-analysis/tests/TestBaseCase.php on line 13

I also used the tokenizer as below:

<?php
use TextAnalysis\Tokenizers\GeneralTokenizer;


        $tokenizer = new GeneralTokenizer();
        $text1 = $tokenizer->tokenize('hi, how are you');
        $text2 = $tokenizer->tokenize('hello, thank you')  ;

and it returned:

gn@me ~/c/p/tests> php similarity.php 
PHP Fatal error:  Uncaught Error: Class 'TextAnalysis\Tokenizers\GeneralTokenizer' not found in /home/gn/code/php-text-analysis/tests/similarity.php:6
Stack trace:
#0 {main}
  thrown in /home/gn/code/php-text-analysis/tests/similarity.php on line 6

I also went to the src folder and created a similarity.php file:

<?php

        require_once 'Tokenizers/GeneralTokenizer.php'; 
        $tokenizer = new \Tokenizers\GeneralTokenizer();
        $text1 = $tokenizer->tokenize('hi, how are you');
        $text2 = $tokenizer->tokenize('hello, thank you')  ;

and it gave me this error:
PHP Fatal error: Class 'TextAnalysis\Tokenizers\TokenizerAbstract' not found in /home/gn/code/php-text-analysis/src/Tokenizers/GeneralTokenizer.php on line 11

Which of my steps are wrong, and how can I use the code correctly?
Thanks

Poor Vader Sentiment Accuracy. Lots of influential words missing from the vader_lexicon.txt

So, I tried running this implementation of the Vader algorithm on this dataset: https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences

All I do is: vader(normalize_tokens(tokenize('and . ' . $sample[0]))) (adding 'and . ' as a dummy first word as a workaround for a bug in the library)

Here are the results:

[
"vader" => array:3 [
    "amazon_cells_labelled.txt" => array:9 [
      "positive" => 500
      "negative" => 500
      "matched-positive" => 367
      "failed-positive" => 133
      "matched-negative" => 223
      "failed-negative" => 277
      "matched-neutral" => 320
      "matched-%-positive" => 73.4
      "matched-%-negative" => 44.6
    ]
    "imdb_labelled.txt" => array:9 [
      "positive" => 500
      "negative" => 500
      "matched-positive" => 364
      "failed-positive" => 136
      "matched-negative" => 233
      "failed-negative" => 267
      "matched-neutral" => 261
      "matched-%-positive" => 72.8
      "matched-%-negative" => 46.6
    ]
    "yelp_labelled.txt" => array:9 [
      "positive" => 500
      "negative" => 500
      "matched-positive" => 358
      "failed-positive" => 142
      "matched-negative" => 178
      "failed-negative" => 322
      "matched-neutral" => 350
      "matched-%-positive" => 71.6
      "matched-%-negative" => 35.6
    ]
]

I read how the algorithm works and I liked its simplicity.

However, the accuracy in the example above seems to be extremely poor, mainly because of the lean lexicon.

Are there fuller lexicons for the Vader algorithm? What else can I do to improve accuracy?
As you can see, the accuracy on negative sentences is beyond tragic.
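
For reference, the per-class percentages above can be combined into one overall figure per file. The sketch below assumes that "matched-positive" and "matched-negative" count correctly classified sentences; the function name `overall_accuracy` is made up for illustration.

```php
<?php
declare(strict_types=1);

/** Overall accuracy in percent: correctly labelled samples over all samples. */
function overall_accuracy(int $matchedPos, int $matchedNeg, int $totalPos, int $totalNeg): float
{
    return round(100 * ($matchedPos + $matchedNeg) / ($totalPos + $totalNeg), 1);
}

// [matched-positive, matched-negative] copied from the results above;
// each file has 500 positive and 500 negative sentences.
$results = [
    'amazon_cells_labelled.txt' => [367, 223],
    'imdb_labelled.txt'         => [364, 233],
    'yelp_labelled.txt'         => [358, 178],
];
foreach ($results as $file => [$pos, $neg]) {
    printf("%s: %.1f%%\n", $file, overall_accuracy($pos, $neg, 500, 500));
}
// amazon_cells_labelled.txt: 59.0%
// imdb_labelled.txt: 59.7%
// yelp_labelled.txt: 53.6%
```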

Best sentence of a long text

Hello,

This is a really great job, with neat code and a good implementation.

I have one question:
Is there a way to get the best sentence of a given long text (to be used as a short description of the text)?

  • Limited by a maximum word count

Thanks in advance

Notice & Warning on lines 216, 217, 219 WordnetCorpus.php

I am trying out your awesome library and I found notices and warnings on lines 216, 217, and 219 of php-text-analysis/src/corpus/WordnetCorpus.php

It happens when you call stem() with the MorphStemmer class and the WordNet corpus:
$stemmedTokens = stem($top_keywords, \TextAnalysis\Stemmers\MorphStemmer::class);

How can I use this code to find text similarity?

Hi,
I am looking for a piece of code that simply finds the similarity between two comments. Each comment has 100-300 words.
How can I use this code for cosine similarity, or any other method of finding text similarity?
My texts are in Persian; does that matter?

Thank you.

Find most similar

What algorithm should I use to find the closest match for a string in a set of strings?

Example of known inputs:

I would like a cheese pizza
I would like a cheese pizza with onions
I would like a cheese pizza without onions

Input that I want to match against the known inputs to find the most similar one, if there is one (in this example the only differences are spelling mistakes):

I would like a ceese pizza with out onnions.
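
For inputs that differ only by spelling mistakes, one option is edit distance. The sketch below uses PHP's built-in `levenshtein()` on the example strings above; the function name `closest_match` is made up for illustration. Note that `levenshtein()` compares bytes, so it is less reliable for non-ASCII text.

```php
<?php
declare(strict_types=1);

/** Return the candidate with the smallest Levenshtein distance to $input, or null. */
function closest_match(string $input, array $candidates): ?string
{
    $best = null;
    $bestDistance = PHP_INT_MAX;
    foreach ($candidates as $candidate) {
        $distance = levenshtein(strtolower($input), strtolower($candidate));
        if ($distance < $bestDistance) {
            $bestDistance = $distance;
            $best = $candidate;
        }
    }
    return $best;
}

$known = [
    'I would like a cheese pizza',
    'I would like a cheese pizza with onions',
    'I would like a cheese pizza without onions',
];
echo closest_match('I would like a ceese pizza with out onnions.', $known), PHP_EOL;
// prints "I would like a cheese pizza without onions"
```

In practice a maximum acceptable distance would be needed as well, so that an unrelated input is reported as "no match" rather than mapped to the least bad candidate.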

Using the tokenizer

Hi Dan,
I'm wondering how to use the tokenizer from this toolset. You suggest:
$tokenizer = new GeneralTokenizer();
$tokens = $tokenizer->tokenize("some text");
but how do I load the toolset itself on the PHP page? I've tried various include/require lines, but none of them work.
Thanks.

UTF8 for normalize_tokens

I modified the helpers.php file slightly so that it supports UTF-8. I use the project to analyse Norwegian text, so this was important for my project.
Thanks for such a nice library ;)

function normalize_tokens( array $tokens, $normalizer = 'mb_strtolower' ): array {
    mb_internal_encoding('UTF-8');
    return array_map( $normalizer, $tokens );
}

Type hint

Type hint the functions' arguments.

Composer problem OSX

Hi Yooper,
I tried to install the library on OSX but I got this error:

composer require yooper/php-text-analysis
Using version ^1.0 for yooper/php-text-analysis
./composer.json has been updated
Loading composer repositories with package information
Updating dependencies (including require-dev)
Your requirements could not be resolved to an installable set of packages.

Problem 1
- Installation request for yooper/php-text-analysis ^1.0 -> satisfiable by yooper/php-text-analysis[v1.0].
- yooper/php-text-analysis v1.0 requires yooper/stop-words dev-master -> satisfiable by yooper/stop-words[dev-master] but these conflict with your requirements or minimum-stability.

Installation failed, reverting ./composer.json to its original content.
--bd

Unable to install | Neither Using Shared Hosting Nor Local Host

Hello, thanks for the text analyzer, but I'm unable to install it.
I'm using shared hosting that supports PHP 5.

I tried pasting the files onto my server, but it does not work.

For example, I tried the email filter, but it could not find "ITokenTransformation" from:

class EmailFilter implements ITokenTransformation

The line "use TextAnalysis\Interfaces\ITokenTransformation;" returns an error too.

Thanks,
Franz

Issue with Laravel and Composer

I was checking whether I could use this within my own project, but when I tried to build a proof of concept, it could not be installed into a clean Laravel installation.

macbookpro$ composer require yooper/php-text-analysis
Using version ^1.3 for yooper/php-text-analysis
./composer.json has been updated
Loading composer repositories with package information
Updating dependencies (including require-dev)
Your requirements could not be resolved to an installable set of packages.

  Problem 1
    - Installation request for yooper/php-text-analysis ^1.3 -> satisfiable by yooper/php-text-analysis[1.3].
    - Conclusion: remove symfony/console v4.0.4
    - Conclusion: don't install symfony/console v4.0.4

Would it be possible to allow this version of symfony/console?

FreqDist::getKeyValuesByWeight

$weightPerToken = $this->getWeightPerToken();
//make a copy of the array
$keyValuesByWeight = $this->keyValues;
array_walk($keyValuesByWeight, function(&$value, $key, $weightPerToken) {
    $value /= $weightPerToken;
}, $this->totalTokens);

Perhaps there is a mistake.

array_walk: If the optional third parameter is supplied, it will be passed as the third parameter to the callback funcname.

So, $weightPerToken inside callback is just $this->totalTokens not $this->getWeightPerToken().
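
The `array_walk()` behaviour described above can be checked in isolation. In this sketch the sample counts are made up, and `1 / $totalTokens` is only an assumed stand-in for what `getWeightPerToken()` might return.

```php
<?php
declare(strict_types=1);

$keyValues = ['cat' => 2, 'dog' => 6];
$totalTokens = array_sum($keyValues);   // 8
$weightPerToken = 1 / $totalTokens;     // 0.125, assumed definition for illustration only

// array_walk() hands its third argument to the callback as the third parameter.
// The parameter's name inside the callback has no link to any outer variable.
$divide = function (&$value, $key, $divisor) {
    $value /= $divisor;
};

$byTotal = $keyValues;
array_walk($byTotal, $divide, $totalTokens);     // divides by 8

$byWeight = $keyValues;
array_walk($byWeight, $divide, $weightPerToken); // divides by 0.125

print_r($byTotal);  // cat => 0.25, dog => 0.75
print_r($byWeight); // cat => 16, dog => 48
```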

Entity Extraction returns empty array

Hey,

I've started working with your wrapper for the "Stanford Named Entity Extraction", but all I get back is an empty array. There are no error messages either.

This is my code:

use TextAnalysis\Documents\TokensDocument;
use TextAnalysis\Taggers\StanfordNerTagger;
use TextAnalysis\Tokenizers\WhitespaceTokenizer;

// Path prefixes were redacted by the reporter as [HIDDEN].
$jarpath = "[HIDDEN]/stanford-yooper/stanford-ner.jar";
$classifierPath = "[HIDDEN]/stanford-yooper/classifiers/english.all.3class.distsim.crf.ser.gz";

$engText = "Marquette County is a county located in the Upper Peninsula of the US state of Michigan. As of the 2010 census, the population was 67,077.";

$document = new TokensDocument((new WhitespaceTokenizer())->tokenize($engText));
$tagger = new StanfordNerTagger($jarpath, $classifierPath);
$output = $tagger->tag($document->getDocumentData());
var_dump($output); // empty array

How to run source

I don't know how to run the source. I don't see an index file or a sample file.

Please help me, thanks

Available languages

I am searching for an NLP package that works with Hungarian (and with other European languages).
Does this package work with them?

Find most common sequences of words (sentences) in a 10,000-word body?

Hello,

Just wondering if you could help. I am working on a project where I need to go through a website's articles and figure out for each article (2,000 to 10,000 words for each article) what are the most common phrases.

That way, I can improve the internal linking of this website.

Would this be something achievable using php-text-analysis?

Thank you,

L
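
Counting word n-grams is one straightforward way to approach this. The sketch below is plain PHP, independent of the library; the function name `top_phrases` and the sample text are made up for illustration. It ignores sentence boundaries and does not remove stop words.

```php
<?php
declare(strict_types=1);

/** Count every run of $n consecutive words and return the $limit most frequent. */
function top_phrases(string $text, int $n, int $limit): array
{
    $words = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $counts = [];
    for ($i = 0; $i + $n <= count($words); $i++) {
        $phrase = implode(' ', array_slice($words, $i, $n));
        $counts[$phrase] = ($counts[$phrase] ?? 0) + 1;
    }
    arsort($counts); // highest count first
    return array_slice($counts, 0, $limit, true);
}

$article = 'Search engine optimization matters. Good search engine optimization '
         . 'takes time, and search engine rankings change.';
print_r(top_phrases($article, 2, 2));
// "search engine" => 3, "engine optimization" => 2
```

For a 10,000-word article this loop is cheap; running it for n = 2, 3, and 4 and discarding phrases that start or end with a stop word would likely give more useful candidates for internal links.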

Use Case: chatbot

Can I integrate this library into a chatbot to process incoming messages?

How can I use the TF-IDF?

Hi, I was experimenting around and found that this library has a TFIDF implementation. Can someone show me an example to get this to work?

What should I put for the DocumentAbstract $document and the $token? And how can I see the result?

Thanks.

PHPNGrams library

Hello,

I just published a library, PHPNgrams, and a colleague of mine just pointed me to this project, which I wasn't aware of.
It's very nice!

If you think PHPNgrams might be useful for you, let me know :)

Keep up the good work!

wamania/php-stemmer 1.2

Hello,

I saw that you use my library php-stemmer; thanks for that.
I have released version 1.2.
I just added Russian, so there should be no compatibility problem. I think it's better to use the latest version.

Regards

Trying to access array offset on value of type bool

  • PHP Version: 7.4.2

  • File: vendor/yooper/php-text-analysis/src/Sentiment/Vader.php, line 419

  • Code:

foreach($rows as $row)
{
    $this->lexicon[$row[0]] = $row[1];
}
  • Problem: $row is not always an array

  • Temp Fix:

foreach($rows as $row)
{
    if(is_array($row)) {
      $this->lexicon[$row[0]] = $row[1];
    }
}

Matching a phrase in text

Hello!

Being totally new to the complexities of text analysis, I've found it interesting getting to grips with your library.

Something I've found tricky is finding a middle ground between the individual-word output of getKeyValuesByWeight() and the phrase output of getKeywordScores().

If I have a keyword to check in a text, e.g. "search engine optimization", it doesn't appear as a key in the output of getKeywordScores() (although the phrase does appear multiple times in the text).

Is there a solution using this library that you can suggest?
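
Outside the library, a known phrase can be counted directly with a regular expression. This is a minimal sketch in plain PHP; the function name `count_phrase` and the sample text are made up for illustration.

```php
<?php
declare(strict_types=1);

/** Count case-insensitive, whole-word occurrences of $phrase in $text. */
function count_phrase(string $text, string $phrase): int
{
    $words = preg_split('/\s+/', trim($phrase), -1, PREG_SPLIT_NO_EMPTY);
    if (!$words) {
        return 0;
    }
    // Allow any amount of whitespace between the words of the phrase.
    $quoted = array_map(fn(string $w): string => preg_quote($w, '/'), $words);
    $pattern = '/\b' . implode('\s+', $quoted) . '\b/iu';
    return (int) preg_match_all($pattern, $text);
}

$text = 'Search engine optimization is hard. Search  Engine Optimization pays off; '
      . 'search engines differ.';
echo count_phrase($text, 'search engine optimization'), PHP_EOL; // prints 2
```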

PHP 7.4 compatibility

Is there any way that this package could be updated to require wamania/php-stemmer ~2 (instead of ~1) for php 7.4 compatibility? I can PR this if you'd like.

Multinomial Naive Bayes

Hi,

I really appreciate your work.
I just want to ask: can it handle Multinomial Naive Bayes classification?

Thanks

Support PHP 8.x

Hi,

As a user, I would like to be able to use this package on PHP 8, since PHP 7 support is ending in less than 6 months.

Thanks

Entity Text Parser

I'm seeing this aspect of the package as being rather weak. Specifically, I'd like to be able to parse nouns/noun phrases and have a better categorization of them, similar to here:
https://github.com/web64/laravel-nlp#summarization

$entities = NLP::corenlp_entities($text);
/*
array:6 [
  "PERSON" => array:3 [
    0 => "John C. Breckinridge"
    1 => "James Buchanan"
  ]
  "STATE_OR_PROVINCE" => array:1 [
    0 => "Kentucky"
  ]
  "COUNTRY" => array:1 [
    0 => "United States"
  ]
  "ORGANIZATION" => array:1 [
    0 => "Confederate States of America"
  ]
  "DATE" => array:1 [
    0 => "1857"
  ]
  "TITLE" => array:1 [
    0 => "vice president"
  ]
]
*/

Unfortunately, this package requires Linux/Ubuntu platform, and I'm working in Windows.
Is there a download or different method for doing this here?

CharFilter not working?

My tokenDoc->toArray gives the output below after applying CharFilter. I was expecting the single-character elements to be gone?

array(15) {
  [0] =>
  string(1) "i"
  [1] =>
  string(1) "a"
  [2] =>
  string(7) "plumber"
  string(7) "plumber"
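
Independently of what CharFilter is meant to do, single-character tokens can be dropped from a plain token array with `array_filter()`. This is a sketch in plain PHP; the function name `drop_single_chars` is made up for illustration.

```php
<?php
declare(strict_types=1);

/** Keep only tokens longer than one character (multibyte-aware) and reindex. */
function drop_single_chars(array $tokens): array
{
    return array_values(array_filter($tokens, fn(string $t): bool => mb_strlen($t) > 1));
}

print_r(drop_single_chars(['i', 'a', 'plumber'])); // only "plumber" remains
```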

Store naive Bayes model

Hi, is it possible to store (and load) the trained model? I want to use php-text-analysis for document classification, and we can't train the model in every process.
Thank you
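
A general PHP approach is to `serialize()` the trained object to a file and `unserialize()` it later. Whether the library's classifier serializes cleanly is not confirmed here; the sketch below uses a hypothetical stand-in class, `TinyModel`, that is not part of the library.

```php
<?php
declare(strict_types=1);

/** Hypothetical stand-in for a trained classifier; not a class from the library. */
final class TinyModel
{
    /** @var array<string, array<string, int>> token counts per label */
    public array $counts = [];

    public function train(string $label, array $tokens): void
    {
        foreach ($tokens as $token) {
            $this->counts[$label][$token] = ($this->counts[$label][$token] ?? 0) + 1;
        }
    }
}

$model = new TinyModel();
$model->train('spam', ['win', 'money', 'win']);

// Store the trained model once...
$path = tempnam(sys_get_temp_dir(), 'model');
file_put_contents($path, serialize($model));

// ...and load it in a later process instead of retraining.
$restored = unserialize(file_get_contents($path), ['allowed_classes' => [TinyModel::class]]);
unlink($path);

var_dump($restored->counts === $model->counts); // bool(true)
```

Restricting `allowed_classes` keeps `unserialize()` from instantiating unexpected classes if the file is ever tampered with.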
