aymanrb / php-unstructured-text-parser
A PHP library to help extract text out of text documents that are not structured in a processing friendly manner
License: MIT License
Hi,
Thanks for an excellent text parser.
I'm having trouble with patterns where I'd like to capture the entire string after a selected word but stop before the line feed.
Example
Offices: New York, London, Paris
Feel free to give us a call
My template:
Offices: {%Offices%} Feel free to give us a call
However, the next line is also included in the capture. The only way to avoid this would be to include "Feel" in the template, but my source files are not consistent in what text follows the "Offices" line. Is there a way to tell the parser to stop at CR/LF?
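For illustration, the "stop at the line break" behaviour can be sketched with plain PCRE outside the library. This is not the parser's template syntax; the helper name `captureRestOfLine` is hypothetical and the sketch assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper, not part of php-unstructured-text-parser:
// capture everything after a label up to (but not including) the line break.
function captureRestOfLine(string $label, string $text): ?string
{
    // [^\r\n]* matches any run of characters except CR and LF,
    // so the capture can never spill onto the following line.
    $pattern = '/' . preg_quote($label, '/') . '[ \t]*([^\r\n]*)/u';

    if (preg_match($pattern, $text, $matches) !== 1) {
        return null; // label not found
    }

    return trim($matches[1]);
}

$text = "Offices: New York, London, Paris\r\nFeel free to give us a call";
echo captureRestOfLine('Offices:', $text), PHP_EOL; // prints "New York, London, Paris"
```

Because the character class excludes both CR and LF, the same pattern works for Windows and Unix line endings, whatever text follows on the next line.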
How closely does the template need to match the original text? Perfectly, or are whitespace differences ignored?
For example, my template might look like this:
Name: {%name}
and my parsed text like this:
Name: Charlie Brown
Would it fail to match because of the difference in whitespace?
That's what I'm currently observing.
Hello, I'm currently using the package and it's amazing. The code runs perfectly, but I get a couple of warnings. I tried different PHP versions (5.6, 7.0, 7.1 and 7.2) and still got the warning:
Warning: file_get_contents(C:\wamp64\www\ticket-parser/templates.): failed to open stream: Permission denied in C:\wamp64\www\ticket-parser\vendor\aymanrb\php-unstructured-text-parser\src\TextParser.php on line 76
I have access to that folder; it's the same Administrator user.
I haven't tried it on Linux yet, but I will. Any help?
Thanks
Edit:
I've added this check at line 76 so that 'file_get_contents' is not called on a directory:
// Only read real files; skip directories such as "." and ".."
if (!is_dir($fileInfo->getPathname())) {
    $templates[$fileInfo->getPathname()] = file_get_contents($fileInfo->getPathname());
}
I'm glad I stumbled upon this very useful plugin. I have one question: can I parse dynamic data, and how do I handle it in the template file?
Use case: I have a template for orders, and each order has items (products). One order can have 1 product while another can have 3 or more. How do I parse each product as a variable? What's the best way to handle this?
There must be a way to define 2 consecutive variables in a plain text document or even an HTML one when the 2 parameters we need to extract are one after the other with no defined separator (just a space or new line for example)
No result with the following text:
`Lieber Kunde,
Ihre Bestellung hat unser Versandlager verlassen und wurde unserem
Logistikpartner DACHSER übergeben.
Die Sendungsnummer zum Auftrag SO123456 (Referenz: xxxxx ) lautet:
123456789
Unter Eingabe der oben genannten Sendungsnummer können Sie durch Klick
auf den folgenden Link den Status Ihrer Sendung einsehen:
DACHSER Kontrollinformationen
https://elogistics.dachser.com/?66fghj
Für Rückfragen stehe ich Ihnen natürlich jederzeit gerne zur Verfügung.
Wichtiger Hinweis zur Warenannahme:
Trotz aller Sorgfalt kann es leider vorkommen, dass die Ware auf dem
Transportweg zu Ihnen Schäden abbekommt.
Prüfen Sie die Verpackung und die Ware daher unbedingt bei Anlieferung
auf Transportschäden, in Anwesenheit des Spediteurs! Jeder Spediteur ist
dazu verpflichtet, die Sichtprüfung abzuwarten. Wenn eine Beschädigung
der Verpackung oder der Ware ersichtlich ist, ist diese Beschädigung mit
einer kurzen Beschreibung, was genau beschädigt ist, auf dem Frachtbrief
zu vermerken und vom Fahrer bestätigen zu lassen. Danach nehmen Sie
bitte schnellstmöglich Kontakt mit uns auf!
Gemeldete Transportschäden ohne einen Vermerk auf den Frachtpapieren
oder verspätet gemeldete Transportschäden können nicht ersetzt werden!
Schöne Grüße aus Bremen`
First of all, thanks for such a wonderful plugin. I have a small question: how can I parse dynamic data? Right now every template is in a static format. Can you please advise on this?
I only want to parse text files that have not been parsed before (the newer files in a folder of text files). How can I best approach this?
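One possible approach, sketched here with assumptions, is to keep a small state file listing the file names that have already been processed and skip those on the next run. The function names and the JSON state-file format below are hypothetical and not part of the library; the sketch assumes PHP 7.1 or later and that the state file does not itself end in .txt.

```php
<?php
// Hypothetical helpers; names and the JSON state-file format are assumptions.

// Load the map of already-parsed file names (file name => timestamp).
function loadParsedState(string $stateFile): array
{
    if (!is_file($stateFile)) {
        return [];
    }
    $decoded = json_decode((string) file_get_contents($stateFile), true);

    return is_array($decoded) ? $decoded : [];
}

// Return the .txt files in $dir that are not recorded in the state file yet.
function findUnparsedFiles(string $dir, string $stateFile): array
{
    $seen = loadParsedState($stateFile);
    $fresh = [];
    foreach (new FilesystemIterator($dir, FilesystemIterator::SKIP_DOTS) as $fileInfo) {
        if ($fileInfo->isFile()
            && $fileInfo->getExtension() === 'txt'
            && !isset($seen[$fileInfo->getFilename()])
        ) {
            $fresh[] = $fileInfo->getPathname();
        }
    }
    sort($fresh);

    return $fresh;
}

// Record a file as parsed so the next run skips it.
function markAsParsed(string $path, string $stateFile): void
{
    $seen = loadParsedState($stateFile);
    $seen[basename($path)] = time();
    file_put_contents($stateFile, json_encode($seen));
}
```

A run would then loop over findUnparsedFiles(), pass each file's contents to $parser->parseText(), and call markAsParsed() once the file has been handled. Tracking by file name is the simplest choice; a content hash would be more robust if files can be renamed or overwritten.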
My template is here.
{%title%}<br />
Türkçe Adı:{%name%}<br />
Soyadı:{%surname%}<br />
Telefon :{%phonenumber%}<br />
Faks:{%zip%}<br />
E-Posta:{%email%}<br />
Ağ:{%website%} <br />
Şirket:{%company%}<br />
Ülke: {%country%}<br />
<br />
{%description%}
And our input here.
020. Tunus'tan safran yağı alım talebi
Tunus'tan safran yağı alım talebi
Türkçe Adı: Zayani Karim
Soyadı:
Telefon: 0021625453108
Faks:
E-Posta: [email protected]
Ağ:
Şirket: الجاذبية للاستيراد والتصدير Yerçekimi İthalat ve İhracat
Ülke: Tunus
اريد شراء زيت عطر الزعفران
Safran yağı almak istiyorum
Why doesn't it work?
Hey,
Not sure if you can assist with this but figured I'd post an issue in case you have some insight.
I've been using this awesome library for some time [thanks!] but recently ran into some issues with a client that was sending emails encoded in base64 format, and having an issue matching the decoded (plaintext) message to any templates.
I've never had to decode base64 encoded emails until recently, and discovered that simply using imap_base64 wasn't working in terms of getting them to plaintext, and that the decoded message was in HTML format.
So I've attempted to make use of the html2text library in order to convert the decoded base64 messages and remove the HTML formatting, however, the php-unstructured-text-parser doesn't seem to be matching any of the data defined within the template when running it against a data variable that contains the plaintext data formatted by html2text.
However, if I parse an actual email composed with the body of the data created/formatted by this html2text library, it does work.. so I'm somewhat left scratching my head here as to why it won't work when comparing against this same data [stored in a variable].
I'm thinking of just shooting another email out into the queue for parsing (composed of the data generated by html2text) and re-parsing it that way, but this isn't ideal. If you have any suggestions on where I can improve this (or improvements to what I've described), let me know!
Also note that I'm using the dev-master branch of the library with my application, since I too was having similar issues parsing plaintext emails and having line breaks ignored in my templates.
Thanks!
Hello,
I have the following text (12345678 John Anthony Doe), where the names can vary. I have tried the template ({%id%} {%name%}), but I end up with the values "12345678 John Anthony" and "Doe".
Do you have any advice on how I can write the template so that it takes the number alone and then the rest of the text string?
Thank you in advance
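As a point of comparison, the intended split (digits first, then everything else) can be sketched with plain PCRE. This is not the library's template syntax; the helper name `splitIdAndName` is hypothetical and the sketch assumes PHP 7.1 or later and that the surrounding parentheses are not part of the text.

```php
<?php
// Hypothetical helper using plain PCRE, not the library's template syntax.
function splitIdAndName(string $line): ?array
{
    // (\d+)  : the leading run of digits (the ID)
    // \s+    : the whitespace separating the ID from the name
    // (.+?)  : everything else on the line, inner spaces included
    if (preg_match('/^\s*(\d+)\s+(.+?)\s*$/u', $line, $m) !== 1) {
        return null; // line does not start with a number followed by text
    }

    return ['id' => $m[1], 'name' => $m[2]];
}

print_r(splitIdAndName('12345678 John Anthony Doe'));
```

Restricting the first group to digits is what keeps the ID from absorbing part of the name; an unrestricted greedy capture has no such boundary, so it takes as much text as it can.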
Thanks for your package, it looks very useful :)
Unfortunately, we've been trying to use it with Laravel 10, but Composer gets hung up on the following dependencies:
- Root composer.json requires aymanrb/php-unstricted-text-parser ^2.3 -> satisfiable by aymanrb/php-unstricted-text-parser[v2.3.0].
- aymanrb/php-unstricted-text-parser v2.3.0 requires psr/log ^1.1 -> found psr/log[1.1.0, …, 1.1.4] but the package is fixed to 3.0.0 (lock file version) by a partial update and that version does not match. Make sure you list it as an argument for the update command.
- monolog/monolog 3.3.1 requires psr/log ^2.0 || ^3.0 found psr/log[2.0.0, 3.0.0] but the package is fixed to 1.0.0 (lock file version) by a partial update and that version does not match. Make sure you list it as an argument for the update command.
- laravel/framework v10.11.0 requires monolog/monolog ^3.0 satisfiable by monolog/monolog[3.3.1].
- laravel/framework is locked to version v10.11.0 and an update of this package was not requested.
I've bumped the dependency locally and it seems to work. I will attach a pull request with the correction for your consideration.
Please add the possibility to change the default logs directory, maybe as an optional constructor argument or via a setter method.
In the fix implemented for issue #10 (released in version 1.2.2), the iteration was changed to skip the "dot" directories returned by the DirectoryIterator used for the template file lookup.
It would be better to use FilesystemIterator instead, which avoids the faulty directory lookup altogether, as attempted by @carriera in PR #15.
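A minimal sketch of that idea follows. It is not the library's actual source: the function name `loadTemplates` is hypothetical and the sketch assumes PHP 7.0 or later. FilesystemIterator with the SKIP_DOTS flag never yields "." or "..", and the isFile() check additionally skips real subdirectories.

```php
<?php
// Sketch only: mirrors the idea described above, not the library's code.
function loadTemplates(string $templatesDir): array
{
    $templates = [];

    // SKIP_DOTS leaves out the "." and ".." entries that DirectoryIterator returns.
    $iterator = new FilesystemIterator($templatesDir, FilesystemIterator::SKIP_DOTS);

    foreach ($iterator as $fileInfo) {
        if (!$fileInfo->isFile()) {
            continue; // skip subdirectories so file_get_contents() only sees files
        }
        $templates[$fileInfo->getPathname()] = file_get_contents($fileInfo->getPathname());
    }

    return $templates;
}
```

With this shape, file_get_contents() is never called on a directory, which is the situation behind the "Permission denied" warning reported on Windows.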
with a variable number of them.
thx
Hi,
thank you for the project.
A question: is it possible to get back the name of the automatically chosen template?
Greets,
Robert
Hello,
$parser = new aymanrb\UnstructuredTextParser\TextParser('application/controllers/template');
$textToParse = 'hallo das ist ein test 4126552 und hier geht es weiter DFGHJKL mit dem Text ...';
//file_put_contents('application/controllers/template/test2.txt',$ordner['body']['html']);
print_r(
    $parser
        ->parseText($textToParse, false)
        ->getParsedRawData()
);
Template:
hallo das ist ein test {%name%} und hier geht es weiter {%namfe%} mit dem Text ...
It returns an empty array.
What am I doing wrong?
Mac OS M1
PHP 7.4
Hi there,
I just realized that you released an update a few years ago that allows a regex to be used when targeting data to parse. However, when I try to use it, the script throws an error:
Got error 'PHP message: PHP Warning: preg_match(): Compilation failed: quantifier does not follow a repeatable item at offset 249 in /vendor/aymanrb/php-unstructured-text-parser/src/TextParser.php on line 68
PHP message: PHP Fatal error: Uncaught TypeError: array_keys() expects parameter 1 to be array, null given in /vendor/aymanrb/php-unstructured-text-parser/src/TextParser.php
I am using $parser->parseText($message)->getParsedRawData(); in conjunction with this, if that helps.
I'm simply trying to extract a phone number from the text, something like +17785542644, using a variable with a regex such as {%customer_phone:^\+\d{1,15}$%}
Using a plain variable such as {%customer_phone%} causes no issue; the error only appears when I attempt to use a regular expression.
Let me know if you have any insights! Thank you.