thomasweinert / fluentdom

A fluent api for working with XML in PHP

Home Page: https://thomas.weinert.info/FluentDOM/

License: MIT License

PHP 99.98% HTML 0.01% Batchfile 0.02%

php xpath jquery-api xml dom fluentdom xmlreader xmlwriter

fluentdom's Issues

FluentDOM::load($html, 'text/html') crashes the browser

When running this code, the website crashes with error ERR_EMPTY_RESPONSE...

    $html='<a>hello</a>';
    require_once($this->plugin_dir . '_inc/php/autoload.php');
    $fd = FluentDOM::load($html, 'text/html');

The line

    $fd = FluentDOM::load($html, 'text/html');

makes it crash.
Any idea?

Select the top level nodes for new Fluent\Query instances.

Select the top level nodes (document element) for new instances if a manipulation function is used before find().

Using html fragments instead of CssSelector wrapping source code in <html> tags

When passing fragments of HTML into FluentDOM::QueryCss(), it wraps them in <html> ... tags. Is it possible not to have the fragment wrapped?

Icalendar loader

Port the ical to xcal converter from Carica Status Monitor to FluentDOM, providing a loader for ical.

Refactor func_get_args() calls

Replace calls using func_get_args() with variadics. This will increase the minimum version requirement to PHP 5.6, so it should be a major release.

Double decode of characters -> bytes.

I think I'm seeing a double decode error on a utf-8 string.

In the test below, the href attribute is a 'RIGHT SINGLE QUOTATION MARK' which is U+2019 aka the bytes e2 80 99 .

When I do $element->getAttribute('href'); the byte values present are c3, a2, c2, 80, c2, 99.

These just happen to be the characters U+00E2, U+0080, U+0099 - i.e. it appears the right quotation mark is decoded to bytes, and then those bytes are then decoded again.

// U+2019 e2 80 99 RIGHT SINGLE QUOTATION MARK
// U+00E2 c3 a2 LATIN SMALL LETTER A WITH CIRCUMFLEX
// U+0080 c2 80
// U+0099 c2 99

<?php

use FluentDOM\Document;
use FluentDOM\Element;

require_once(__DIR__.'/../../vendor/autoload.php');

$rightQuoteMark = "’";

if (!function_exists('getRawCharacters')) {
    function getRawCharacters($result)
    {
        $resultInHex = unpack('H*', $result);
        $resultInHex = $resultInHex[1];
        $resultSeparated = implode(', ', str_split($resultInHex, 2)); //byte safe
        return $resultSeparated;
    }
}

echo "Raw characters are: " . getRawCharacters($rightQuoteMark) . "\n";

$html = <<< HTML
<html>
<body>
<a href="%s"><span>blah</span>
</a>

</body>
</html>
HTML;

$html = sprintf($html, $rightQuoteMark);

$document = new Document();
$document->loadHTML($html);
$linkClosure = function (Element $element) {
    $href = $element->getAttribute('href');
    echo "href chars after parsing are: " . getRawCharacters($href) . "\n" ;
};

$document->find('//a')->each($linkClosure);


// FluentDOM 5.3.0
// "reference": "19c5a3c77c91871d2a2545949b5bde20889fcb45"
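
The byte pattern reported above is what one would expect if the HTML parser read the UTF-8 input as a single-byte encoding, which libxml's HTML parser can do when the markup carries no charset declaration. A minimal sketch of one possible workaround, using PHP's built-in DOMDocument rather than FluentDOM, and assuming it is acceptable to add a meta element that declares the charset:

```php
<?php
// U+2019 RIGHT SINGLE QUOTATION MARK as UTF-8 bytes e2 80 99.
$rightQuoteMark = "\xE2\x80\x99";

// Assumption: declaring the charset in the markup is acceptable for this input.
// The meta element tells the HTML parser how to decode the bytes that follow.
$html = '<html><head>'
    . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    . '</head><body><a href="' . $rightQuoteMark . '"><span>blah</span></a></body></html>';

$document = new DOMDocument();
$document->loadHTML($html);

$href = $document->getElementsByTagName('a')->item(0)->getAttribute('href');
$hrefHex = implode(', ', str_split(bin2hex($href), 2));
echo "href bytes after parsing: " . $hrefHex . "\n";
```

If the attribute comes back as e2, 80, 99 with the declaration in place, the double decoding most likely stems from the missing charset hint rather than from the attribute lookup.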

Package suggests fluentdom/css-selector

Package suggests fluentdom/css-selector, but it should be fluentdom/selectors-phpcss.

Allow to provide options for HTML/XML loader

Newer libxml versions have several options that control the loading process. It could be useful to wrap those options.

Not all options are available widely at the moment. Some emulation for the features like LIBXML_HTML_NODEFDTD might be useful.

Add interface NonDocumentTypeChildNode

https://dom.spec.whatwg.org/#nondocumenttypechildnode

Resolve namespaces in DOMElement methods

Overload all methods that have an *NS version to resolve namespaces using the document defined namespaces. Some are already implemented. Add the missing methods.

getAttribute()
getAttributeNode()
getElementsByTagName()
hasAttribute()
removeAttribute()
setAttribute()
~~setAttributeNode()~~
setIdAttribute()

Css selectors are case-sensitive

I'm not sure if it's a FluentDOM issue or not. I believe css selectors should be case-insensitive but they are not.

$fd = FluentDOM::QueryCss('<div></div>')
    ->find('DIV')
    ->text('Hello World!');
echo $fd->document->toHtml(); //returns <div></div> (symfony css converter)

I have an html file which I read in and then feed to FluentDOM::QueryCss.
When I select a cell and read the string value from it, FluentDOM returns "FlÃ¤misch-Brabant" (which is wrong, it should return "Flämisch-Brabant").

Html source:

Any idea how I can fix this?

Cheers,
Wouter

Add Element::find()

Evaluates the expression expecting a node list, but returns a FluentDOM\Query instance.

Usage:

$dom = new FluentDOM\Document();
$dom->loadXml($xml);

foreach ($dom->find('//atom:entry') as $entry) {
  echo $entry->find('atom:title')->text();
}

This allows an alternative access to the fluent api.

Improve error messages for fragment parsing errors

At the moment a parsing error might result in just the message Invalid/empty content parameter.. If this is caused by a fatal error in the parsing, it would be nice to include information about that error in the message.

Implicit namespaces in html?

Is there any way to have custom tags with implicit namespaces left alone? For example, I have custom tags along the lines of:

<wt:folder id="123" foo="bar">blah blah</wt:folder>
<wt:person id="1234" />

I use some parsers to convert certain blocks of html to other html structures, each parser processes the html and then ends up returning the finished html with:

$fd = \FluentDOM::load($html, 'text/html');
$fd->registerNamespace('wt', 'urn:wt');
// do some stuff here with the nodes
return new \FluentDOM\HTML5\Serializer($return->document);

Eventually, when all the parsers are finished, I have to seemingly load it back into FluentDOM to be able to get just the content of the body tag (unless there's a way to output the content without the body wrappers and doctype, etc.?):

$fd = \FluentDOM($html, 'text/html');
$fd->registerNamespace('wt', 'urn:wt');
return $fd->find('body')->html();

But it will output the custom tag as something like:

<folder foo="bar" something="other" xmlns:wt="">blah blah</folder>

Is there any way to retain the original format of the tag?

Initialising FluentDOM query changes the underlying html

If I compare the output of

   $fd = FluentDOM::QueryCss($output, 'text/html');
   die($fd);

and

die($output);

I notice that the output differs. Now, I have not done a single selection or change, only loaded the html and echoed it. What it seems to do is try to close tags, but it does so incorrectly.
In the middle of a bit of javascript it breaks the document.

This is what the original looks like if I don't run it through FluentDOM at all

   ...

   }).on('error', function (event, id, name, errorReason, xhrOrXdr) {
                            $('#restricted-fine-uploader .flashmessage-error').remove();
                            $('#restricted-fine-uploader').append('<div class="flashmessage flashmessage-error">' + errorReason + '<a class="close" onclick="javascript:$(\'.flashmessage-error\').remove();" >X</a></div>');

   ...

But if it is loaded into FluentDOM and echoed right away this changes to this

   ...

   }).on('error', function (event, id, name, errorReason, xhrOrXdr) {
                            $('#restricted-fine-uploader .flashmessage-error').remove();
                            $('#restricted-fine-uploader').append('<div class="flashmessage flashmessage-error">' + errorReason + '<a class="close" onclick="javascript:$(\'.flashmessage-error\').remove();" >X</script>
</fieldset>
</form>
</div>');

   ...

The closing of the a tag is removed and a closing script tag is inserted instead, along with several other tags. My gut feeling is that it has something to do with the handling of scripts and of text strings within scripts that contain html.

How to get attribute value by using xpath?

Hello, I am using FluentDOM. Here is an XML example:

<?xml version="1.0" encoding="UTF-8"?>
<a a1="xxx">
    <b bid="p1">
        <c>1</c>
        <d>2</d>
    </b>
    <b bid="p2">
        <c>3</c>
        <c>3</c>
        <c>3</c>
        <d name="k1" value="v1"></d>
        <d name="k2" value="v2"></d>
        <e>5</e>
    </b>
</a>

I want to iterate each element and get attribute 'bid'. Here is my php code:

$nodes = FluentDOM::Query($xml, 'text/xml')
            ->find('/a/b');
foreach ($nodes as $node) {
            $elements = $node->find("./@bid");
            echo count($elements);
        }

It prints out '0', '0', which means there is no result found. I just want to get attribute 'bid',
so can anyone help me point it out?
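
One way to read the attribute, sketched here with PHP's built-in DOMXPath instead of FluentDOM (so it does not explain why find("./@bid") returned nothing), is to evaluate string(@bid) relative to each b element. The XML is shortened from the example above.

```php
<?php
$xml = '<a a1="xxx">'
    . '<b bid="p1"><c>1</c><d>2</d></b>'
    . '<b bid="p2"><c>3</c><e>5</e></b>'
    . '</a>';

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);

$bids = [];
foreach ($xpath->evaluate('/a/b') as $node) {
    // string(@bid) casts the attribute node to its value; the second
    // argument makes the expression relative to the current <b> element.
    $bids[] = $xpath->evaluate('string(@bid)', $node);
}

echo implode(', ', $bids), "\n";
```

Since each node is an element, calling $node->getAttribute('bid') inside the loop would be an equivalent shortcut.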

Support Symfony/CssSelector as CSS to Xpath converter

Currently FluentDOM allows the use of CSS selectors if Carica/PhpCss is found. If PhpCss is not installed but Symfony/CssSelector is, use the latter to translate the CSS selectors to XPath.

Reduce unnecessary XPATH evaluation

@f433aa41
find() method always uses option Nodes\Fetcher::UNIQUE

I think UNIQUE is not necessary in many situations, but it can cause a large amount of computation and use a lot of CPU resources.

This simple find could lead to hundreds of XPath evaluations.

 $html->find('table#content tr');

Add interface ParentNode

https://dom.spec.whatwg.org/#parentnode

Cleanup and refactor the examples

The examples directory has really grown over the years. As has the FluentDOM API. So the directory needs a major cleanup.

append HTML with <a href... including & fails

Appending a string of html fails if it contains <a href with & in the url.
Error message given is:

Invalid/empty content parameter.

If I first replace the & with _ or some other character, the html is appended just fine.

The query is created with.

$fd = FluentDOM::QueryCss($output, 'text/html');

The document is then extracted and passed on

$document = $fd->getDocument();

The document is then used for the actual appending

try {
    FluentDOM::QueryCss($document)->find('button')->parent()->after($MY_HTML_STRING);
}
catch(\Exception $e) {
    die($e->getMessage());
}

A quick test can be done with something like

FluentDOM::QueryCss($document)->find('body')->append('<a href="http://www.google.com?test1=foo&test2=bar">FooBar</a>');

which fails, and

FluentDOM::QueryCss($document)->find('body')->append('<a href="http://www.google.com">FooBar</a>');

which works fine.

I can add that the Symfony css selector is used right now. I'm not sure if it's the same for the other ones, but I guess the "issue" is not with the selector but deeper in the library.
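
A likely explanation, assuming the appended string is parsed as an XML fragment (another entry in this list says fragment strings are treated as XML fragments), is that a bare & is not well-formed XML and has to be written as &amp;. The sketch below shows the effect with PHP's built-in DOMDocumentFragment::appendXML(); it is not FluentDOM's own code path.

```php
<?php
$document = new DOMDocument();
$document->loadHTML('<html><body><button>Go</button></body></html>');

$url = 'http://www.google.com?test1=foo&test2=bar';

// A bare "&" is not well-formed XML, so the fragment parser rejects it.
// The @ only silences the parser warning; the return value carries the result.
$raw = $document->createDocumentFragment();
$rawOk = @$raw->appendXML('<a href="' . $url . '">FooBar</a>');

// Escaping the URL turns "&" into "&amp;", which the parser accepts.
$escaped = $document->createDocumentFragment();
$escapedOk = $escaped->appendXML('<a href="' . htmlspecialchars($url) . '">FooBar</a>');

$body = $document->getElementsByTagName('body')->item(0);
$body->appendChild($escaped);

// The DOM stores the decoded value, so the original URL is intact.
$link = $document->getElementsByTagName('a')->item(0);
echo $link->getAttribute('href'), "\n";
```

Escaping the attribute value before building the markup string, rather than replacing & with another character, keeps the URL unchanged in the resulting document.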

Appending two query objects

Please consider the following code:

$first = FluentDOM::QueryCss('<input/>');
$second = FluentDOM::QueryCss('<div></div>');

The first line below works as expected.

   echo $second->find(':root')->append($first->document->toHtml())->document->toHtml(); // works
   echo $second->find(':root')->append($second->document->toHtml())->document->toHtml(); // fails

But the second line fails with the following exception:

Fatal error: Uncaught InvalidArgumentException: Invalid/empty content parameter. in D:\www\www\lab\vendor\fluentdom\fluentdom\src\FluentDOM\Nodes\Builder.php:108
Stack trace:
#0 D:\www\www\lab\vendor\fluentdom\fluentdom\src\FluentDOM\Query.php(242): FluentDOM\Nodes\Builder->getContentNodes('<input>\n')
#1 D:\www\www\lab\vendor\fluentdom\fluentdom\src\FluentDOM\Query.php(273): FluentDOM\Query->apply(Array, '<input>\n', Object(Closure))
#2 D:\www\www\lab\vendor\fluentdom\fluentdom\src\FluentDOM\Query.php(814): FluentDOM\Query->applyToSpawn(Array, '<input>\n', Object(Closure))
#3 D:\www\www\lab\qp-test.php(8): FluentDOM\Query->append('<input>\n')

Is there any workaround? Is there a more concise alternative to append one QueryCss tag to another?

Add DOMText::replaceWholeText()

Add the replaceWholeText() method to FluentDOM\Text and FluentDOM\CdataSection.

https://www.w3.org/TR/DOM-Level-3-Core/core.html#Text3-replaceWholeText

Optionally treat strings as HTML fragments

If the FluentDOM\Query instance is in html mode (content type) treat the provided fragment string as HTML fragments, not XML fragments.

Loading from a DOMNode

Hi,
I have a complex system that performs different kinds of handling on a DOM document using PHP's DOMDocument. For one part, I have chosen FluentDOM to handle only a special part of the document (a large element). Is it possible to load FluentDOM with a DOMNode object?
Right now we can only load a whole document but, for performance reasons, I don't want to reload it. It would be great if I could pass that special DOMNode to FluentDOM.

Something like this is what we do in jquery, where "element" can be a jquery object or a DOM object:
$(element).text('foo bar');

null parameter does not work as expected

When for example $dom->text($text) is invoked while $text=null, the method acts as a getter while it is intended to be a setter (following jQuery). This is also true for attr('foo',null) and others. Sending null and sending no parameter needs to be distinguished. I think public function html($html = NULL) {} need to be converted to public function html() {} and the arguments fetched by func_get_args function.
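
The distinction asked for here can be sketched with a variadic parameter, the replacement for func_get_args() proposed elsewhere in this list. TextHolder is a hypothetical stand-in class, not part of FluentDOM; it only shows how "no argument" and an explicit null can be told apart.

```php
<?php
// Hypothetical stand-in class, not FluentDOM's API.
final class TextHolder
{
    private $text = 'initial';

    public function text(...$arguments)
    {
        if (count($arguments) === 0) {
            // Called without arguments: act as a getter.
            return $this->text;
        }
        // Called with an argument, even null: act as a setter.
        // Casting null to string yields an empty string.
        $this->text = (string) $arguments[0];
        return $this;
    }
}

$holder = new TextHolder();
$before = $holder->text();              // getter
$after = $holder->text(null)->text();   // setter with null, then getter

var_dump($before, $after);
```

With a default of NULL in the signature the two calls are indistinguishable, which is why the argument count is checked instead of the argument value.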

Implement ArrayAccess in Element

Try to support ArrayAccess in FluentDOM\Element. If the key is an integer or a string of digits, the child node should be returned. If it is any other string, return the attribute.

$node[42] is $node->childNodes->item(42)

$node['id'] is $node->getAttribute('id')
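
A minimal sketch of the proposed behaviour, built on PHP's DOMElement and DOMDocument::registerNodeClass(). ArrayAccessElement is a hypothetical class name for illustration, not FluentDOM\Element, and making child nodes read-only is an assumption of this sketch.

```php
<?php
// Hypothetical subclass for illustration; not FluentDOM\Element itself.
class ArrayAccessElement extends DOMElement implements ArrayAccess
{
    // Integers and strings of digits address child nodes, anything else attributes.
    private function isIndex($offset): bool
    {
        return is_int($offset)
            || (is_string($offset) && preg_match('/^\d+$/D', $offset) === 1);
    }

    public function offsetExists($offset): bool
    {
        return $this->isIndex($offset)
            ? $this->childNodes->item((int) $offset) !== null
            : $this->hasAttribute($offset);
    }

    #[\ReturnTypeWillChange]
    public function offsetGet($offset)
    {
        return $this->isIndex($offset)
            ? $this->childNodes->item((int) $offset)
            : $this->getAttribute($offset);
    }

    public function offsetSet($offset, $value): void
    {
        if ($this->isIndex($offset)) {
            throw new LogicException('Child nodes are read-only in this sketch.');
        }
        $this->setAttribute($offset, (string) $value);
    }

    public function offsetUnset($offset): void
    {
        if ($this->isIndex($offset)) {
            throw new LogicException('Child nodes are read-only in this sketch.');
        }
        $this->removeAttribute($offset);
    }
}

$document = new DOMDocument();
// Make the document create ArrayAccessElement instances for all elements.
$document->registerNodeClass('DOMElement', 'ArrayAccessElement');
$document->loadXML('<list id="main"><item>first</item><item>second</item></list>');

$node = $document->documentElement;
echo $node['id'], "\n";
echo $node[1]->textContent, "\n";
```

Note that childNodes also contains text nodes, so in documents with whitespace between elements the numeric index does not count elements only.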

HTML fragment loader

A loader that loads html fragments, not adding html and body automatically. It might be possible to extend the HTML loader that way.

Loader for JSONx

Add a loader for JSONx. This loader would convert JSONx into JsonDOM, allowing easier Xpath expressions.

<json:object>
  <json:string name="ticker">IBM</json:string>
</json:object>

would be converted to:

<json:json>
  <ticker>IBM</ticker>
</json:json>

If there is a loader, it would make sense to add a serializer, too, so you can save the loaded file in its original format.

How to load from the output of file_get_contents/curl?

Hello

As the title says, how do I do that?

I tried something like this:

echo FluentDOM($request->getResponseText())
              ->find('//title')
              ->text();

Here $request->getResponseText() is the response returned by curl, but it gives me errors:

Warning: DOMDocument::loadXML(): Entity 'eacute' not defined in Entity, line: 6033 in ..vendor\fluentdom\fluentdom\src\FluentDOM\Loader\Xml.php on line 38

Thanks :)
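
The warning suggests the response was parsed as XML, where named entities such as &eacute; are undefined unless a DTD declares them. The sketch below contrasts the two parsers using PHP's built-in DOMDocument; the sample markup is made up for illustration. Other reports in this list pass 'text/html' as the second argument to FluentDOM(), which would presumably be the FluentDOM counterpart.

```php
<?php
// Made-up sample response containing an HTML named entity.
$source = '<html><head><title>Caf&eacute;</title></head><body></body></html>';

// As XML: "eacute" is not one of the predefined XML entities,
// so the parser reports "Entity 'eacute' not defined".
$asXml = new DOMDocument();
$xmlLoaded = @$asXml->loadXML($source);

// As HTML: the HTML parser knows the named entity.
$asHtml = new DOMDocument();
$asHtml->loadHTML($source);
$title = $asHtml->getElementsByTagName('title')->item(0)->textContent;

echo $title, "\n";
```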

Carica/PhpCss VS Symfony/CssSelector

I really appreciate this excellent package. Could you please provide some information in the docs on choosing between Carica/PhpCss and Symfony/CssSelector?

html treated as xml?

Back in the day I used phpQuery for altering rendered pages just before they are send to the client. Since that project has been quite silent for a long time I decided to look for something else when I needed the same functionality again. I found a few and FluentDOM was one of them. I tested it along with the two officially supported css selectors.

I decided to start by benchmarking the same functionality in FluentDOM and phpQuery and was shocked to see that initialization, which took a few ms in phpQuery, took almost 200ms in FluentDOM.

$query = FluentDOM::QueryCss($html);

After a while I figured out that there was a large overhead when it tried to figure out what content it was given that I could get rid of by specifying that it was html I was feeding it with. So if I instead used

$fd = FluentDOM::QueryCss($html, 'text/html');

the time for initialization was on par with phpQuery.

So I started updating some old code that used phpQuery and everything went smoothly. But a few times I ran into a minor issue where it would complain about tag mismatching etc. I was confused since html, contrary to xml, is very loose with this stuff. But then I noticed that when working with the supplied html, internally it was handled as xml and causing these "errors" when for instance doing append operations etc.

One part of me likes that it complains, so I can spot any issue and fix it. But another part of me feels it's a bit confusing to be able to specify html and yet have it tested against the xml rules.

Is this by design, by mistake, a bug, or just a side effect of libxml and other underlying libraries used?

Allow array syntax for FluentDOM::attr

Implement a FluentDOM::attr property that allows getting and setting attributes using array syntax.

Examples:

 $fd->attr['foo'] = 'bar';
 $fd->attr = array('foo' => 'bar', 'bar' => 'foo');
 $value = $fd->attr['foo'];

YAML loader

A loader for YAML files maybe based on an existing library. It should convert it into a JsonDOM representation.

$fd = FluentDOM::load($source, 'text/yaml');

Add interface ChildNode

https://dom.spec.whatwg.org/#childnode

Add a clone/copy method to create a duplicate with a different source (optional)

A method that duplicates/clones the current FluentDOM object, its document, namespaces and loaders.

If no source argument is provided, it will copy the references to the matched nodes, too. If a source is provided, it will load it.

Add QuerySelectors

It would be possible to add query selectors to the extended DOM classes, but it would require a CSS selector library and I am not sure it is needed.

XPath is a lot more powerful and Query selectors do not support XML namespaces (by definition).

On the other hand, FluentDOM already supports CSS selectors for the FluentDOM\Query class.

setAttributeNodeNS()

setAttributeNodeNS() actually behaves differently from setAttributeNode(). Think about redefining the behavior or at least documenting it:

$dom = new DOMDocument();
$dom->formatOutput = TRUE;
$dom->appendChild($dom->createElement('element'));
$dom->documentElement->setAttributeNS('urn:foo', 'foo:attribute', 42);
$attribute = $dom->createAttributeNS('urn:bar', 'bar:attribute');
$attribute->value = 21;
$dom->documentElement->setAttributeNode($attribute);
echo $dom->saveXml();

$dom = new DOMDocument();
$dom->formatOutput = TRUE;
$dom->appendChild($dom->createElement('element'));
$dom->documentElement->setAttributeNS('urn:foo', 'foo:attribute', 42);
$attribute = $dom->createAttributeNS('urn:bar', 'bar:attribute');
$attribute->value = 21;
$dom->documentElement->setAttributeNodeNS($attribute);

echo $dom->saveXml();

Output

<?xml version="1.0"?>
<element xmlns:foo="urn:foo" xmlns:bar="urn:bar" bar:attribute="21"/>
<?xml version="1.0"?>
<element xmlns:foo="urn:foo" xmlns:bar="urn:bar" foo:attribute="42" bar:attribute="21"/>

Add plugin interface for fragment loaders

Allow to register/inject fragment loaders that are used to parse string arguments for methods like FluentDOM\Query::append() depending on the content type. Allow the current loader to register itself for this, too.

Allow DOMNodeList argument for Document::saveXml()/Document::saveHtml()

DOMDocument::saveXml() (and saveHtml()) allow a node as argument. Here is a ticket in the PHP Bugtracker that suggests to allow node lists as well.

It should be possible to implement it in FluentDOM\Document without relying on the PHP implementation.

Problems with ETAGO's in HTML5 Documents

Since Fluent uses DOMDocument as its HTML parser, it suffers from a limitation of DOMDocument, in that any ETAGO's contained within a SCRIPT tag will prematurely end the script block, causing your script to fail. For example, the following block:

<script type="text/template" id="tmpl-variation-template">
    <div class="woocommerce-variation-description">
        {{{ data.variation.variation_description }}}
    </div>
</script>

will be transformed by FluentDOM into:

<script type="text/template" id="tmpl-variation-template">
    <div class="woocommerce-variation-description">
        {{{ data.variation.variation_description }}}
    </script>
</div>

This issue is discussed in detail at the following URLs:
http://stackoverflow.com/questions/4029341/dom-parser-that-allows-html5-style-in-script-tag
https://mathiasbynens.be/notes/etago

I'm wondering if there's any way you could extend FluentDOM\Document to work around this DOMDocument limitation and handle this properly.

Replace getMock()

PHPUnit_Framework_TestCase::getMock() is deprecated.

Provide security contact information

I can't find any information in the README where to report security vulnerabilities. Please add a section with security contact information.

FluentDOM 5.x 6x slower than 4.x ?

Why is this piece of code 6x slower with FluentDOM 5.x compared to 4.x ? (HHVM or not)

$fd = FluentDOM($html, 'text/html');
$r = array();
foreach ($fd->find("//tr[@class='product']") as $fd_child)
{
    $rr = array();
    $rr['imgsrc'] = $fd_child->find("td[@class='image']//img")->attr("src");
    $h3 = $fd_child->find("td[@class='specs']//h3");
    $rr['url'] = $h3->find("a")->attr("href");
    $rr['title'] = $h3->text();
    $rr['desc'] = $fd_child->find("td[@class='specs']")->xml();
    $rr['price'] = $fd_child->find("td[@class='purchase-info']//span[@itemprop='price']")->text();
    $rr['savings'] = $fd_child->find("td[@class='purchase-info']//p[@class='savings']")->text();
    $r[] = $rr;
}

How to install without composer ?

Hi,
I'm writing a Wordpress Plugin and would like to use FluentDOM + Selectors-Symfony selector within.
I don't know much about composer so I would like to avoid using it to install the whole thing.

So I downloaded FluentDOM and Selectors-Symfony.
I extracted them like this :
[..]/lib/FluentDOM-master
[..]/lib/Selectors-Symfony-master

and I'm loading FluentDOM like this :
if (!class_exists('FluentDOM')) require_once('[..]/lib/FluentDOM-master/src/FluentDOM.php');
But I'm not sure of the place where I extracted Selectors-Symfony, and I don't know how to "register" it.

Got PHP Fatal error: Interface 'FluentDOM\Node\QuerySelector' not found in...
When trying to run
$fd = FluentDOM::load($htmlstring, 'text/html');
Could you help me ?
Thanks !

Getting the element's tag name

Hi,
I pass nodes to different classes for special handling, and I want each class to check that the correct tag type has been provided. I can check whether required attributes are there with hasAttr, but how do I check that the correct tag is provided in the query?
I think it is possible to get the tag name by using the DOM object associated with the node, but that would somewhat defeat the purpose. How about adding a function for this?
Thanks

Allow array syntax for FluentDOMStyle::css

Implement a FluentDOMStyle::css property that allows getting and setting css style properties using array syntax.

  $border = $fd->css['border'];
  $fd->css['border'] = 'none';
  $fd->css = array('border' => 'none', 'color' => '#000');

This would be syntax sugar for the FluentDOMStyle::css method.

$dom->find(':root')->append($first->find(":root"))->document->toHtml(); vs $('<div>').append('<input>')

// FluentDOM
$first = FluentDOM::QueryCss('<input/>');
$second = FluentDOM::QueryCss('<div></div>');
echo $second->find(':root')->append($first->find(":root"))->document->toHtml();

// jQuery
$('<div></div>').append('<input/>');

The php version can be as simple as the js version.

VCard 4.0 loader

Import VCard 4.0 to its XML representation.

This shares a lot of logic with the iCalendar loader/format.