sunra / php-simple-html-dom-parser

PHP Simple HTML DOM Parser adaptation for Composer and PSR-0

PHP 44.55% CSS 2.11% JavaScript 1.85% HTML 51.49%


php-simple-html-dom-parser's Issues

Endless __destruct() loop when served via Laravel's php artisan serve command

When I use this library with Laravel and start the local development server with php artisan serve, the server gets stuck in an endless loop of calls to simple_html_dom_node->__destruct(). Once the maximum execution time is exceeded, the server logs:

Laravel development server started: <http://127.0.0.1:8000>
[Thu Jun 14 09:24:43 2018] PHP Fatal error:  Maximum execution time of 60 seconds exceeded in C:\Users\[REDACTED]\Desktop\Tests\PHP\blog\blog\vendor\sunra\php-simple-html-dom-parser\Src\Sunra\PhpSimple\simplehtmldom_1_5\simple_html_dom.php on line 140
[Thu Jun 14 09:24:43 2018] PHP Stack trace:
[Thu Jun 14 09:24:43 2018] PHP   1. simplehtmldom_1_5\simple_html_dom_node->__destruct() C:\Users\[REDACTED]\Desktop\Tests\PHP\blog\blog\vendor\sunra\php-simple-html-dom-parser\Src\Sunra\PhpSimple\simplehtmldom_1_5\simple_html_dom.php:0

I debugged the issue for a while but could not resolve it in any way other than deleting, renaming, or commenting out the destructors in the mentioned class.

Minimal, Complete, and Verifiable example / steps to reproduce:

1. Create a Laravel project by running composer create-project --prefer-dist laravel/laravel blog
2. cd blog and add the required dependency "sunra/php-simple-html-dom-parser": "^1.5" to composer.json
3. Run composer update to fetch the newly added dependency.
4. Open the file routes/web.php and update its contents to contain the following:

use Sunra\PhpSimple\HtmlDomParser;
Route::get('/',  function(){
    $input = <<<EOM
    <!-- PUT YOUR NON-TRIVIAL HTML MARKUP HERE -->                    
EOM;
    $parser = new HtmlDomParser();
    $dom = $parser->str_get_html($input);


    return view('welcome');
});

(1) Note that the placeholder comment should really be replaced with non-trivial markup, e.g. the source of google.com's front page.
5. Start the local development server with php artisan serve and open its address (it defaults to 127.0.0.1:8000).

(2) Note that I was not able to reproduce this on PHPv7.1.14 or PHPv7.1, but PHPv7.1.13, PHPv7.1.18 and even PHPv7.2 do exhibit this behavior.

I worked around this issue by adding a Composer script on the post-autoload-dump event that searches for and renames the destructors.

simple_html_dom_node::text() method is abnormal with embed tag

Example code

$content = "<div class=test><embed src='http://....swf' quality='high' width='480' height='400' align='middle' allowScriptAccess='always' allowFullScreen='true' mode='transparent' type='application/x-shockwave-flash'></embed></div>";
$dom = HtmlDomParser::str_get_html($content);
$newsContent = $dom->find(".test", 0)->text();
var_dump($newsContent);

Expected result:
string(0) ""
Result:
string(8) "</embed>"

innerText() not returning valid utf-8 string

When you get the text of an element that contains no child HTML elements, the returned string is not valid UTF-8. For an element such as

<h3>Технические работы на сервере<h3>

the string returned by innerText() is not encoded properly, whereas the string returned by outerText() has the proper encoding. This refers to the simple_html_dom_node class.

getting properties with - (dash) in them

Invalid character class

This pattern:

([\w-:\*]*)(?:\#([\w-]+)|\.([\w-]+))?(?:\[@?(!?[\w-:]+)(?:([!*^$]?=)["']?(.*?)["']?)?\])?([\/, ]+)
treats the - in both character classes as a range operator rather than a literal character, meaning the regex looks for everything between \w and : inclusive, rather than the three items by themselves. The same issue is repeated near the middle of the regex.

See pr #70
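
One way to make the hyphen unambiguous is to escape it or to place it last in the character class. The sketch below uses PHP's built-in preg_match with a simplified tag-name pattern; the patterns and sample inputs are illustrative assumptions, not the library's actual fix.

```php
<?php
// Illustrative only: a simplified tag-name check, not the library's full selector regex.
// Escaping the hyphen (or moving it to the end of the class) makes it a literal
// character instead of a possible range operator.
$escaped  = '/^[\w\-:]+$/';   // hyphen escaped
$trailing = '/^[\w:-]+$/';    // hyphen placed last

foreach (['p', 'my-tag', 'svg:rect'] as $tag) {
    $okEscaped  = preg_match($escaped, $tag) === 1;
    $okTrailing = preg_match($trailing, $tag) === 1;
    echo $tag, ': ', var_export($okEscaped && $okTrailing, true), PHP_EOL;
}
```

Both spellings should accept the same inputs, so which one to use is mostly a readability choice.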

Problem getting content from khmer24.com

Getting content from khmer24.com does not work.

//sample code
$html = file_get_html('http://khmer24.com');
print_r($html);

//result is blank page

Note: if I change the URL to http://google.com, it works.

Make composer.json valid

Running composer validate gives the following:

"./composer.json" does not match the expected JSON schema:
   - authors[0].name : The property name is required

Note: I have a patch file but do not have push access to the repository to create a pull request.

Is it possible to create tag 1.5.0 so as not to have "dev-master" in composer?

MAX_FILE_SIZE user definition

I've stumbled upon an edge case where the HTML reached the MAX_FILE_SIZE constant; it would be nice to be able to increase it.

It could be implemented easily by checking whether the constant is already defined, so that the user could define it first as necessary.

Even better would be an exception, so that users know what happened without diving into the library code itself.
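
The guard described here could look roughly like the sketch below. The value 600000 is assumed to be the library default and 2000000 is an arbitrary example override; both should be checked against the actual simple_html_dom.php.

```php
<?php
// Application code: define the limit BEFORE the library file is loaded.
// 2000000 is an arbitrary example value.
define('MAX_FILE_SIZE', 2000000);

// What the library could do instead of an unconditional define():
// only set the default when the user has not already chosen a value.
// (600000 is assumed to be the library default.)
if (!defined('MAX_FILE_SIZE')) {
    define('MAX_FILE_SIZE', 600000);
}

echo MAX_FILE_SIZE, PHP_EOL;
```

Because PHP constants cannot be redefined, the order matters: the application's define() has to run before the guarded one.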

getElementsByTagName() does not return array

If you try to get all the anchors like this, then by default one (the last) element is returned:
$anchors = $soup->getElementsByTagName('a');
The following does give me all the elements:
$anchors = $soup->getElementsByTagName('a', null);

The idx value defaults to -1. Is this on purpose?

:first-child is not working properly

I found a problem with the :first-child pseudo-class selector.

For this HTML

<div>
    <a href="javascript:void(0)">&times;</a>
    <div class="links">
        <ul>
            <li>
                <a href="https://github.com/">link 1</a>
                <span>(info)</span>
            </li>
            <li>
                <a href="https://github.com/">link 2</a>
                <span>(info)</span>
            </li>
        </ul>
    </div>
</div>

Selector .links > ul > li:first-child > a matches 0 elements, selector .links > ul > li > a matches two elements.
Expected behavior is that selector .links > ul > li:first-child > a matches this element:

<li>
    <a href="https://github.com/">link 1</a>
    <span>(info)</span>
</li>
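
For comparison, the intended :first-child semantics can be checked with PHP's built-in DOM extension. This is a swapped-in technique rather than Simple HTML DOM, and the XPath expression is only a rough, assumed equivalent of the CSS selector. Because the selector ends in > a, the node it yields is the a element inside the first li.

```php
<?php
// Swapped-in technique: PHP's built-in DOM extension, not Simple HTML DOM.
$html = <<<'HTML'
<div class="links">
    <ul>
        <li><a href="https://github.com/">link 1</a> <span>(info)</span></li>
        <li><a href="https://github.com/">link 2</a> <span>(info)</span></li>
    </ul>
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Rough XPath equivalent of ".links > ul > li:first-child > a":
// "*[1][self::li]" means "the first element child, provided it is an <li>".
$query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' links ')]"
       . "/ul/*[1][self::li]/a";
$matches = $xpath->query($query);

foreach ($matches as $a) {
    echo $a->textContent, PHP_EOL;
}
```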

Version 1.5.2: Uncaught error: Call to undefined function file_get_html()

After upgrading to v1.5.2, calling file_get_html() always produces this error.

We are currently using v1.5.0, but after updating, Composer shows this error:

Your requirement could not be resolved to an installation set of packages.

Problem 1

The requested package sunra/php-simple-html-dom-parser 1.5.0 exists as sunra/php-simple-html-dom-parser[dev-master, v1.5.1, v1.5.2] but these are rejected by your constraints

curl request works only in local environment

Hello, I'm using the latest version of the parser (1.8.1), downloaded from the official SourceForge page. When I use the function file_get_html() to pull a webpage from a remote host, I get a warning that the request has timed out (at line 136). The warning occurs only when the request is made from a remote host/environment; it works perfectly fine when made from my local server.

Edit: That's the whole code on github - here

Additional edit: You can experience the warning/error in the integrated github environment or at my remote server...

What if I'm not sure whether the HTML contains h1, h2, h3 or h4?

Is there a way to avoid something like this?

$dom = HtmlDomParser::str_get_html($html_str);

if($dom->find('h1', 0))
    return $dom->find('h1', 0)->plaintext;
if($dom->find('h2', 0))
    return $dom->find('h2', 0)->plaintext;
if($dom->find('h3', 0))
    return $dom->find('h3', 0)->plaintext;
if($dom->find('h4', 0))
    return $dom->find('h4', 0)->plaintext;

Cannot find tags that have additional classes

The find method cannot find tags that have additional classes.

For example, I want to find all tags that have the 'services' class:

<div class='services'> or
<div class='services last-item'> or
<div class='services active'>

But, If I run:

$html->find('div[class=services]');

I will only get one result:

<div class='services'>
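
The difference between comparing the whole attribute value and matching a single class can be illustrated in plain PHP. hasClassToken() is a hypothetical helper written for this illustration, not part of the library; it treats the class attribute as a whitespace-separated list of tokens.

```php
<?php
// Hypothetical helper (not part of the library): true when $className is one of
// the whitespace-separated tokens in a class attribute value.
function hasClassToken(string $classAttr, string $className): bool
{
    $tokens = preg_split('/\s+/', trim($classAttr), -1, PREG_SPLIT_NO_EMPTY);
    return in_array($className, $tokens, true);
}

foreach (['services', 'services last-item', 'services active', 'other'] as $attr) {
    echo "'$attr' => ", var_export(hasClassToken($attr, 'services'), true), PHP_EOL;
}
```

With the library itself, the class selector form div.services may be the more natural way to express this, though that is not verified here.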

Possible to replace file_get_contents with curl?

I'm trying to fetch data from an external website, which requires many requests. In that case file_get_contents() may throw an authorization error. What are your thoughts on this?

Curly brackets cause unexpected behavior

This is a simple PHP script:

<?php
$html = HtmlDomParser::str_get_html('<html><body><span>a</span><span>{b</span><span>c}</span><span>d</span></body></html>');

foreach ($html->find('span') as $v) {
    echo $v->innertext."\n";
}
?>

I expected the following:

a
{b
c}
d

But the result is the following:

a
{b</span><span>c}
d

remove_noise breaking fields

PROBLEM
When parsing a document containing <input name="me" value="my {dog is nice">, the document is parsed incorrectly. The value property for $input in

   foreach($this->html->find("input[name='me']") as $input)

is "my {dog is nice" plus all the remaining HTML, instead of "my {dog is nice".

WORKAROUND
I commented out $this->remove_noise("'({\w)(.*?)(})'s", true); in the load method, but I think improving remove_noise so that it is aware of quotes would be a better solution.

Regards, Pablo.
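
The effect of that pattern can be reproduced with preg_match alone. In the sketch below the pattern is the one from the remove_noise() call quoted above, rewritten with ~ delimiters, and the sample markup is a hypothetical input: the lazy .*? runs from the opening brace to the next closing brace anywhere later in the document, regardless of quotes or tags.

```php
<?php
// Same pattern as in the remove_noise() call quoted above, written with "~" delimiters.
$pattern = '~({\w)(.*?)(})~s';

// Hypothetical sample: an opening brace in one attribute, a closing brace much later.
$html = '<input name="me" value="my {dog is nice"><p>later text with a } brace</p>';

// $m[0] holds everything the pattern would treat as a single block of "noise".
preg_match($pattern, $html, $m);
echo $m[0], PHP_EOL;
```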

Is it possible to make php-simple-html-dom-parser compatible with packages in Laravel 4?

http://dl.dropbox.com/u/2903731/c.png

No License Available

php-simple-html-dom-parser for PHP7.3 +

If you use PHP 7.3 or higher, use my edits. Otherwise, you will get errors due to the migration to PCRE2 in newer versions of PHP.

For example: Warning: preg_match_all(): Compilation failed: invalid range in character class at offset 4

New file

Warning: file_get_contents(): stream does not support seeking

I have been experiencing the error "Warning: file_get_contents(): stream does not support seeking..." since I upgraded to PHP 7.1.x

Any fixes ?

Is there any way to find if an element has a certain child or not?

The page that I am parsing has a td tag that in some cases contains an a tag and in some cases does not.
I have tried $row->find('td', 2)->find('a', 0), but it fails with an error saying it can't find a value on null.
Is there any way to check whether the child exists?

One way that I have found is count($row->find('td', 2)->find('a', 0)): if it returns 1 there is a child, and otherwise there is none.
Is there any other way to check?
Thanks in advance.

Not compatible with php 7.3

PHP Warning 'yii\base\ErrorException' with message 'preg_match(): Compilation failed: invalid range in character class at offset 4'

in .../sunra/php-simple-html-dom-parser/Src/Sunra/PhpSimple/simplehtmldom_1_5/simple_html_dom.php:1378

https://github.com/sunra/php-simple-html-dom-parser/blob/master/Src/Sunra/PhpSimple/simplehtmldom_1_5/simple_html_dom.php#L1378

No removeChild() function

As far as I can determine, Simple HTML DOM does not have a way to actually remove DOM elements from a document. This can be troublesome, especially if you're using mpdf to make a PDF file and there's an <svg> tag in there; mpdf flips out whenever it sees one.

There may be a good reason removeChild() has not been implemented, but as a suggestion for a future update, could such a function be implemented?

tag attributes missing

Here is my testing code:

$CC = <<<EOF
<p style="max-width: 100%; min-height: 1em; white-space: pre-wrap; color: rgb(62, 62, 62); text-align: center; font-family: 微软雅黑; font-size: 14px; line-height: 24px; box-sizing: border-box !important; word-wrap: break-word !important; background-color: rgb(255, 255, 255);"><img img_width="500" img_height="398" data-type="jpeg" data-ratio="NaN" data-w="0" width="auto" width="auto" data-src="http://mmbiz.qpic.cn/mmbiz/fZ6yVsBCVhLQdrDUBay4Ps1qhhKGiadibMIdicxOXx74cXsIVxk0Emib1XpZxHUXLuToWEMibPRr0I8noqtuWZfowNg/640?wx_fmt=jpeg"/></p>
EOF;

//well, load the class as u often do
//Loader::import('SimpleHtmlDom', 'html');
$DOM    = str_get_html($CC);
if ( $DOM == false )
{
    return false;
}

echo $DOM->innertext;

and output is:

<p style="max-width: 100%; min-height: 1em; white-space: pre-wrap; color: rgb(62, 62, 62); text-align: center; font-family: 微软雅黑; font-size: 14px; line-height: 24px; box-sizing: border-box !important; word-wrap: break-word !important; background-color: rgb(255, 255, 255);"><img img_width="500" img_height="398" data-type="jpeg" data-ratio="NaN" data-w="0" width="auto"></p>

As you can see, something is missing.

If I comment out the code fragment between lines 1488 and 1491, I get what I want:

if (isset($node->attr[$name]))
{
    return;
}

It may be a bug!

Parser strips new lines

Hello and thank you for your great work,

I'm using php-simple-html-dom-parser in a free project and am trying to fix a bug that occurred.

This is my code:

foreach ( $dom->find( 'text' ) as $element ) {
    if ( !in_array( $element->parent()->tag, [ 'a', 'pre', 'code' ] ) ) {
        foreach ( $markers as $marker ) {
            $text               = $marker[ 'text' ];
            $url                = $marker[ 'url' ];
            $tip                = strip_tags( $marker[ 'excerpt' ] );
            $tooltip            = ( $tooltip ? "data-uk-tooltip title='$tip'" : "" );
            $tmpval             = "tmpval-$i";
            $element->innertext = preg_replace(
                '/\b' . preg_quote( $text, "/" ) . '\b/i',
                "<a href='$url' $hrefclass target='$target' $tmpval>\$0</a>",
                $element->innertext,
                1
            );

            $element->innertext = str_replace( $tmpval, $tooltip, $element->innertext );
            $i++;
        }
    }
}

This code searches for text on a page and replaces words with other words.

It works fine.

But as I found out, this code is removing new lines from <pre><code>...</code></pre>:

This is an example-output using the code above:

<pre><code>&lt;div class=&quot;uk-form-row&quot;&gt;     &lt;label class=&quot;uk-form-label&quot;&gt;{{ &#39;Pages&#39; | trans }}&lt;/label&gt;     &lt;div class=&quot;uk-form-controls uk-form-controls-text&quot;&gt;         &lt;input-tree :active.sync=&quot;package.config.nodes&quot;&gt;&lt;/input-tree&gt;     &lt;/div&gt; &lt;/div&gt; </code></pre>

This is an example-output without using the code above:

<pre><code>&lt;div class=&quot;uk-form-row&quot;&gt;
    &lt;label class=&quot;uk-form-label&quot;&gt;{{ &#39;Pages&#39; | trans }}&lt;/label&gt;
    &lt;div class=&quot;uk-form-controls uk-form-controls-text&quot;&gt;
        &lt;input-tree :active.sync=&quot;package.config.nodes&quot;&gt;&lt;/input-tree&gt;
    &lt;/div&gt;
&lt;/div&gt;
</code></pre>

Please allow to set MAX_FILE_SIZE from outside

I'm having some trouble trying to parse documents > MAX_FILE_SIZE. Since this is a constant, I can't redefine this in a clean way. I think you could define this as a public static var in class simple_html_dom_node and use it from there.

file_get_contents(): stream does not support seeking

(1/1) ErrorException
file_get_contents(): stream does not support seeking

$html = HtmlDomParser::file_get_html('http://www.google.com/');

foreach($html->find('a') as $element)
echo $element->href . '
';

Installing php-simple-html-dom-parser in CodeIgniter using composer - doesn't load the library?

Hi,

I installed the library using Composer, so now I have a folder under "vendor/sunra/php-simple-html-dom-parser/....".
I am pasting my controller code here (using CodeIgniter); for some reason the library doesn't load properly.
I keep getting the error "Call to undefined function file_get_html()" when running the create_main_array() function.
Is there something that I'm not getting right?
I did include the autoload.php file, as with any other library installed with Composer, and this has worked until now.
I also added use Sunra\PhpSimple\HtmlDomParser;.

<?php 

require FCPATH. 'vendor/autoload.php';
use Sunra\PhpSimple\HtmlDomParser;

/******************************************/
/* 		example Scraping			*/
/******************************************/


class Example extends CI_Controller {

    public function __construct() {
    
        parent::__construct();
          
		// Check if the user is logged in else KICK!: 		   
	    if ( ! $this->session->userdata('is_logged_in') ) {
	    	redirect('login');
	    } 

		// Load 'kas_model' Model
		$this->load->model('users_model');
		$this->load->model('expenses_model');
		
		// Sets the server not to have a time out. 
		ini_set('max_execution_time', 0); 
		ini_set('memory_limit', '-1');		
		// More Of MySQL
		ini_set('mysql.connect_timeout','0');
		// Expand the array displays
		ini_set('xdebug.var_display_max_depth', 5);
		ini_set('xdebug.var_display_max_children', 256);
		ini_set('xdebug.var_display_max_data', 1024);
    }

	// Main Page
	public function index(){
		$this->load->view('header');
		$this->load->view('dashboard');
		$this->load->view('example/main_example');
		$this->load->view('footer');
	}


	// Gets a page [string] variable and returns a string of the HTML. 
	public function scrape_page($page) {

		// $string = file_get_contents($page);
		$string = file_get_html($page);
		
		return $string;

	}


	// Running this controller
	public function create_main_array() {

		$string = $this->scrape_page('https://example.com/websites');

		// Find all images 
		foreach($string->find('img') as $element) 
		       echo $element->src . '<br>';

		// Find all links 
		foreach($string->find('a') as $element) 
		       echo $element->href . '<br>';

	}

Does not work on php 7.1

when will php 7.1 be supported?

$html->load_file shows errors if a page doesn't exist.

I have a problem: when I use the $html->load_file method, it shows errors if a page doesn't exist.
The error says: 'Warning: file_get_contents(http://auto.desko.kg/car/24779): failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found in C:\xampp\htdocs\deskoparse\simple_html_dom.php on line 1080'. Is it possible to add checks to the parser so that it can determine whether such a page exists? Also, why doesn't that method return true if a page exists?
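
Until the parser offers such a check, one possible approach is to fetch the document separately and only parse it when the fetch succeeded. The sketch below is an assumption about how that could look in application code: fetchOrNull() is a hypothetical helper, and a missing local file stands in for the 404 page so the example runs without network access.

```php
<?php
// Hypothetical helper: returns the document body, or null when it cannot be read.
function fetchOrNull(string $location): ?string
{
    // "@" suppresses the warning; a failure is reported through the return value.
    $body = @file_get_contents($location);
    return $body === false ? null : $body;
}

// Illustrative missing resource standing in for a page that does not exist.
$missing = sys_get_temp_dir() . '/no-such-page-for-this-example.html';

$body = fetchOrNull($missing);
if ($body === null) {
    echo "Could not load the page", PHP_EOL;
} else {
    echo strlen($body), " bytes loaded", PHP_EOL;
}
```

A non-null result could then be passed to the parser as a string instead of letting the parser fetch the URL itself.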

How to get value from dom?

I have the following table - only 2 rows shown for brevity. How do I traverse the table to extract the price class value for Catalog ID 100245 i.e. H1?

 <tbody>
      <tr class="catalog_line">
         <td class="properties">
            <div class="grid-prop">
               <span class="label nom">Catalog ID</span>
               <span class="catdata1 cdatamarker">100245</span>
            </div>
            <div class="grid-prop nom">
                <span class="label">Product, price class</span>
                <span class="catdata1">
                  <span class="category">Cars</span>
                  , H1
                </span>
            </div>
        </td>
      </tr>
      <tr class="catalog_line">
         <td class="properties">
            <div class="grid-prop">
               <span class="label nom">Catalog ID</span>
               <span class="catdata1 cdatamarker">100246</span>
            </div>
            <div class="grid-prop nom">
                <span class="label">Product, price class</span>
                <span class="catdata1">
                  <span class="category">Cars</span>
                  , H1
                </span>
            </div>
        </td>
      </tr>
  </tbody>

file_get_contents &

On $contents = file_get_contents($url, $use_include_path, $context, $offset = 0), each & character is replaced with &amp;.
This started happening only when I changed my hosting provider; with the other hosting provider it works fine.

file_get_html returns false.

file_get_html returns false for this URL: https://tripadvisor.ca/Restaurant_Review-g255344-d724335-Reviews-Dynasty_Chinese_Restaurant-Launceston_Tasmania.html which can be loaded in the browser, but this URL works fine: 'https://tripadvisor.ca'

php 7.1 fix

//function file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
function file_get_html($url, $use_include_path = false, $context=null, $offset = 0, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)

Setting $offset = 0 fixes the problem with PHP 7.1.
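
The background, as far as the PHP manual describes it, is that PHP 7.1 added support for negative offsets in file_get_contents(), counted from the end of the stream, and that seeking is not supported for remote files. The sketch below uses a local temporary file (an assumption made so it runs without network access) to show both cases.

```php
<?php
// Uses a local temporary file so the example runs without network access.
$path = tempnam(sys_get_temp_dir(), 'offset_demo_');
file_put_contents($path, '<html><body>Hello</body></html>');

// Offset 0 reads from the start and needs no seek from the end of the stream.
$whole = file_get_contents($path, false, null, 0);
echo $whole, PHP_EOL;

// From PHP 7.1 on, a negative offset counts from the end of the stream,
// which requires a seekable stream; remote (HTTP) streams are not seekable.
if (PHP_VERSION_ID >= 70100) {
    $tail = file_get_contents($path, false, null, -1);
    echo $tail, PHP_EOL; // only the end of the file is returned
}

unlink($path);
```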

Fatal error: Maximum execution time of 60 seconds exceeded in \Sunra\PhpSimple\simplehtmldom_1_5\simple_html_dom.php on line 151

My class looks like this:

public function LayHinhTuDong9GagAction($id="aP9QwYV%2CaRjvbnq%2CaOB9wDy")
    {
        $client = new \GuzzleHttp\Client();
        $res = $client->request('GET', 'https://9gag.com/?id='.$id.'&c=10',
            [
                'headers' => [
                    'referer'=>'https://9gag.com/',
                    'x-requested-with'=>'XMLHttpRequest',
                    'method'=>'GET',
                    'authority'=>'9gag.com',
                    'path'=>'/?id=aP9QwYV%2CaRjvbnq%2CaOB9wDy&c=10',
                    'scheme'=>'https',
                    'accept'=>'application/json, text/javascript, */*; q=0.01',
                    'accept-encoding'=>'gzip, deflate, br',
                    'user-agent'=>'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
                    ]
            ]
            );
        $chuoi_dulieu= $res->getBody();
        $stringBody = (string) $chuoi_dulieu;
        $stringBody=\GuzzleHttp\json_decode($stringBody);
        $array_hinh=$stringBody->items;
        echo count($array_hinh);
        var_dump($array_hinh);
        foreach($array_hinh as $key=>$hinh){
            $la_video=0;
            $dom = HtmlDomParser::str_get_html( $hinh );
            $tua_de=$dom->find("h2",0)->plaintext;
            $elems = $dom->find("source");
            if(empty($elems))// handle the case where the item is a gif/mp4 video
            {
                $elems_2 = $dom->find("div[class=badge-video-container]");
                if(!empty($elems_2)){
                  if($elems_2[0]->{'data-video-source'}=="YouTube"){
                      echo "YouTube";
                      continue;
                  }
                }
            }else
            {
                $link_video=$elems[0]->src;
                $link_hinh=preg_replace('/460(\w*)/', "700b", $elems[0]->src);
                $link_hinh=str_replace("mp4","jpg",$link_hinh);
                $slug_id=$this->LuuVideo($link_video,$link_hinh);
                if(!$slug_id)
                {
                    echo "loi";exit;
                }
                $la_video=1;
//                continue;// continue the loop, skipping the code below
            }
            if($la_video!=1)
            {
            $elems = $dom->find("img");// find the image when the item is not a video
            $link_hinh=preg_replace('/460(\w*)/', "700b", $elems[0]->src);
            $slug_id=$this->LuuAnh($link_hinh);
                if(!$slug_id)
                {
                    echo "loi";exit;
                }
            }
            echo $link_hinh.'<br>';
            $post=new \PostsCollection();
            $post->tua_de=$tua_de;
            $post->link_hinh="/photo/".$slug_id.'.jpg';
            $post->link_goc=$link_hinh;
            $post->save();
//            $dom->clear();
//            unset($dom);
            $this->view->pick("LayHinhTuDong/index");

        }
    }

getAttribute error while the attribute name include '-'

Has anybody run into this problem before?
For example:
$element->data-lazyload;
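
PHP parses $element->data-lazyload as the property $element->data minus a constant named lazyload. A property name containing a dash can be written with braces and a quoted string instead, the same syntax another issue on this page uses for 'data-video-source'. In this minimal sketch a stdClass object stands in for a parsed element, and the URL is a placeholder.

```php
<?php
// stdClass stands in for a parsed element here; with the library, the same
// brace syntax is applied to the node object.
$element = new stdClass();
$element->{'data-lazyload'} = 'https://example.com/image.jpg'; // placeholder URL

// $element->data-lazyload would be parsed as ($element->data) - lazyload.
// Braces around a quoted string let the property name contain a dash.
$value = $element->{'data-lazyload'};
echo $value, PHP_EOL;
```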

For anyone who struggles with PHP 7.3 - use another package

Here is an updated package for this library:

composer require caophihung94/php-simple-html-dom-parser

Is the selector ' > ' working ?

I'm trying to select 'td > a span' but it's selecting 'td a span'...

file_get_contents(): stream does not support seeking

I have the following warnings when using this library

Warning: file_get_contents(): stream does not support seeking in vendor/sunra/php-simple-html-dom-parser/Src/Sunra/PhpSimple/simplehtmldom_1_5/simple_html_dom.php on line 81

Warning: file_get_contents(): Failed to seek to position -1 in the stream in vendor/sunra/php-simple-html-dom-parser/Src/Sunra/PhpSimple/simplehtmldom_1_5/simple_html_dom.php on line 81

Can someone help?

Apply the latest changes to the composer.json file to the 1.5.1 version

The recent changes to the composer.json file are in the master branch, and since the default version is 1.5.1, the composer.json file needs to be merged from master into the 1.5.1 branch.

Text subnodes

Is it possible to extract the contents of the first text node,
i.e. the string Hello in this subtree?

<div>
    Hello
    <strong>World!</strong>
</div>
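
As a point of comparison, the first text node can be picked out with PHP's built-in DOM extension. This is a swapped-in technique, not Simple HTML DOM, and is only a sketch of the idea: walk the direct children of the div and keep the first text node that is not just whitespace.

```php
<?php
// Swapped-in technique: PHP's built-in DOM extension, not Simple HTML DOM.
$doc = new DOMDocument();
$doc->loadHTML('<div>
    Hello
    <strong>World!</strong>
</div>');

$div = $doc->getElementsByTagName('div')->item(0);

$firstText = null;
foreach ($div->childNodes as $child) {
    // Keep the first direct text node that is not just whitespace.
    if ($child->nodeType === XML_TEXT_NODE && trim($child->nodeValue) !== '') {
        $firstText = trim($child->nodeValue);
        break;
    }
}

echo $firstText, PHP_EOL;
```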

Maintained Fork of PHP Simple HTML DOM Parser (PHP 7.X)

https://github.com/voku/simple_html_dom

An HTML DOM parser written in PHP that lets you manipulate HTML in a very easy way! This is a fork of the PHP Simple HTML DOM Parser project, but instead of string manipulation it uses DOMDocument and modern PHP classes like "Symfony CssSelector".

PHP 7.0+ Support
PHP-FIG Standard
Composer & PSR-4 support
PHPUnit testing via Travis CI
PHP-Quality testing via SensioLabsInsight
UTF-8 Support (more support via "voku/portable-utf8")
Invalid HTML Support (partly ...)
Find tags on an HTML page with selectors just like jQuery
Extract contents from HTML in a single line

Use newest version of simplehtmldom

simplehtmldom is currently in version 1.8.1
Why not use the latest version?

I had trouble with the current version because mb_detect_encoding isn't available on all systems. This is fixed in version 1.8.1

Can't find simple things

The parser won't find <p class="body"> in this line:

<script id="forecast-summary-0" type="text/x-jquery-tmpl"> <div id="forecast-summary" class="summary-column"> <h3>Forecast Summary</h3> <div class="forecast-summary" lang="en-GB"> <ul > <li> <h4 class="title">This Evening and Tonight</h4> <p class="body">Fairly cloudy this evening with scattered heavy showers, which gradually ease through the evening. However cloud thickeing overnight to bring periods of occasionally heavy rain before dawn as southeast winds increase strong to near gale.</p> </li> </ul> </div> </div> </script>

(The markup is all on one line.) I use $body = $body[0]->find('p[body]'); to find it, but it returns no results. Is there something I've missed? Can you help?

Unable to parse HTML

$dom = HtmlDomParser::str_get_html('

欢迎来到。这是我的第一篇文章。最先写作吧！

');

What is the cause of this mistake?

ErrorException : preg_match(): Compilation failed: invalid range in character class at offset 4

at /Users/enle/app/hyena-cms/vendor/sunra/php-simple-html-dom-parser/Src/Sunra/PhpSimple/simplehtmldom_1_5/simple_html_dom.php:1378
1374| $this->char = $this->doc[--$this->pos]; // prev
1375| return true;
1376| }
1377|

1378| if (!preg_match("/^[\w-:]+$/", $tag)) {
1379| $node->_[HDOM_INFO_TEXT] = '<' . $tag . $this->copy_until('<>');
1380| if ($this->char==='<') {
1381| $this->link_nodes($node, false);
1382| return true;

Exception trace:

1 preg_match("/^[\w-:]+$/", "p")
/Users/enle/app/hyena-cms/vendor/sunra/php-simple-html-dom-parser/Src/Sunra/PhpSimple/simplehtmldom_1_5/simple_html_dom.php:1378

2 simplehtmldom_1_5\simple_html_dom::read_tag()
/Users/enle/app/hyena-cms/vendor/sunra/php-simple-html-dom-parser/Src/Sunra/PhpSimple/simplehtmldom_1_5/simple_html_dom.php:1187

3 simplehtmldom_1_5\simple_html_dom::parse()
/Users/enle/app/hyena-cms/vendor/sunra/php-simple-html-dom-parser/Src/Sunra/PhpSimple/simplehtmldom_1_5/simple_html_dom.php:1081

4 simplehtmldom_1_5\simple_html_dom::load("

欢迎来到。这是我的第一篇文章。最先写作吧！

")
/Users/enle/app/hyena-cms/vendor/sunra/php-simple-html-dom-parser/Src/Sunra/PhpSimple/simplehtmldom_1_5/simple_html_dom.php:102

5 simplehtmldom_1_5\str_get_html("

欢迎来到。这是我的第一篇文章。最先写作吧！

")
/Users/enle/app/hyena-cms/vendor/sunra/php-simple-html-dom-parser/Src/Sunra/PhpSimple/HtmlDomParser.php:21

6 call_user_func_array("\simplehtmldom_1_5\str_get_html")
/Users/enle/app/hyena-cms/vendor/sunra/php-simple-html-dom-parser/Src/Sunra/PhpSimple/HtmlDomParser.php:21

7 Sunra\PhpSimple\HtmlDomParser::str_get_html("

欢迎来到。这是我的第一篇文章。最先写作吧！

")
/Users/enle/app/hyena-cms/app/Service/ArticleFormatter.php:296

8 App\Service\ArticleFormatter::convertImage(Object(Closure))
/Users/enle/app/hyena-cms/app/Service/ArticleFormatter.php:91

9 App\Service\ArticleFormatter::importImage()
/Users/enle/app/hyena-cms/app/Console/Commands/WpSynchronizationImage.php:50

10 App\Console\Commands\WpSynchronizationImage::handle()
/Users/enle/app/hyena-cms/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php:32

11 call_user_func_array([])
/Users/enle/app/hyena-cms/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php:32

12 Illuminate\Container\BoundMethod::Illuminate\Container{closure}()
/Users/enle/app/hyena-cms/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php:90

13 Illuminate\Container\BoundMethod::callBoundMethod(Object(Illuminate\Foundation\Application), Object(Closure))
/Users/enle/app/hyena-cms/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php:34

14 Illuminate\Container\BoundMethod::call(Object(Illuminate\Foundation\Application), [])
/Users/enle/app/hyena-cms/vendor/laravel/framework/src/Illuminate/Container/Container.php:576

15 Illuminate\Container\Container::call()
/Users/enle/app/hyena-cms/vendor/laravel/framework/src/Illuminate/Console/Command.php:183

16 Illuminate\Console\Command::execute(Object(Symfony\Component\Console\Input\ArgvInput), Object(Illuminate\Console\OutputStyle))
/Users/enle/app/hyena-cms/vendor/symfony/console/Command/Command.php:255

17 Symfony\Component\Console\Command\Command::run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Illuminate\Console\OutputStyle))
/Users/enle/app/hyena-cms/vendor/laravel/framework/src/Illuminate/Console/Command.php:170

18 Illuminate\Console\Command::run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
/Users/enle/app/hyena-cms/vendor/symfony/console/Application.php:908

19 Symfony\Component\Console\Application::doRunCommand(Object(App\Console\Commands\WpSynchronizationImage), Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
/Users/enle/app/hyena-cms/vendor/symfony/console/Application.php:269

20 Symfony\Component\Console\Application::doRun(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
/Users/enle/app/hyena-cms/vendor/symfony/console/Application.php:145

21 Symfony\Component\Console\Application::run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
/Users/enle/app/hyena-cms/vendor/laravel/framework/src/Illuminate/Console/Application.php:90

22 Illuminate\Console\Application::run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
/Users/enle/app/hyena-cms/vendor/laravel/framework/src/Illuminate/Foundation/Console/Kernel.php:122

23 Illuminate\Foundation\Console\Kernel::handle(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
/Users/enle/app/hyena-cms/artisan:38

Check if parent tag has id, attribute or class.

Hello,

I am looping through a HTML string as follows:

	foreach ( $dom->find( 'text' ) as $element ) {
		
		if ( !in_array( $element->parent()->tag, $excludedParents ) ) {				
			$element->innertext = preg_replace(
				'/(?<!\w)' . preg_quote( $search, "/" ) . '(?!\w)/i',
				$replace,
				$element->innertext
			);
		}
	}

This works fine for excluded parents like a, div or em, but not for a.test or div#test. Is there an elegant way to solve that?

HtmlDomParser::file_get_html() returning false for html page

Hi,

First of all, thanks for this great tool. I'm having a little problem. When I use either HtmlDomParser::file_get_html($urlOfThePage), or get the HTML of the file with curl and use HtmlDomParser::file_get_html($str), for one specific HTML page, those functions return false. They work perfectly fine with other pages, just not this one. Why would that be?

Thanks.

Blank TD returning next result, not the blank per se

Hi sunra, I am having an issue using Simple HTML DOM Parser. I have used it several times before, but only now came across this issue:

When searching for TDs, if there is a blank TD (with or without content), I get the following TDs as its result.

I have found also that someone reported the same on Stackoverflow: http://stackoverflow.com/questions/11123267/simple-html-dom-parser-return-empty-td-with-all-tds-values

Example as a result of var_dumping $html->find('td'); (element 2 should be blank!):

0
12/02/2014 09:14 AM
1
MEXICO D.F. En proceso de entrega MEX MEXICO D.F.
2

12/02/2014 08:27 AM
MEXICO D.F. Llegada a centro de distribucion
Envio en proceso de entrega
