DOMDocument::loadHTML

(PHP 5, PHP 7, PHP 8)

DOMDocument::loadHTML - Load HTML from a string

Description

public function DOMDocument::loadHTML(string $source, int $options = 0): bool

The function parses the HTML contained in the string source. Unlike loading XML, the HTML does not have to be well-formed to load.

Warning

For parsing and processing modern HTML, the Dom\HTMLDocument class is recommended instead of DOMDocument.

This function parses the input according to the HTML 4 standard. The parsing rules of HTML 5, which modern browsers follow, are different. Depending on the input, this can produce a different DOM structure, so the function cannot be safely used for sanitizing HTML.

The behaviour when parsing HTML depends on the libxml version, particularly for edge cases and error handling. To parse markup in conformance with the HTML5 specification, use Dom\HTMLDocument::createFromString() or Dom\HTMLDocument::createFromFile(), which were added in PHP 8.4.

For example, some HTML elements implicitly close their parent element when encountered. The rules for automatically closing parent elements differ between HTML 4 and HTML 5, so the DOM structure that DOMDocument sees may differ from the one a browser sees, which may allow an attacker to break the resulting HTML.
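
The recommendation above can be sketched as follows. This is only an illustration: firstParagraphText() is a hypothetical helper, not part of PHP, and the fallback branch simply reuses the HTML 4 parser on PHP versions older than 8.4.

```php
<?php
// Hypothetical helper (not part of PHP): returns the text of the first <p>,
// preferring the HTML5 parser when it is available.
function firstParagraphText(string $html): ?string
{
    if (class_exists(\Dom\HTMLDocument::class)) {
        // PHP 8.4+: HTML5-compliant parsing. LIBXML_NOERROR keeps parse
        // errors from being raised as warnings.
        $doc = \Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR);
    } else {
        // Older PHP: fall back to the HTML 4 parser described on this page.
        $previous = libxml_use_internal_errors(true);
        $doc = new DOMDocument();
        $doc->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
    }
    $p = $doc->getElementsByTagName('p')->item(0);
    return $p === null ? null : $p->textContent;
}

echo firstParagraphText(
    '<!DOCTYPE html><html><body><section><p>Hi</p></section></body></html>'
), "\n";
```

Both branches expose getElementsByTagName(), so the calling code does not need to know which parser produced the document.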

Parameters

source

The HTML string.

options
Bitwise OR of the libxml option constants.
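
As a rough illustration of combining options, the sketch below ORs together four libxml constants. The fragment string and the particular set of constants are choices made for this example only.

```php
<?php
// Parse a fragment without the implied <html>/<body> wrappers and without
// a default doctype, while silencing libxml error and warning reports.
$options = LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
         | LIBXML_NOERROR | LIBXML_NOWARNING;

$doc = new DOMDocument();
$doc->loadHTML('<div><p>Hello</p></div>', $options);

// With LIBXML_HTML_NOIMPLIED the root element is the fragment's own <div>.
echo $doc->documentElement->nodeName, "\n";
echo $doc->saveHTML();
```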

Return Values

Returns true on success or false on failure.

Errors/Exceptions

If an empty string is passed as source, a warning is generated. This warning is not generated by libxml, so it cannot be handled with libxml's error handling functions.

Although malformed HTML usually loads successfully, this function may generate E_WARNING errors when it encounters bad markup. libxml's error handling functions can be used to handle these errors.
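
A minimal sketch of handling such errors with the libxml functions is shown below. The malformed input string is made up for the example, and which errors are reported for it (if any) depends on the libxml version.

```php
<?php
// Collect libxml errors internally instead of raising E_WARNING.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadHTML('<html><body><p>Unclosed <b>bold</p></i></body></html>');

// Each entry is a LibXMLError object with level, code, line and message.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// Clear the buffer and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);
```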

Changelog

Version Description
8.3.0 This function now has a tentative bool return type.
8.0.0 Calling this function statically will now throw an Error. Previously, an E_DEPRECATED was raised.

Examples

Example #1 Creating a document

<?php
$doc = new DOMDocument();
$doc->loadHTML("<html><body>Test<br></body></html>");
echo $doc->saveHTML();
?>

See Also

  • DOMDocument::loadHTMLFile() - Load HTML from a file
  • DOMDocument::saveHTML() - Dumps the internal document into a string using HTML formatting
  • DOMDocument::saveHTMLFile() - Dumps the internal document into a file using HTML formatting

User Contributed Notes 19 notes

mdmitry at gmail dot com
16 years ago
You can also load HTML as UTF-8 using this simple hack:

<?php

$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);

// dirty fix
foreach ($doc->childNodes as $item)
    if ($item->nodeType == XML_PI_NODE)
        $doc->removeChild($item); // remove hack
$doc->encoding = 'UTF-8'; // insert proper

?>

BychkovVV at mail dot ru
6 years ago
If you load UTF-8 HTML from an arbitrary website and the meta tag with the content-type is not the first child of HEAD, the parser will not pick up the encoding. You can work around it like this:
function domLoadHTML($html)
{
    // First pass: parse as-is so the <meta> tags can be inspected.
    $testDOM = new DOMDocument('1.0', 'UTF-8');
    $testDOM->loadHTML($html);
    $charset = null;
    // Recursively look for <meta http-equiv="Content-Type"> inside <html>/<head>.
    $searchInElement = function ($item) use (&$searchInElement, &$charset) {
        if (!$item->childNodes) {
            return;
        }
        foreach ($item->childNodes as $childItem) {
            switch ($childItem->nodeName) {
                case 'html':
                case 'head':
                    $searchInElement($childItem);
                    break;
                case 'meta':
                    $attributes = array();
                    foreach ($childItem->attributes as $attr) {
                        $attributes[mb_strtoupper($attr->localName)] = $attr->nodeValue;
                    }
                    if (array_key_exists('HTTP-EQUIV', $attributes)
                        && mb_strtoupper($attributes['HTTP-EQUIV']) == 'CONTENT-TYPE'
                        && array_key_exists('CONTENT', $attributes)
                        && preg_match('~[\s]*;[\s]*charset[\s]*=[\s]*([^\s]+)~', $attributes['CONTENT'], $matches)
                    ) {
                        $charset = preg_replace('~[\s\']~', '', $matches[1]);
                    }
                    break;
            }
        }
    };
    $searchInElement($testDOM);
    if (isset($charset)) {
        // Second pass: reparse with the detected charset forced via the XML PI hack.
        $dom = new DOMDocument('1.0', $charset);
        $dom->loadHTML('<?xml encoding="' . $charset . '">' . $html);
        foreach ($dom->childNodes as $item) {
            if ($item->nodeType == XML_PI_NODE) {
                $dom->removeChild($item);
            }
        }
        $dom->encoding = $charset;
    } else {
        $dom = $testDOM;
    }
    return $dom;
}

Shane Harter
16 years ago
DOMDocument is very good at dealing with imperfect markup, but it throws warnings all over the place when it does. 

This isn't well documented here. The solution is to implement a separate apparatus for dealing with just these errors.

Set libxml_use_internal_errors(true) before calling loadHTML. This will prevent errors from bubbling up to your default error handler. And you can then get at them (if you desire) using other libxml error functions. 

You can find more info here http://www.php.net/manual/en/ref.libxml.php

hanhvansu at yahoo dot com
19 years ago
When using loadHTML() to process UTF-8 pages, you may find that the output of the DOM functions does not match the input. For example, if you expect "Cạnh tranh", you may receive garbled text such as "Cạnh tranh" instead. I suggest calling mb_convert_encoding before loading a UTF-8 page:
<?php
    $pageDom = new DomDocument();    
    $searchPage = mb_convert_encoding($htmlUTF8Page, 'HTML-ENTITIES', "UTF-8"); 
    @$pageDom->loadHTML($searchPage);

?>
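
The 'HTML-ENTITIES' pseudo-encoding used above is deprecated as of PHP 8.2. One possible alternative, sketched here under the assumption that the input is valid UTF-8, is mb_encode_numericentity(). The helper name loadUtf8Html() is hypothetical.

```php
<?php
// Hypothetical helper: load a UTF-8 HTML string without relying on the
// 'HTML-ENTITIES' pseudo-encoding.
function loadUtf8Html(string $html): DOMDocument
{
    // Convert every code point above ASCII (0x80..0x10FFFF) to a numeric entity.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $encoded = mb_encode_numericentity($html, $map, 'UTF-8');

    // Keep parser complaints out of the normal error handler.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($encoded);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc;
}

$doc = loadUtf8Html('<p>Tiếng Việt</p>');
echo $doc->getElementsByTagName('p')->item(0)->textContent, "\n";
```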

obayed dot opu at gmail dot com
4 years ago
To load HTML5 markup without errors, suppress libxml error reporting by passing `LIBXML_NOERROR` as an option to the loadHTML method.

Example:

<?php
$doc = new DOMDocument();
$doc->loadHTML("<html><body>Test<br><section>I'M UNSUPPORTED</section></body></html>", LIBXML_NOERROR);
echo $doc->saveHTML();
?>

bigtree at DONTSPAM dot 29a dot nl
21 years ago
Pay attention when loading html that has a different charset than iso-8859-1. Since this method does not actively try to figure out what the html you are trying to load is encoded in (like most browsers do), you have to specify it in the html head. If, for instance, your html is in utf-8, make sure you have a meta tag in the html's head section:

<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
</head>

If you do not specify the charset like this, all high-ascii bytes will be html-encoded. It is not enough to set the dom document you are loading the html in to UTF-8.
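
A small sketch of the approach described in this note: the sample document below is made up, and it declares its charset in the head before any non-ASCII text appears.

```php
<?php
// UTF-8 input whose <head> declares the charset the HTML 4 way.
$html = '<html><head>'
      . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
      . '<title>Tiếng Việt</title></head><body></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// Strings read back from the DOM are UTF-8.
$title = $doc->getElementsByTagName('title')->item(0)->textContent;
echo $title, "\n";
```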

deepakrajpal dot com at gmail dot com
5 years ago
If we load HTML5 tags such as <section> or <svg>, the following error is raised:

DOMDocument::loadHTML(): Tag section invalid in Entity

We can disable standard libxml errors (and enable user error handling) by calling libxml_use_internal_errors(true) before loadHTML().

This is quite useful in PHPUnit custom assertions, as in the following example (if using PHPUnit test cases):

// Create a DOMDocument
$dom = new DOMDocument();

// fix html5/svg errors
libxml_use_internal_errors(true);
        
// Load html 
$dom->loadHTML("<section></section>");
$htmlNodes = $dom->getElementsByTagName('section');

// Passes only if the <section> element was parsed into the DOM
$this->assertGreaterThan(0, $htmlNodes->length);

Anonymous
4 years ago
loadHTML() and loadHTMLFile() may generate warnings whenever the HTML includes tags introduced in HTML5, such as nav, section or footer (observed in PHP 8.1.6).

Try running the code below.

<?php

$file_name = 'PHP Runtime Configuration - Manual.html'; // Download this file from "https://www.php.net/manual/en/session.configuration.php" in advance.

$doc = new DOMDocument();
$doc->loadHTMLFile($file_name); // if set "LIBXML_NOERROR" as 2nd arg, no error
echo $doc->saveHTML();

// Warning: DOMDocument::loadHTMLFile(): Tag nav invalid in PHP Runtime Configuration - Manual.html, line: 63 in D:\xampp\htdocs\test\xml(dom)\loadHTML\index.php on line 6

?>

finkenb2 at mail dot lib dot msu dot edu
10 years ago
Warning:  This does not function well with HTML5 elements such as SVG.  Most of the advice on the Web is to turn off errors in order to have it work with HTML5.

fr at felix-riesterer dot de
10 years ago
Remember: If you use an HTML5 doctype and a meta element like so

<meta charset="utf-8">

your HTML code will get interpreted as ISO-8859-something and non-ASCII chars will get converted into HTML entities. However the HTML4-like version will work (as has been pointed out 10 years ago by "bigtree at 29a"):

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

cake at brothercake dot com
13 years ago
Be aware that this function doesn't actually understand HTML -- it fixes tag-soup input using the general rules of SGML, so it creates well-formed markup, but has no idea which element contexts are allowed.

For example, with input like this where the first element isn't closed: 

    <span>hello <div>world</div>

loadHTML will change it to this, which is well-formed but invalid:

    <span>hello <div>world</div></span>
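
To check where the parser actually placed the <div> on a given installation, one could walk up its ancestors, as in this sketch. The exact result may vary with the libxml version, so no output is shown.

```php
<?php
$doc = new DOMDocument();
// LIBXML_NOERROR suppresses any parse error reports for the malformed input.
$doc->loadHTML('<span>hello <div>world</div>', LIBXML_NOERROR);

// Walk up from the <div> to see where the parser placed it.
$node = $doc->getElementsByTagName('div')->item(0);
$path = [];
while ($node instanceof DOMElement) {
    array_unshift($path, $node->nodeName);
    $node = $node->parentNode;
}
echo implode(' > ', $path), "\n";
```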

romain dot lalaut at laposte dot net
19 years ago
Note that the elements of such document will have no namespace even with <html xmlns="http://www.w3.org/1999/xhtml">
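
A short sketch of what this means in practice, using a made-up sample document: the root element reports no namespace, so XPath queries need no registered prefix.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html xmlns="http://www.w3.org/1999/xhtml"><body><p>Hi</p></body></html>');

// The xmlns declaration is not applied: the root element has no namespace.
$namespace = $doc->documentElement->namespaceURI;
var_dump($namespace);

// Unprefixed XPath names therefore match the elements directly.
$xpath = new DOMXPath($doc);
$count = $xpath->query('//p')->length;
echo $count, "\n";
```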

Errol
17 years ago
It should be noted that when any text is provided within the body tag
outside of a containing element, the DOMDocument will encapsulate that
text into a paragraph tag (<p>).

For example:
<?php
$doc = new DOMDocument();
$doc->loadHTML("<html><body>Test<br><div>Text</div></body></html>");
echo $doc->saveHTML();
?>

will yield:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>Test<br></p>
<div>Text</div>
</body></html>

while:
<?php
$doc = new DOMDocument();
$doc->loadHTML(
    "<html><body><i>Test</i><br><div>Text</div></body></html>");
echo $doc->saveHTML();
?>

will yield:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<i>Test</i><br><div>Text</div>
</body></html>

kerim-yagmurcu at gmx dot de
9 years ago
For those of you who want to get elements by class from an external URL, I have two useful functions. In this example we get the '<h3 class="r">' elements (search result headers) from a Google search:

1. Check the URL (if it is reachable, existing)
<?php
# URL Check
function url_check($url) { 
    $headers = @get_headers($url); 
    return is_array($headers) ? preg_match('/^HTTP\\/\\d+\\.\\d+\\s+2\\d\\d\\s+.*$/',$headers[0]) : false; 
}
?>

2. Clean the element you want to get (remove all tags, tabs, new-lines etc.)
<?php
# Function to clean a string
function clean($text){
    // Strip tags, collapse whitespace, replace ';' with '-', then trim and decode entities
    $clean = html_entity_decode(trim(str_replace(';','-',preg_replace('/\s+/S', " ", strip_tags($text)))));
    return $clean;
}
?>

After doing that, we can output the search result headers with the following code:
<?php
$searchstring = 'djceejay';
$url = 'http://www.google.de/webhp#q='.$searchstring;
if(url_check($url)){
    $doc = new DOMDocument();
    $doc->validateOnParse = true;
    $doc->loadHTML(file_get_contents($url));
    // DOMDocument has no getElementByClass() method, so match the class with XPath
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//h3[contains(concat(" ", normalize-space(@class), " "), " r ")]') as $node) {
        echo clean($node->textContent) . '<br>';
    }
}else{
    echo 'URL not reachable!';// Show a message when the URL cannot be fetched
}
?>

jamesedwardcooke+php at gmail dot com
17 years ago
Using loadHTML() automagically sets the doctype property of your DOMDocument instance (to the doctype in the HTML, or to 4.0 Transitional by default). If you set the doctype with DOMImplementation, it will be overridden.

I assumed it was possible to set it and then load HTML with the doctype I defined (in order to decide the doctype at runtime), and ran into a huge headache trying to find out where my doctype was going. Hopefully this helps someone else.

divinity76+spam at gmail dot com
5 years ago
if you want to get rid of all the "DOMText elements containing ONLY whitespace", maybe try

<?php

function loadHTML_noemptywhitespace(string $html, int $extra_flags = 0, int $exclude_flags = 0): DOMDocument
{
    $flags = LIBXML_HTML_NODEFDTD | LIBXML_NOBLANKS | LIBXML_NONET;
    $flags = ($flags | $extra_flags) & ~ $exclude_flags;

    $domd = new DOMDocument();
    $domd->preserveWhiteSpace = false;
    @$domd->loadHTML('<?xml encoding="UTF-8">' . $html, $flags);
    $removeAnnoyingWhitespaceTextNodes = function (\DOMNode $node) use (&$removeAnnoyingWhitespaceTextNodes): void {
        if ($node->hasChildNodes()) {
            // Warning: it's important to do it backwards; if you do it forwards, the index for DOMNodeList might become invalidated;
            // that's why i don't use foreach() - don't change it (unless you know what you're doing, ofc)
            for ($i = $node->childNodes->length - 1; $i >= 0; --$i) {
                $removeAnnoyingWhitespaceTextNodes($node->childNodes->item($i));
            }
        }
        // Compare against '' explicitly: empty() would also treat the text "0" as empty.
        if ($node->nodeType === XML_TEXT_NODE && !$node->hasChildNodes() && !$node->hasAttributes() && trim($node->textContent) === '') {
            //echo "Removing annoying POS";
            // var_dump($node);
            $node->parentNode->removeChild($node);
        } //elseif ($node instanceof DOMText) { echo "not removed"; var_dump($node, $node->hasChildNodes(), $node->hasAttributes(), trim($node->textContent)); }
    };
    $removeAnnoyingWhitespaceTextNodes($domd);
    return $domd;
}

Alex
16 years ago
Beware of the "gotcha" (works as designed but not as expected): if you use loadHTML, you cannot validate the document. Validation is only for XML. Details here: http://bugs.php.net/bug.php?id=43771&edit=1

xuanbn at yahoo dot com
18 years ago
If you use loadHTML() to process a UTF-8 HTML string (e.g. in Vietnamese), you may get garbage text for some files while others are OK, even if your HTML already has a meta charset like

  <meta http-equiv="content-type" content="text/html; charset=utf-8">

I have discovered that, to help loadHTML() process a UTF-8 file correctly, the meta tag should come first, before any UTF-8 string appears. For example, this HTML file

<html>
 <head>
    <meta http-equiv="content-type" content="text/html; charset=utf-8">
    <title> Vietnamese - Tiếng Việt</title>
  </head>
<body></body>
</html>

will be OK with loadHTML() because the <meta> tag appears before the <title> tag.

But the file below will not be recognized by loadHTML(), because the <title> tag containing a UTF-8 string appears before the <meta> tag.

<html>
 <head>
    <title> Vietnamese - Tiếng Việt</title>
    <meta http-equiv="content-type" content="text/html; charset=utf-8">
  </head>
<body></body>
</html>

piopier
16 years ago
Here is a function I wrote to capitalize on the previous remarks about charset problems (UTF-8...) when using loadHTML and then DOM functions.
It adds the charset meta tag just after <head> to improve automatic encoding detection and converts any non-ASCII character to an HTML entity, so PHP DOM functions/attributes return correct values.

<?php
mb_detect_order("ASCII,UTF-8,ISO-8859-1,windows-1252,iso-8859-15");
function loadNprepare($url,$encod='') {
        $content        = file_get_contents($url);
        if (!empty($content)) {
                if (empty($encod))
                        $encod  = mb_detect_encoding($content);
                $headpos        = mb_strpos($content,'<head>');
                if (FALSE=== $headpos)
                        $headpos= mb_strpos($content,'<HEAD>');
                if (FALSE!== $headpos) {
                        $headpos+=6;
                        $content = mb_substr($content,0,$headpos) . '<meta http-equiv="Content-Type" content="text/html; charset='.$encod.'">' .mb_substr($content,$headpos);
                }
                $content=mb_convert_encoding($content, 'HTML-ENTITIES', $encod);
        }
        $dom = new DomDocument;
        $res = $dom->loadHTML($content);
        if (!$res) return FALSE;
        return $dom;
}
?>

NB: it uses mb_strpos/mb_substr instead of mb_ereg_replace because that seemed more efficient with huge html pages.