Simple HTML DOM is a great open-source parser:
It treats DOM elements in an object-oriented way, and the latest version copes well with non-compliant markup. It also offers functions similar to the ones you’d see in JavaScript, such as the “find” function, which returns every element that matches a selector (for example, a tag name).
I’ve used this in a number of tools, testing it on many different types of web pages, and I think it works great.
These are the main features:
- HTML DOM parser written in PHP 5+ that lets you manipulate HTML in a very easy way
- Requires PHP 5+
- Supports invalid HTML
- Find tags on an HTML page with selectors just like jQuery
- Extract contents from HTML in a single line
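To make the jQuery-style selector feature more concrete, here is a minimal sketch that runs a few common selector forms against an inline string. The sample markup is invented for this illustration, and the include line assumes the library file is named simple_html_dom.php and sits next to the script.

```php
<?php
// Assumption: the library file from the download sits next to this script.
include 'simple_html_dom.php';

// Hypothetical sample markup, invented for this illustration.
$html = str_get_html(
    '<div id="menu">'
    . '<a class="nav" href="home.html">Home</a>'
    . '<a class="nav" href="about.html">About</a>'
    . '</div>'
    . '<p>No link here</p>'
);

// Tag selector: every <a> element, returned as an array.
$links = $html->find('a');

// Class selector: <a> elements with class "nav".
$navLinks = $html->find('a.nav');

// Id selector: passing an index returns a single element instead of an array.
$menu = $html->find('#menu', 0);

// Descendant selector: <a> elements nested inside a <div>.
$nested = $html->find('div a');

// Attribute selector: elements that carry an href attribute.
$withHref = $html->find('a[href]');

foreach ($links as $link) {
    echo $link->href . ' => ' . $link->plaintext . "\n";
}
```

Calling find without an index always gives back an array (possibly empty), so it is safe to loop over; with an index you get one element, or null when nothing matches.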
How to get HTML elements:
// Load the library (assumes simple_html_dom.php sits next to this script)
include 'simple_html_dom.php';

// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');

// Find all images
foreach ($html->find('img') as $element) {
    echo $element->src . '<br>';
}

// Find all links
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}
How to modify HTML elements:
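// Create DOM from string
$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');

// Add a class to the second div (the index is zero-based)
$html->find('div', 1)->class = 'bar';

// Replace the inner content of the div whose id is "hello"
$html->find('div[id=hello]', 0)->innertext = 'foo';

// Print the modified HTML
echo $html;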
Extract content from HTML:
// Dump contents (without tags) from HTML
echo file_get_html('http://www.google.com/')->plaintext;
Scraping Slashdot:
// Create DOM from URL
$html = file_get_html('http://slashdot.org/');

// Collect the results here, so print_r still works if nothing matches
$articles = array();

// Find all article blocks
foreach ($html->find('div.article') as $article) {
    $item = array();
    $item['title'] = $article->find('div.title', 0)->plaintext;
    $item['intro'] = $article->find('div.intro', 0)->plaintext;
    $item['details'] = $article->find('div.details', 0)->plaintext;
    $articles[] = $item;
}

print_r($articles);
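One thing to watch when scraping real pages: find with an index returns null when nothing matches, so reading ->plaintext straight from the result breaks as soon as a block is missing one of the expected elements. Below is a rough sketch of a more defensive version. The first_text helper and the sample markup are hypothetical, made up for this example, and it runs against an inline string rather than the live site.

```php
<?php
// Assumption: the library file from the download sits next to this script.
include 'simple_html_dom.php';

// Hypothetical helper: plaintext of the first match, or null when nothing matches.
function first_text($node, $selector)
{
    $match = $node->find($selector, 0);
    if (!$match) {
        return null;
    }
    return trim($match->plaintext);
}

// Hypothetical sample markup: the second article has no intro block.
$html = str_get_html(
    '<div class="article">'
    . '<div class="title">First</div><div class="intro">Intro one</div>'
    . '</div>'
    . '<div class="article">'
    . '<div class="title">Second</div>'
    . '</div>'
);

$articles = array();
foreach ($html->find('div.article') as $article) {
    $articles[] = array(
        'title' => first_text($article, 'div.title'),
        'intro' => first_text($article, 'div.intro'), // null for the second article
    );
}

print_r($articles);
```

When loading from a URL, it is also worth checking the return value of file_get_html before calling find on it, since the fetch itself can fail.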