.. index::
single: DomCrawler
single: Components; DomCrawler
The DomCrawler Component
========================
The DomCrawler component eases DOM navigation for HTML and XML documents.
.. note::
While possible, the DomCrawler component is not designed for manipulation
of the DOM or re-dumping HTML/XML.
Installation
------------
.. code-block:: terminal
$ composer require symfony/dom-crawler
.. include:: /components/require_autoload.rst.inc
Usage
-----
.. seealso::
This article explains how to use the DomCrawler features as an independent
component in any PHP application. Read the :ref:`Symfony Functional Tests `
article to learn about how to use it when creating Symfony tests.
The :class:`Symfony\\Component\\DomCrawler\\Crawler` class provides methods
to query and manipulate HTML and XML documents.
An instance of the Crawler represents a set of :phpclass:`DOMElement` objects,
which are nodes that can be traversed as follows::
use Symfony\Component\DomCrawler\Crawler;
$html = <<<'HTML'
<!DOCTYPE html>
<html>
<body>
<p class="message">Hello World!</p>
<p>Hello Crawler!</p>
</body>
</html>
HTML;
$crawler = new Crawler($html);
foreach ($crawler as $domElement) {
var_dump($domElement->nodeName);
}
Specialized :class:`Symfony\\Component\\DomCrawler\\Link`,
:class:`Symfony\\Component\\DomCrawler\\Image` and
:class:`Symfony\\Component\\DomCrawler\\Form` classes are useful for
interacting with HTML links, images and forms as you traverse through the HTML
tree.
.. note::
The DomCrawler will attempt to automatically fix your HTML to match the
official specification. For example, if you nest a ``<p>`` tag inside
another ``<p>`` tag, it will be moved to be a sibling of the parent tag.
This is expected and is part of the HTML5 spec. But if you're getting
unexpected behavior, this could be a cause. And while the DomCrawler
isn't meant to dump content, you can see the "fixed" version of your HTML
by :ref:`dumping it `.
.. note::
If you need better support for HTML5 content or want to get rid of the
inconsistencies of PHP's DOM extension, install the `html5-php library`_.
The DomCrawler component will use it automatically when the content has
an HTML5 doctype.
Node Filtering
~~~~~~~~~~~~~~
Using XPath expressions, you can select specific nodes within the document::
$crawler = $crawler->filterXPath('descendant-or-self::body/p');
.. tip::
``DOMXPath::query`` is used internally to actually perform an XPath query.
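
As a rough illustration of what such a query looks like without the component,
the following sketch runs a similar XPath expression directly through PHP's DOM
extension. It is not the component's actual implementation, and the sample
markup is an assumption made for this example:

.. code-block:: php

    // Illustration only: plain ext-dom, not the Crawler's internal code.
    $document = new \DOMDocument();
    $document->loadHTML(
        '<!DOCTYPE html><html><body><p>Hello World!</p><p>Hello Crawler!</p></body></html>'
    );

    $xpath = new \DOMXPath($document);

    // collect the text of every <p> that is a direct child of <body>
    $texts = [];
    foreach ($xpath->query('//body/p') as $node) {
        $texts[] = $node->textContent;
    }
    // $texts now holds ['Hello World!', 'Hello Crawler!']
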
If you prefer CSS selectors over XPath, install the CssSelector component.
It allows you to use jQuery-like selectors to traverse::
$crawler = $crawler->filter('body > p');
An anonymous function can be used to filter with more complex criteria::
use Symfony\Component\DomCrawler\Crawler;
// ...
$crawler = $crawler
->filter('body > p')
->reduce(function (Crawler $node, $i) {
// filters every other node
return ($i % 2) == 0;
});
To remove a node, the anonymous function must return ``false``.
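
A minimal sketch of this behavior, assuming a hypothetical three-paragraph
document and Composer's autoloader at ``vendor/autoload.php``. It selects the
paragraphs with ``filterXPath()`` so that the CssSelector component is not
needed:

.. code-block:: php

    require_once __DIR__.'/vendor/autoload.php'; // assumed Composer install path

    use Symfony\Component\DomCrawler\Crawler;

    // sample markup invented for this example
    $crawler = new Crawler('<html><body><p>one</p><p>two</p><p>three</p></body></html>');

    // keep the nodes at even positions (0, 2, ...); returning false drops a node
    $kept = $crawler
        ->filterXPath('//body/p')
        ->reduce(function (Crawler $node, $i) {
            return 0 === $i % 2;
        });

    $texts = $kept->each(function (Crawler $node) {
        return $node->text();
    });
    // $texts is ['one', 'three']
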
.. note::
All filter methods return a new :class:`Symfony\\Component\\DomCrawler\\Crawler`
instance with filtered content.
Both the :method:`Symfony\\Component\\DomCrawler\\Crawler::filterXPath` and
:method:`Symfony\\Component\\DomCrawler\\Crawler::filter` methods work with
XML namespaces, which can be either automatically discovered or registered
explicitly.
Consider the XML below:
.. code-block:: xml
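    <?xml version="1.0" encoding="UTF-8" ?>
    <entry
        xmlns="http://www.w3.org/2005/Atom"
        xmlns:media="http://search.yahoo.com/mrss/"
        xmlns:yt="http://gdata.youtube.com/schemas/2007"
    >
        <id>tag:youtube.com,2008:video:kgZRZmEc9j4</id>
        <media:group>
            <media:title type="plain">Chordates - CrashCourse Biology #24</media:title>
            <yt:aspectRatio>widescreen</yt:aspectRatio>
        </media:group>
    </entry>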
This can be filtered with the ``Crawler`` without needing to register namespace
aliases both with :method:`Symfony\\Component\\DomCrawler\\Crawler::filterXPath`::
$crawler = $crawler->filterXPath('//default:entry/media:group//yt:aspectRatio');
and :method:`Symfony\\Component\\DomCrawler\\Crawler::filter`::
$crawler = $crawler->filter('default|entry media|group yt|aspectRatio');
.. note::
The default namespace is registered with a prefix "default". It can be
changed with the
:method:`Symfony\\Component\\DomCrawler\\Crawler::setDefaultNamespacePrefix`
method.
The default namespace is removed when loading the content if it's the only
namespace in the document. It's done to simplify the XPath queries.
Namespaces can be explicitly registered with the
:method:`Symfony\\Component\\DomCrawler\\Crawler::registerNamespace` method::
$crawler->registerNamespace('m', 'http://search.yahoo.com/mrss/');
$crawler = $crawler->filterXPath('//m:group//yt:aspectRatio');
Verify if the current node matches a selector::
$crawler->matches('p.lorem');
Node Traversing
~~~~~~~~~~~~~~~
Access a node by its position in the list::
$crawler->filter('body > p')->eq(0);
Get the first or last node of the current selection::
$crawler->filter('body > p')->first();
$crawler->filter('body > p')->last();
Get the nodes of the same level as the current selection::
$crawler->filter('body > p')->siblings();
Get the same level nodes after or before the current selection::
$crawler->filter('body > p')->nextAll();
$crawler->filter('body > p')->previousAll();
Get all the child or parent nodes::
$crawler->filter('body')->children();
$crawler->filter('body > p')->parents();
Get all the direct child nodes matching a CSS selector::
$crawler->filter('body')->children('p.lorem');
Get the closest element that matches the provided selector, starting with the
current element and heading toward the document root::
$crawler->closest('p.lorem');
.. note::
All the traversal methods return a new :class:`Symfony\\Component\\DomCrawler\\Crawler`
instance.
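
The following sketch applies several of these traversal methods to a small
document. The markup is invented for the example, the autoloader path is an
assumption, and the selection is made with ``filterXPath()`` so that the
CssSelector component is not required:

.. code-block:: php

    require_once __DIR__.'/vendor/autoload.php'; // assumed Composer install path

    use Symfony\Component\DomCrawler\Crawler;

    // sample markup invented for this example
    $crawler = new Crawler('<html><body><p>first</p><p>second</p><p>third</p></body></html>');
    $paragraphs = $crawler->filterXPath('//body/p');

    $second = $paragraphs->eq(1)->text();   // 'second'
    $first = $paragraphs->first()->text();  // 'first'
    $last = $paragraphs->last()->text();    // 'third'

    // every traversal call returns a new Crawler, so results can be counted or chained
    $siblingCount = $paragraphs->eq(1)->siblings()->count();       // 2
    $followingCount = $paragraphs->first()->nextAll()->count();    // 2
    $precedingCount = $paragraphs->last()->previousAll()->count(); // 2
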
Accessing Node Values
~~~~~~~~~~~~~~~~~~~~~
Access the node name (HTML tag name) of the first node of the current selection (e.g. "p" or "div")::
// returns the node name (HTML tag name) of the first child element under <body>
$tag = $crawler->filterXPath('//body/*')->nodeName();
Access the value of the first node of the current selection::
// if the node does not exist, calling text() will result in an exception
$message = $crawler->filterXPath('//body/p')->text();
// avoid the exception by passing a default value that text() returns when the node does not exist
$message = $crawler->filterXPath('//body/p')->text('Default text content');
// pass TRUE as the second argument of text() to remove all extra white spaces, including
// the internal ones (e.g. " foo\n bar baz \n " is returned as "foo bar baz")
$crawler->filterXPath('//body/p')->text('Default text content', true);
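
A short sketch of the three cases described above: an existing node, a missing
node with a default value, and a missing node without one. The markup and the
autoloader path are assumptions made for this example:

.. code-block:: php

    require_once __DIR__.'/vendor/autoload.php'; // assumed Composer install path

    use Symfony\Component\DomCrawler\Crawler;

    // sample markup invented for this example (note the extra white space)
    $crawler = new Crawler('<html><body><p>  Hello   World!  </p></body></html>');

    // existing node, white space normalized explicitly via the second argument
    $normalized = $crawler->filterXPath('//body/p')->text(null, true); // 'Hello World!'

    // missing node: the default value is returned instead of throwing
    $fallback = $crawler->filterXPath('//body/h1')->text('No heading');

    // missing node without a default value: an exception is thrown
    $threw = false;
    try {
        $crawler->filterXPath('//body/h1')->text();
    } catch (\InvalidArgumentException $e) {
        $threw = true;
    }
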
Access the attribute value of the first node of the current selection::
$class = $crawler->filterXPath('//body/p')->attr('class');
Extract attribute and/or node values from the list of nodes::
$attributes = $crawler
->filterXPath('//body/p')
->extract(['_name', '_text', 'class'])
;
.. note::
Special attribute ``_text`` represents a node value, while ``_name``
represents the element name (the HTML tag name).
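
To make the shape of the result concrete, here is a sketch under the assumption
of a two-paragraph document where only the first paragraph has a ``class``
attribute (the markup and the autoloader path are invented for the example):

.. code-block:: php

    require_once __DIR__.'/vendor/autoload.php'; // assumed Composer install path

    use Symfony\Component\DomCrawler\Crawler;

    // sample markup invented for this example
    $crawler = new Crawler(
        '<html><body><p class="message">Hello World!</p><p>Hello Crawler!</p></body></html>'
    );

    // several attributes: one array per node, values in the requested order
    $rows = $crawler->filterXPath('//body/p')->extract(['_name', '_text', 'class']);
    // [['p', 'Hello World!', 'message'], ['p', 'Hello Crawler!', '']]

    // a single attribute: a flat list with one value per node
    $classes = $crawler->filterXPath('//body/p')->extract(['class']);
    // ['message', '']

A missing attribute shows up as an empty string rather than ``null``.
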
Call an anonymous function on each node of the list::
use Symfony\Component\DomCrawler\Crawler;
// ...
$nodeValues = $crawler->filter('p')->each(function (Crawler $node, $i) {
return $node->text();
});
The anonymous function receives the node (as a Crawler) and the position as arguments.
The result is an array of values returned by the anonymous function calls.
When using a nested crawler, beware that ``filterXPath()`` is evaluated in the
context of the crawler::
$crawler->filterXPath('parent')->each(function (Crawler $parentCrawler, $i) {
// DON'T DO THIS: the direct child cannot be found
$subCrawler = $parentCrawler->filterXPath('sub-tag/sub-child-tag');
// DO THIS: specify the parent tag too
$subCrawler = $parentCrawler->filterXPath('parent/sub-tag/sub-child-tag');
$subCrawler = $parentCrawler->filterXPath('node()/sub-tag/sub-child-tag');
});
Adding the Content
~~~~~~~~~~~~~~~~~~
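The crawler supports multiple ways of adding the content::
// pass the content to the constructor...
$crawler = new Crawler('<html><body/></html>');
// ...or add it to an empty crawler (use one content source per crawler instance)
$crawler = new Crawler();
$crawler->addHtmlContent('<html><body/></html>');
The ``evaluate()`` method evaluates the given XPath expression. Scalar results
are returned as an array, while node sets are returned as a new ``Crawler``
instance, as the following examples show::
$html = '<html>
<body>
<span id="article-100" class="article">Article 1</span>
<span id="article-101" class="article">Article 2</span>
<span id="article-102" class="article">Article 3</span>
</body>
</html>';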
$crawler = new Crawler();
$crawler->addHtmlContent($html);
$crawler->filterXPath('//span[contains(@id, "article-")]')->evaluate('substring-after(@id, "-")');
/* Result:
[
0 => '100',
1 => '101',
2 => '102',
];
*/
$crawler->evaluate('substring-after(//span[contains(@id, "article-")]/@id, "-")');
/* Result:
[
0 => '100',
]
*/
$crawler->filterXPath('//span[@class="article"]')->evaluate('count(@id)');
/* Result:
[
0 => 1.0,
1 => 1.0,
2 => 1.0,
]
*/
$crawler->evaluate('count(//span[@class="article"])');
/* Result:
[
0 => 3.0,
]
*/
$crawler->evaluate('//span[1]');
// A Symfony\Component\DomCrawler\Crawler instance
Links
~~~~~
Use the ``filter()`` method to find links by their ``id`` or ``class``
attributes and use the ``selectLink()`` method to find links by their content
(it also finds clickable images with that content in its ``alt`` attribute).
Both methods return a ``Crawler`` instance with just the selected link. Use the
``link()`` method to get the :class:`Symfony\\Component\\DomCrawler\\Link` object
that represents the link::
// first, select the link by id, class or content...
$linkCrawler = $crawler->filter('#sign-up');
$linkCrawler = $crawler->filter('.user-profile');
$linkCrawler = $crawler->selectLink('Log in');
// ...then, get the Link object:
$link = $linkCrawler->link();
// or do all this at once:
$link = $crawler->filter('#sign-up')->link();
$link = $crawler->filter('.user-profile')->link();
$link = $crawler->selectLink('Log in')->link();
The :class:`Symfony\\Component\\DomCrawler\\Link` object has several useful
methods to get more information about the selected link itself::
// returns the proper URI that can be used to make another request
$uri = $link->getUri();
.. note::
The ``getUri()`` method is especially useful as it cleans the ``href`` value and
transforms it into how it should really be processed. For example, for a
link with ``href="#foo"``, this would return the full URI of the current
page suffixed with ``#foo``. The return from ``getUri()`` is always a full
URI that you can act on.
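
The sketch below shows this resolution for an anchor and for an absolute path.
It assumes the current page URI is passed as the second constructor argument of
the ``Crawler``. The markup, the ``https://example.com`` URIs and the autoloader
path are invented for the example:

.. code-block:: php

    require_once __DIR__.'/vendor/autoload.php'; // assumed Composer install path

    use Symfony\Component\DomCrawler\Crawler;

    // sample markup invented for this example
    $html = '<html><body><a href="#comments">Comments</a><a href="/about">About</a></body></html>';

    // the second argument is the URI of the page the content came from
    $crawler = new Crawler($html, 'https://example.com/blog/post');

    // an anchor is appended to the current page URI
    $anchorUri = $crawler->selectLink('Comments')->link()->getUri();
    // 'https://example.com/blog/post#comments'

    // an absolute path is resolved against the scheme and host
    $aboutUri = $crawler->selectLink('About')->link()->getUri();
    // 'https://example.com/about'
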
Images
~~~~~~
To find an image by its ``alt`` attribute, use the ``selectImage`` method on an
existing crawler. This returns a ``Crawler`` instance with just the selected
image(s). Calling ``image()`` gives you a special
:class:`Symfony\\Component\\DomCrawler\\Image` object::
$imagesCrawler = $crawler->selectImage('Kitten');
$image = $imagesCrawler->image();
// or do this all at once
$image = $crawler->selectImage('Kitten')->image();
The :class:`Symfony\\Component\\DomCrawler\\Image` object has the same
``getUri()`` method as :class:`Symfony\\Component\\DomCrawler\\Link`.
Forms
~~~~~
Special treatment is also given to forms. A ``selectButton()`` method is
available on the Crawler which returns another Crawler that matches ``