Puppeteer lets us automate a web browser, which includes using JavaScript to get DOM elements on the page. In an ordinary browser, we would open the developer tools and write JavaScript in the console to get elements.
In Puppeteer, we do the same thing from our script. There are two ways to do this: page.$ and page.$eval.
To get an element from a web page loaded by Puppeteer, we can call page.$, which runs document.querySelector in the browser.
First, let's create the basic scaffolding for our Puppeteer application, which just opens a web page:
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com');
await browser.close();
})();
Now, after the call to page.goto and before the call to browser.close, we add a line that gets the first p tag on the page:
const getPTag = await page.$('p');
In the browser, this runs:
document.querySelector('p');
Let's also add a console.log(getPTag) to see what we get back.
ElementHandle {
_disposed: false,
_context: ExecutionContext {
_client: CDPSession {
eventsMap: [Map],
emitter: [Object],
_callbacks: Map(0) {},
_connection: [Connection],
_targetType: 'page',
_sessionId: 'F70B3B16423F4F2AD90513F2EBA7F79A'
},
_world: DOMWorld {
_documentPromise: [Promise],
_contextPromise: [Promise],
_contextResolveCallback: null,
_detached: false,
_waitTasks: Set(0) {},
_boundFunctions: Map(0) {},
_ctxBindings: Set(0) {},
_settingUpBinding: null,
_frameManager: [FrameManager],
_frame: [Frame],
_timeoutSettings: [TimeoutSettings]
},
_contextId: 3,
_contextName: ''
},
This doesn't contain anything directly useful to us, so we need to call the handle's getProperty function and pass in the string 'innerHTML':
const getInnerHTMLProperty = await getPTag.getProperty('innerHTML');
Let's console.log getInnerHTMLProperty as well.
JSHandle {
_disposed: false,
_context: ExecutionContext {
_client: CDPSession {
eventsMap: [Map],
emitter: [Object],
_callbacks: Map(0) {},
_connection: [Connection],
_targetType: 'page',
_sessionId: '920E56028019813F1E7E01A0EF2343DE'
},
_world: DOMWorld {
_documentPromise: [Promise],
_contextPromise: [Promise],
_contextResolveCallback: null,
_detached: false,
_waitTasks: Set(0) {},
_boundFunctions: Map(0) {},
_ctxBindings: Set(0) {},
_settingUpBinding: null,
_frameManager: [FrameManager],
_frame: [Frame],
_timeoutSettings: [TimeoutSettings]
},
_contextId: 3,
_contextName: ''
},
_client: CDPSession {
eventsMap: Map(29) {
'Fetch.requestPaused' => [Array],
'Fetch.authRequired' => [Array],
'Network.requestWillBeSent' => [Array],
There is one last call to make: we need to get the JSON value of this handle.
const getPtagValue = await getInnerHTMLProperty.jsonValue();
When we log getPtagValue and run node index.js one last time, we get the text inside the first p tag:
This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.
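The three steps above can be collected into one helper. This is a minimal sketch rather than the only way to do it: the function name getInnerHtml is hypothetical, and the null check is there because page.$ resolves to null when nothing matches the selector.

```javascript
// Hypothetical helper (the name is an assumption, not part of Puppeteer):
// chains page.$ -> getProperty -> jsonValue for a single element.
async function getInnerHtml(page, selector) {
  // page.$ resolves to an ElementHandle, or to null if nothing matches.
  const handle = await page.$(selector);
  if (handle === null) {
    return null;
  }
  // getProperty resolves to a JSHandle that wraps the property value.
  const property = await handle.getProperty('innerHTML');
  // jsonValue serializes the wrapped value and sends it back to Node.js.
  return property.jsonValue();
}
```

Inside the scaffolding from earlier, this could be called as console.log(await getInnerHtml(page, 'p')) after the call to page.goto.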
We needed three awaited values to get the result we wanted. This was fairly clean because we've been using async and await.
But we could also get the value with a single awaited call using $eval. Let's create a new file for this implementation and add the basic scaffolding that opens a page.
Now, after page.goto, add the following line of code:
const getPTag = await page.$eval('p', (pTag) => pTag.innerHTML);
Then add a console.log(getPTag) after declaring it.
Now when we run node evaluateFunction.js, we also get back the contents of the first p tag.
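For reference, here is one way the whole $eval flow might look as a single function. It is a sketch under a few assumptions: the name printFirstParagraph is hypothetical, the puppeteer module is passed in as a parameter purely for illustration, and the try/finally is an addition so the browser closes even if $eval rejects (which it does when no element matches the selector).

```javascript
// Hypothetical wrapper around the $eval approach. The function name and the
// choice to pass the puppeteer module as an argument are assumptions.
async function printFirstParagraph(puppeteer, url) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    // The callback runs inside the browser; its return value is
    // serialized and sent back to Node.js.
    const text = await page.$eval('p', (pTag) => pTag.innerHTML);
    console.log(text);
    return text;
  } finally {
    // Close the browser whether or not the steps above succeeded.
    await browser.close();
  }
}
```

In evaluateFunction.js, this could be invoked with printFirstParagraph(require('puppeteer'), 'https://example.com').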
Generally, using page.$eval is recommended because it gets the value in a single call.
But page.$ has its own benefits: since it returns a Puppeteer ElementHandle, we get methods that act on the element in the browser.
For example, the ElementHandle's click function doesn't just get the element and call .click or dispatch a click event; it scrolls the element into view if needed and then clicks it.
ElementHandle also has drag-and-drop functionality that drags an element and drops it onto another element.
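As a rough illustration of using the handle returned by page.$, the sketch below clicks the first element matching a selector. The helper name clickFirst is hypothetical, and throwing on a missing element is just one possible way to handle the null case.

```javascript
// Hypothetical helper: clicks the first element matching `selector`
// through its ElementHandle.
async function clickFirst(page, selector) {
  const handle = await page.$(selector);
  if (handle === null) {
    throw new Error(`No element matches selector: ${selector}`);
  }
  // ElementHandle.click scrolls the element into view if needed,
  // then clicks it.
  await handle.click();
}
```

For instance, await clickFirst(page, 'a') would click the first link on the page, assuming the page has one.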
If you want to get all matching elements (as document.querySelectorAll does), use page.$$ instead of page.$ and page.$$eval instead of page.$eval.
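A short sketch of both plural variants follows. The helper names getAllInnerHtml and countMatches are hypothetical; the page.$$ and page.$$eval calls are the Puppeteer methods mentioned above.

```javascript
// Hypothetical helper: collects the innerHTML of every matching element.
async function getAllInnerHtml(page, selector) {
  // The callback runs in the browser and receives an array of the
  // matched elements.
  return page.$$eval(selector, (elements) =>
    elements.map((element) => element.innerHTML)
  );
}

// Hypothetical helper: counts how many elements match the selector.
async function countMatches(page, selector) {
  // page.$$ resolves to an array of ElementHandles, which is empty
  // when nothing matches.
  const handles = await page.$$(selector);
  return handles.length;
}
```

Both could be called after page.goto, for example console.log(await getAllInnerHtml(page, 'p')).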