Introduction

As more businesses recognize the value of data-driven decisions, data scraping has become increasingly common. While there are various methods of getting data from the internet, this guide covers the Puppeteer tutorial.

Big businesses often need large volumes of data on a frequent basis, so automation has become a necessity. However, even with automation, you may still encounter challenges during data scraping. Many modern websites render their content with JavaScript, so the data you need is often not present in the raw HTML code.

This is where headless browsers come in. This Puppeteer tutorial explores how to automate a headless browser to provide an effective solution for scraping JavaScript-heavy websites.

In addition, the Puppeteer tutorial will examine the functionalities, advantages, and limitations of using Puppeteer for web scraping, as well as answer some FAQs.

But before we dive into the Puppeteer tutorial, let us briefly examine headless browsers.

Headless Browsers

Headless browsers are similar to regular browsers, except they don't have a graphical user interface (GUI). They offer the same functionality as regular browsers but run faster and consume fewer resources. Therefore, headless browsers streamline the process of data scraping.

Chromium, Chrome, and Firefox are popular browsers that support headless mode. Headless browsers are controlled programmatically, and one of the most common tools for doing so is Selenium. However, if you need a fast headless browser for web scraping, read on: Puppeteer may be your ideal solution.

Puppeteer Tutorial: What is it, and how does it work?

Before going into the complex aspects of the Puppeteer tutorial, let us understand the basics. This is necessary to get familiar with the terminology as the Puppeteer tutorial progresses.

Google Puppeteer is a Node.js library that provides a high-level API to control headless browsers over the DevTools Protocol. In simpler terms, it makes web automation easier, which makes it excellent for web scraping. You can use Puppeteer to automate tasks like filling out forms, clicking links, and submitting buttons with JavaScript commands.

Since Google built Puppeteer, many teams have used it with Chrome for headless browser automation. This tool allows you to control headless browsers programmatically and use them to render JavaScript-heavy pages.

According to the official documentation, two packages/libraries are maintained for users:

  1. puppeteer: This is the primary package designed for browser automation. It downloads a version of Chromium when installed, so it is often described as the end-user product of the project.
  2. puppeteer-core: This is the Puppeteer library that can interact with any browser that supports the DevTools Protocol. It does not download a browser on installation and is described as the backend of this automation tool.

Puppeteer tutorial installation process

Before we dive deeper into the Puppeteer tutorial, let us examine the installation process.

Node installation

This is the first thing to install for the Puppeteer tutorial; you cannot install Puppeteer without Node.js. Node.js ships with npm, the package manager you will use to install Puppeteer.

On macOS, use Homebrew's “brew install node” command to install Node.js (and npm) on your device. Once it is installed, you can verify the installation with the commands shown below:

“node -v” and “npm -v”

Package installation

The next step in this Puppeteer tutorial is package installation. Once you have verified that Node.js works, create a project folder on your computer. Navigate into the folder and run the initialization command as shown below:

“npm init -y”

This creates a package.json file, which will hold your project's dependencies (including Puppeteer, once installed) and any test scripts.

N.B.: Set the “main” field in package.json to the script file you want to run.

Puppeteer installation

To install Puppeteer, execute the command below from a terminal in the working directory that contains the package.json file.

This command, “npm install --save puppeteer”, installs Puppeteer along with a version of Chromium that is guaranteed to work with the API, which streamlines the installation process.

Moving on, you load the library in your script with “require('puppeteer')”, which makes the Puppeteer API available in the file.
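
To tie these steps together, here is a minimal first script. It is only a sketch, assuming the bundled Chromium downloaded successfully during installation; the URL is illustrative:

  const puppeteer = require('puppeteer');

  (async () => {
    // launch the bundled Chromium in headless mode
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://example.com'); // illustrative URL
    console.log(await page.title());
    await browser.close();
  })();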

Puppeteer-core

The puppeteer-core package is not a suitable option for everyone. It is necessary to mention in the Puppeteer tutorial that this package does not download any browser by default. Therefore, if you want to use a pre-existing browser, this package may be ideal for your needs. In that case, you have to pass an “executablePath” option that contains the path to the browser.
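
As a rough sketch, connecting puppeteer-core to an existing Chrome installation might look like the following; the executable path is an assumption and will differ by operating system:

  const puppeteer = require('puppeteer-core');

  (async () => {
    const browser = await puppeteer.launch({
      executablePath: '/usr/bin/google-chrome', // assumed path; adjust for your system
    });
    const page = await browser.newPage();
    await page.goto('https://example.com'); // illustrative URL
    await browser.close();
  })();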

Environment variables

Another part of the Puppeteer tutorial is setting environment variables, which help you customize Puppeteer. This is crucial, especially if you prefer to specify the version of Chromium to use, or to skip downloading a browser alongside Puppeteer.

Here are two environment variables:

PUPPETEER_EXECUTABLE_PATH: This environment variable lets you point Puppeteer at a browser of your choice. You can set it to the path of the Chrome browser on your device.

PUPPETEER_SKIP_CHROMIUM_DOWNLOAD: This environment variable tells the package to skip the Chromium download during installation.
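
For example, on a Unix-like shell, you might set these variables as follows; the path and values are illustrative:

  PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true npm install puppeteer
  PUPPETEER_EXECUTABLE_PATH=/usr/bin/google-chrome node index.js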

Puppeteer Tutorial: Launching the headless browser and its functionalities

Now that we have installed the required packages and customized the environment variables for the Puppeteer tutorial, let's explore how to launch the headless browser and its main functionalities. These include:

Launch the browser

You can launch the browser with Puppeteer's “launch()” method. Note that for the Puppeteer tutorial, the browser is launched in headless mode by default.

const browser = await puppeteer.launch();

However, you can pass an options object to customize the launch. If you set “headless” to false, the package will open a full version of the browser instead of headless mode.

You can customize it like this: “launch({ headless: false })”

Browser size

Another functionality we will consider in this Puppeteer tutorial is the browser size. After launching the browser, you can decide to maximize the screen. 

Including this feature in the Puppeteer tutorial is necessary because pages open with a default viewport of 800x600px. However, you can change the browser size to suit your preferences, as sketched below.
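
A short sketch of overriding the default viewport after opening a page; the dimensions are illustrative:

  const page = await browser.newPage();
  await page.setViewport({ width: 1366, height: 768 }); // illustrative dimensions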

Slow Mo

The “slowMo” option is a useful functionality in this Puppeteer tutorial. It slows down the operations Puppeteer carries out by the number of milliseconds you specify.

For example, if you want to slow down Puppeteer's operations by 300 milliseconds, the code will look like this: “launch({ headless: false, slowMo: 300 })”

Chrome DevTools

Once the browser is running, you can right-click on the page and select “Inspect.” This opens DevTools, where you can debug the application's browser code.

For this Puppeteer tutorial, you can instead use an option that opens DevTools directly, letting you query the DOM and interact with the panels easily.

It should look like this: “launch({ devtools: true })”

Launch URL

Another functionality in this Puppeteer tutorial is launching a URL. This step is critical because other activities, like web scraping, rely heavily on it; you cannot scrape a website that has not been loaded. You open a website using the “goto()” function.

It should look like “await page.goto(url)”
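
In practice, you usually await the navigation and may wait for network activity to settle before interacting with the page. A sketch with an illustrative URL and one common wait strategy:

  await page.goto('https://example.com', { waitUntil: 'networkidle2' }); // illustrative URL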

Screenshot

The screenshot is a useful functionality in the Puppeteer tutorial. When you run the command, Puppeteer captures an image of the page as the browser runs, using the same high-level DevTools API described earlier.

Once the screenshot has been successfully taken, an image file with the name you specify (for example, “screenshot.png”) will be created in your work folder.
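
A minimal sketch; the file name is an assumption, and the “fullPage” option captures the entire scrollable page rather than just the viewport:

  await page.screenshot({ path: 'screenshot.png', fullPage: true }); // assumed file name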

Switch to a new tab

One of the issues people encounter in headless scraping is working with multiple tabs, so it is important to address it in this Puppeteer tutorial. You can open a new tab with “browser.newPage()” and bring it into focus with “page.bringToFront()”.
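
A sketch of opening a second tab and switching focus to it; the URL is illustrative:

  const newTab = await browser.newPage();
  await newTab.goto('https://example.com/other'); // illustrative URL
  await newTab.bringToFront();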

Scraping an element

Before you scrape an element, you must define the URL and the specific elements you need. Once you launch the headless browser, it sends a request to the web address and receives the HTML content in return. Therefore, it is necessary for the Puppeteer tutorial to address this function. Here are the steps in scraping an element:

  • Send the HTTP request
  • Parse the HTTP response and extract the required data
  • Save the data on your computer

N.B.: Be sure to specify the elements you need, such as the title, header, table, or year of production, in the code.

Furthermore, you can scrape more than one element from a website at a time. The first step in the Puppeteer tutorial is to use “querySelectorAll” to get all the elements that match a selector, as sketched below.
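
A rough sketch of collecting every matching element inside “page.evaluate()”; the “.item” selector is a hypothetical example you would replace with a real one:

  const items = await page.evaluate(() =>
    Array.from(document.querySelectorAll('.item')) // hypothetical selector
      .map((el) => el.textContent.trim())
  );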

Debugger

After executing the scraping activities, the next step in the Puppeteer tutorial is configuring the debugger in the automation process. Use the code “await page.evaluate(() => { debugger; });” to pause execution and inspect the current page's DOM in Chrome DevTools.

PDF

Converting an HTML page to a PDF is a handy feature in this Puppeteer tutorial. In addition, you can modify the page layout and manipulate some of the HTML elements.

This functionality is especially useful for web scraping when you want to save pages in PDF format, so the PDF style must be defined. For example, if you define your format as A4, the code will look something like this: “pdf({ format: 'A4' })”
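
A short sketch that writes the current page to disk; the file name is an assumption, and note that older Puppeteer versions only support PDF generation in headless mode:

  await page.pdf({ path: 'page.pdf', format: 'A4', printBackground: true }); // assumed file name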

Element value

It is imperative that this Puppeteer tutorial shows how to get an element's value. The function for this is “page.$eval()”. For example, you can target a “Title Text” element as the one whose value you want to get.

The function takes two parameters as arguments: the first is the selector, and the second is a function that receives the matched element, such as “(ele) => ele.textContent”.
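
A one-line sketch; the “#title” selector is hypothetical:

  const titleText = await page.$eval('#title', (ele) => ele.textContent); // hypothetical selector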

Element count

Element count is a simple yet useful functionality in the Puppeteer tutorial. You can use the “page.$$eval()” function to determine the number of elements that match a selector on a specific page.
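
For example, counting all the links on a page might look like this sketch:

  const linkCount = await page.$$eval('a', (links) => links.length);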

Type

Since most websites have input fields, the Puppeteer tutorial will consider this feature, which lets you type into inputs on a web page. Use Puppeteer's “page.type()” method, which takes a CSS selector identifying the element you want to type into and the string you want typed into the field.
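
A sketch with a hypothetical selector; the optional “delay” (in milliseconds) makes the typing more human-like:

  await page.type('#search', 'puppeteer tutorial', { delay: 100 }); // hypothetical selector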

Dropdown

The dropdown is another feature we shall consider in the Puppeteer tutorial. Puppeteer has a “page.select(selector, value)” method to obtain the value from the dropdown. It takes two arguments as input: the first is the selector, while the second is the value.
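
A sketch; the selector and option value are hypothetical:

  await page.select('#country', 'US'); // hypothetical selector and value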

Checkbox

The checkbox is another element in the Puppeteer tutorial that we can manage by passing two inputs to the click method. The first input is the selector for the option you want to select, while the second is an options object such as the click count.
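
A sketch with a hypothetical selector:

  await page.click('#accept-terms', { clickCount: 1 }); // hypothetical selector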

Click

The final functionality to consider in the Puppeteer tutorial is click. The “page.click()” method lets you click on any element. Finding the element may be a bit difficult, but once you have located it, you can use the click function.
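
Because an element may not exist yet when your script reaches it, waiting for it first is a common pattern; the selector below is hypothetical:

  await page.waitForSelector('.submit-button'); // hypothetical selector
  await page.click('.submit-button');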

Puppeteer Tutorial: How to set up

Before setting up your Puppeteer project, you must have gone through the installation process: node installation and package installation. Here is a step-by-step Puppeteer tutorial for success:

Set up a directory

The first step in the Puppeteer tutorial is setting up a directory. Once you have downloaded all the prerequisites, create a folder to hold your JavaScript files. The next step is to navigate to it and run the “npm init -y” command, which creates a package.json file in the directory.

Run Puppeteer

As mentioned earlier, Puppeteer comes with a compatible version of a browser. In other words, for this Puppeteer tutorial, when you install the package, it automatically downloads a browser version that will work with your Puppeteer version.

Create a new file

The next step in this Puppeteer tutorial is to create a new file in your node package directory. You can name the file whatever suits your needs.

Now, you can launch your browser with the “const browser = await puppeteer.launch()” command. This will open your browser in headless mode, but you can add an object in the brackets, “({ headless: false })”, if you need a browser with a graphical user interface.

Open a page

You can open a web page with the command “const page = await browser.newPage()”. Once your page opens, you can load any website with the “goto()” function, with the URL of the site in the brackets.

We recommend that you take a screenshot to verify the DOM elements and the rendered page are available; this creates a new PNG file in the same directory. Remember, earlier in the Puppeteer tutorial, we mentioned how to customize the size of the page. Once you are done with the page, close the browser with the “await browser.close()” command.

Begin scraping

Once you have launched the page, the next step in the Puppeteer tutorial is to start scraping. Open the web address in your preferred browser, right-click anywhere on the page, and select “Inspect” from the dropdown menu. DevTools will open with the elements panel, where you can identify the IDs and classes of the elements you need.

Now that you have identified these characteristics, you can extract the data using the “document.querySelector('#elementid')” command. Puppeteer loads a page's DOM when you use it to render a website, so you can easily extract any type of data from it. In addition, you can use the “evaluate()” function to execute JavaScript functions with ease.

You can also use the “querySelectorAll” function to scrape multiple elements from a page; it retrieves all elements on the page that match the selector. Then, create an array and use the “map()” function to process and return each element in it. Wrap these commands in the “page.evaluate()” function.

Save the result in a variable to complete the functionality, and remember to close the browser when you have finished scraping, as in the end-to-end sketch below.
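
Putting these setup steps together, here is a minimal end-to-end sketch; the URL and the “h2” selector are illustrative assumptions:

  const puppeteer = require('puppeteer');

  (async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://example.com'); // illustrative URL
    await page.screenshot({ path: 'page.png' });

    // run in the page context and collect all matching elements
    const headings = await page.evaluate(() =>
      Array.from(document.querySelectorAll('h2')).map((el) => el.textContent.trim())
    );
    console.log(headings);

    await browser.close();
  })();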

Puppeteer Tutorial: Using Python for web scraping

Pyppeteer is an unofficial Python port of the JavaScript library Puppeteer. This library is ideal for rendering and scraping JavaScript-heavy websites. It is quite similar to Selenium but limited to Chromium-based browsers.

This Puppeteer tutorial would be incomplete without mentioning how to use Pyppeteer for web scraping. Pyppeteer may be an excellent option if you are partial to Python web scraping.

However, before you go into web data extraction in the Puppeteer tutorial, you need to download and install some required libraries. 

First, generate a Python virtual environment with pipenv and install the following:

  • $ pipenv shell
  • $ pipenv --three
  • $ pipenv install pyppeteer

After installing the above, you will have the basic tools to use Pyppeteer. The library automatically downloads Chromium, which allows you to launch it in headless mode when necessary. 

The next step in the Puppeteer tutorial is to add the following imports to your script:

  • import pprint
  • import asyncio
  • from pyppeteer import launch

You can use an “extract_all()” function as an entry point, which receives your target URL and calls a “get_browser()” function. The browser's default mode is headless, so if you need the graphical user interface, set the headless parameter to False before launching the browser.

Our next step in this Puppeteer tutorial is to go through the URL dictionary with the “extract” function and use the “get_page” function to open a new tab in the browser. You can load the URL of the site you're trying to scrape at this stage.

Now, you are almost ready to retrieve data from the web page. Use the “extract_data” function and choose your “tr” nodes via an XPath selector.
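
Since the helper functions above are only described, not shown, here is a condensed, hedged Python sketch of the same flow; the function name, URL, and “//tr” XPath selector are illustrative assumptions:

  import asyncio
  from pyppeteer import launch

  async def extract_all(url):
      # launch the bundled Chromium in headless mode
      browser = await launch(headless=True)
      page = await browser.newPage()
      await page.goto(url)
      # select table rows via XPath; '//tr' is an illustrative selector
      rows = await page.xpath('//tr')
      data = [await page.evaluate('(el) => el.textContent', row) for row in rows]
      await browser.close()
      return data

  results = asyncio.get_event_loop().run_until_complete(
      extract_all('https://example.com'))  # illustrative URL
  print(results)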

Advantages of Puppeteer for web scraping

The primary goal of Puppeteer is to provide users with a library that takes full advantage of the DevTools Protocol. In addition, its high-level API is highly intuitive, which makes it easy to understand and use.

Here are the advantages of Puppeteer for web scraping:

  • Lightweight and easy to use

The primary advantage we shall consider in this Puppeteer tutorial is that it is lightweight and easy to use. Developed by Google, Puppeteer is the canonical library for harnessing the power of the DevTools Protocol. In addition, it comes with a version of the browser that is kept in sync with it for streamlined operations.

Furthermore, Puppeteer stands out for its straightforward configuration, which makes it easy to set up and use.

  • Screenshot and PDF features

Another advantage covered in this Puppeteer tutorial is the invaluable features described earlier. Beyond web scraping, Puppeteer allows users to capture screenshots. In addition, you can export pages in PDF format and customize their size.

  • It does not depend on WebDrivers

One thing that stands out in this Puppeteer tutorial is that Puppeteer works without WebDrivers. It runs directly on browsers in headless mode, which is why it is often used as an alternative to Selenium; it frequently outperforms WebDriver-based tools in terms of speed.

  • Strong community support

Puppeteer has a strong and active community. Therefore, users can enjoy all the resources and support from other experts. In addition, it offers comprehensive official documentation, making it easy to understand, use, and troubleshoot. 

  • Performance and speed

Puppeteer minimizes performance overhead by running the browser out-of-process, which also improves security and makes it a safer tool for scraping potentially malicious sites. As mentioned in this Puppeteer tutorial, its intuitive high-level API further helps performance.

In addition, Puppeteer stands out because of its fast execution speed, which makes it great for various activities.

Limitations of Puppeteer

As this Puppeteer tutorial shows, Puppeteer is an unarguably powerful tool for web scraping. However, it is necessary to consider its limitations before using it. Some of the limitations include:

  • Thin wrapper nature

Puppeteer is not a complete framework; it is primarily a thin wrapper around the DevTools Protocol. Therefore, it may lack some features available in a complete automation framework.

  • Limited browser support

Puppeteer has limited versatility in terms of browser support. It works with Chromium, Chrome, and Firefox, so it may not be ideal when you need to interact with other browsers.

  • Focuses on JavaScript

Another limitation to consider in this Puppeteer tutorial is that it is JavaScript-centric. In other words, it has limited support for, and compatibility with, other programming languages. If JavaScript is not your team's strength, you may need to consider another option.

Puppeteer tutorial: How to streamline web scraping with NetNut proxies

Many websites have implemented strategies to protect their data, and they can often identify headless browsers and block their activities. While web scraping is legal, anti-bot mechanisms can make it genuinely challenging.

Therefore, organizations and companies serious about getting constant, up-to-date data need a technique for bypassing these anti-bot measures: proxies.

NetNut offers various solutions that can streamline the process of web data extraction. One of the most widely used is rotating proxies, which provide an unmatched degree of anonymity and can help minimize IP bans.

Another prevalent challenge in web scraping is CAPTCHAs, challenges designed to block automated traffic. With NetNut proxies, you are less likely to trigger them, which streamlines the process of web data extraction.

Whatever your web scraping needs, NetNut is a dependable partner. If you need customized mobile solutions, NetNut has you covered.

NetNut has an extensive network of over 52 million rotating residential proxies in 200 countries and over 250,000 mobile IPs in over 100 countries, which helps it provide exceptional data collection services.

Conclusion

This Puppeteer tutorial has examined several critical aspects of web scraping. The guide explored headless browsers and how they can optimize the process of web data extraction. The installation process for Puppeteer includes downloading and installing the required packages. Some of the functionalities covered in this Puppeteer tutorial include launching the headless browser, setting the browser size, taking screenshots, generating PDFs, and handling clicks, checkboxes, dropdowns, and more.

Web scraping gives your business incredible leverage over your competitors, and using Puppeteer for your web scraping activities allows for maximum customization. However, given the challenges of web scraping, proxies are a significant part of the process.

Regardless of how often you scrape, you need to optimize your web scraping activities. Contact NetNut today to get the best-customized solution for you.

Frequently Asked Questions

Is Puppeteer a replacement for Selenium?

No, Puppeteer was not designed to replace Selenium. While Selenium prioritizes cross-browser automation, Puppeteer focuses on Chromium. Also, Selenium aims to provide a standard API that works across all major browsers. On the other hand, Puppeteer aims to offer extensive functionality and higher reliability.

Why do websites implement anti-scraping mechanisms?

Making data-driven decisions can give businesses a competitive advantage. However, the concept of web scraping remains highly controversial due to the misuse of data, which can take various forms, such as:

  • Plagiarism: Since web scraping involves collecting data, reproducing the data without permission from the author is plagiarism. 
  • Spamming: Web scraping allows the gathering of contact information like emails and phone numbers, which can then be used to spam people.
  • Identity theft: People with bad intentions can use web scraping to gather personal data from social media platforms, which allows them to impersonate others.

How can I keep Puppeteer scraping ethical?

There are some rules that can keep your Puppeteer scraping ethical. First, explore the website to find the robots.txt file, if it is available. It often indicates what kind of data you can extract and how frequently you may send requests.

Secondly, read the terms and conditions of the website you want to scrape (especially if you cannot locate the robots.txt file). Too often, we simply select “I agree” without any knowledge of what we consented to.

Always remember that the data you extract is not your property, so it should be treated with respect. For example, if you want to publish part of the data, be sure to give appropriate credit to the owners. In addition, avoid aggressive scraping, which can cause problems for the website's servers.

About the Author

Ivan Kolinovski is a highly skilled Full Stack Developer based in Tel Aviv, Israel. He has over three years of experience working with cutting-edge technology stacks, including MEAN/MERN/LEMP. His expertise includes Git version control, making him a valuable asset to NetNut's development team.