How to crawl multiple web pages using Python


What is a web crawler?

A web crawler, also called a spider or robot, is like a digital explorer that automatically visits websites and collects information.

Crawlers are mostly used by search engines like Google to find and index web pages so they can show you relevant search results.

Web crawlers can also be used for other things like collecting data or checking for changes on websites.

Some websites don't like web crawlers, so they might block them from visiting their site.

The best Python libraries to crawl a website

To crawl a website you can use two different libraries: requests and selenium.

Both requests and selenium are popular Python libraries, but they serve different purposes.

What does the requests library do?

🕸️requests is like a spider crawling a web, used for fetching data from websites such as HTML pages, images, and videos. It's great for crawling sites that don't have dynamic content or require user interaction.
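
For example, fetching a single page takes just one call. Here's a minimal sketch (the URL is a placeholder):

import requests

# Fetch a single page (placeholder URL)
response = requests.get("https://www.example.com")

# The raw HTML is available as text
print(response.status_code)
print(response.text[:200])  # first 200 characters of the HTML
Fetching a single page with the requests library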

What does the selenium library do?

🌐selenium, on the other hand, is like a person browsing the web.

It allows you to automate web browsers, simulate user actions like clicking buttons or filling out forms, and extract data from websites.

It's useful for crawling sites that have dynamic content and require user interaction.
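
For example, filling in a search box and submitting it could look like this. This is a rough sketch: the URL and the field name "q" are made-up placeholders, not a real site.

from selenium import webdriver
from selenium.webdriver.common.by import By

# Open a browser and load a (placeholder) page with a search form
driver = webdriver.Chrome()
driver.get("https://www.example.com/search")

# Type into a hypothetical search field named "q" and submit it
search_box = driver.find_element(By.NAME, "q")
search_box.send_keys("web crawling")
search_box.submit()

# Read something back from the resulting page
print(driver.title)

driver.quit()
Simulating a user interaction with selenium (placeholder selectors)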

In conclusion

To summarize, if you need to crawl a website that doesn't have dynamic content or user interaction, requests is a good choice. But if you need to simulate user actions and interact with a website that has dynamic content, selenium is a better choice.

Example 1: using requests

First, you will need to install the requests library:

pip install requests
Run this in your terminal to install the requests library

Then you can run the following script to crawl every page contained in the urls list.

import requests

# List of URLs to crawl
urls = [
    "https://www.mywebsite.com",
    "https://www.mywebsite.com/page/1",
    "https://www.mywebsite.com/page/2",
    "https://www.mywebsite.com/page/3",
]

# Loop through each URL and make a request
for url in urls:
    response = requests.get(url)

    # Check if the request was successful
    if response.status_code == 200:
        # Print the HTML of the page
        print(response.text)
    else:
        print(f"Error: {response.status_code} - {response.reason}")
Crawling multiple pages using the requests library
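
Since some websites are slow or don't like crawlers, a slightly more defensive version of the same loop adds a timeout, a User-Agent header, and a short pause between requests. This is just a sketch; the header value and the one-second delay are arbitrary choices:

import time
import requests

# List of URLs to crawl
urls = [
    "https://www.mywebsite.com",
    "https://www.mywebsite.com/page/1",
]

# An explicit User-Agent and a timeout make the crawler a bit more robust
headers = {"User-Agent": "my-crawler/0.1"}

for url in urls:
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        print(response.text)
    except requests.RequestException as exc:
        print(f"Error fetching {url}: {exc}")
    time.sleep(1)  # be polite: wait a second between requests
A more defensive variant of the requests crawler (timeout, headers, delay)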

Example 2: using selenium

To use selenium, you will need to install it by running the following command in your terminal:

pip install selenium
Run this in your terminal to install the selenium library

You will also need to install a webdriver so that selenium can simulate user interaction through a web browser.

To install the Chrome webdriver for selenium, follow these steps:

  1. Download the webdriver executable for the browser you want to use. For example, if you want to use Chrome, download the Chrome webdriver from the official website: https://sites.google.com/a/chromium.org/chromedriver/downloads. Make sure to download the version that matches the version of your browser.
  2. Install the webdriver executable. Copy the downloaded executable to a location on your computer where it can be easily accessed, such as your project directory. Make sure to add the location of the webdriver to your system's PATH environment variable.
  3. Install the selenium module. You can install it using pip, which is a package installer for Python. Open a command prompt or terminal and type pip install selenium.

Once you have completed these steps, you can start using selenium with the webdriver you installed. When initializing the webdriver, make sure to specify the location of the webdriver executable, like so:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Specify the location of the Chrome webdriver
# (recent versions of selenium pass the path through a Service object)
driver = webdriver.Chrome(service=Service("path/to/chromedriver"))

Then you will be able to run the following script, which will crawl each URL in the list.

from selenium import webdriver

# List of URLs to crawl
urls = [
    "https://www.example.com",
    "https://www.example.com/page1",
    "https://www.example.com/page2",
    "https://www.example.com/page3",
]

# Start a new Chrome browser window
driver = webdriver.Chrome()

# Loop through each URL and load the page
for url in urls:
    driver.get(url)

    # Print the title of the page
    print(driver.title)

# Close the browser window
driver.quit()
Crawling multiple pages using the selenium library
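
If you also want the full HTML of each page rather than just its title, the rendered source is available through the driver's page_source attribute. A minimal sketch with a placeholder URL:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.example.com")

# page_source holds the full HTML of the rendered page
html = driver.page_source
print(f"{driver.title}: {len(html)} characters of HTML")

driver.quit()
Reading the rendered HTML of a page with selenium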

My webdriver doesn't work, what should I do?

If your webdriver doesn't work, I found an easy workaround using the Firefox webdriver.

All you have to do is install the Firefox web browser.

Then change the script like so:

from selenium import webdriver

# List of URLs to crawl
urls = [
    "https://www.example.com",
    "https://www.example.com/page1",
    "https://www.example.com/page2",
    "https://www.example.com/page3",
]

# Start a new Firefox browser window
driver = webdriver.Firefox()

# Loop through each URL and load the page
for url in urls:
    driver.get(url)

    # Print the title of the page
    print(driver.title)

# Close the browser window
driver.quit()
Crawling multiple pages using the Firefox webdriver