How to crawl multiple web pages using Python
What is a web crawler?
A web crawler, also called a spider or robot, is like a digital explorer that automatically visits websites and collects information.
They're mostly used by search engines like Google to find and organize web pages so they can show you relevant search results.
Web crawlers can also be used for other things like collecting data or checking for changes on websites.
Some websites don't like web crawlers, so they might block them from visiting their site.
The best Python libraries to crawl a website
To crawl a website you can use two different libraries: requests and selenium.
Both are popular Python libraries, but they serve different purposes.
What does the requests library do?
🕸️ requests is like a spider crawling a web: it fetches data from websites, such as HTML pages, images, and videos. It's great for crawling sites that don't have dynamic content or require user interaction.
What does the selenium library do?
🌐 selenium, on the other hand, is like a person browsing the web.
It allows you to automate web browsers, simulate user actions like clicking buttons or filling out forms, and extract data from websites.
It's useful for crawling sites that have dynamic content and require user interaction.
In conclusion
To summarize, if you need to crawl a website that doesn't have dynamic content or user interaction, requests is a good choice. But if you need to simulate user actions and interact with a website that has dynamic content, selenium is a better choice.
Example 1: using requests
First, you will need to install the requests library by running the following command in your terminal:
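pip install requests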
Then you can run the following script to crawl every page contained in the urls list.
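A minimal sketch of such a script might look like this (the urls list below is a placeholder; replace it with the pages you actually want to crawl):

import requests

# Placeholder list of pages to crawl
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
]

for url in urls:
    # Fetch each page and print its status code and the start of its HTML
    response = requests.get(url, timeout=10)
    print(url, response.status_code)
    print(response.text[:200])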
Example 2: using selenium
In order to use selenium you will need to install it by running the following command in your terminal:
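pip install selenium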
You will also need to install a webdriver so that selenium can simulate user interactions through a web browser.
To install a webdriver for selenium (here, the Chrome webdriver, chromedriver), follow these steps:
- Download the webdriver executable for the browser you want to use. For example, if you want to use Chrome, download the Chrome webdriver from the official website: https://sites.google.com/a/chromium.org/chromedriver/downloads. Make sure to download the version that matches the version of your browser.
- Install the webdriver executable. Copy the downloaded executable to a location on your computer where it can be easily accessed, such as your project directory. Make sure to add the location of the webdriver to your system's PATH environment variable.
- Install the selenium module. You can install it using pip, which is a package installer for Python. Open a command prompt or terminal and type pip install selenium.
Once you have completed these steps, you can start using selenium with the webdriver you installed. When initializing the webdriver, make sure to specify the location of the webdriver executable, like so:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Specify the location of the Chrome webdriver via a Service object (selenium 4.x)
service = Service("path/to/chromedriver")
driver = webdriver.Chrome(service=service)
Then you will be able to run the following script, which will crawl each of the urls.
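Here is a minimal sketch of such a script, again using a placeholder urls list and assuming chromedriver is installed as described above:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Placeholder list of pages to crawl
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
]

# Start a Chrome browser controlled by selenium
service = Service("path/to/chromedriver")
driver = webdriver.Chrome(service=service)

for url in urls:
    # Load the page in the real browser and read the rendered HTML
    driver.get(url)
    print(url)
    print(driver.page_source[:200])

# Close the browser when done
driver.quit()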
My webdriver doesn't work, what should I do?
If your webdriver doesn't work, an easy workaround is to use the Firefox webdriver instead.
All you have to do is install the Firefox web browser (available from Mozilla's official site).
Then change the script like so:
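A sketch of the change, assuming a recent selenium version that can locate the Firefox driver (geckodriver) on its own; only the driver initialization differs:

from selenium import webdriver

# With Firefox installed, recent selenium versions download and manage
# the matching geckodriver automatically, so no driver path is needed here
driver = webdriver.Firefox()

The rest of the script (the urls loop and driver.quit()) stays the same.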