Web Scraping: Following Links with Beautiful Soup and Requests
This snippet demonstrates how to use the requests library to fetch web pages and Beautiful Soup to parse HTML content and follow links within a website. It showcases a basic web crawler that visits linked pages and extracts data.
Import Necessary Libraries
This section imports the requests library for making HTTP requests and Beautiful Soup for parsing HTML. BASE_URL stores the root URL of the website we're scraping.
import requests
from bs4 import BeautifulSoup
# Base URL for the website you want to scrape
BASE_URL = 'https://quotes.toscrape.com'
Function to Extract Data from a Page
The extract_data function takes a Beautiful Soup object (representing the parsed HTML of a page) as input. It finds all div elements with the class quote, which contain the quotes on the specified website. For each quote, it extracts the quote text and the author's name, then prints them.
def extract_data(soup):
    quotes = soup.find_all('div', class_='quote')
    for quote in quotes:
        text = quote.find('span', class_='text').get_text()
        author = quote.find('small', class_='author').get_text()
        print(f'Quote: {text}\nAuthor: {author}\n---')
Function to Follow Links
The follow_links function implements the core logic for crawling the website. It takes the URL to scrape and a visited set that tracks already-processed URLs. It first checks whether the URL has already been visited, to avoid infinite loops. It then makes a request to the URL, parses the HTML, extracts the data using extract_data, and finds the 'Next' button. If the 'Next' button exists, it recursively calls follow_links with the URL of the next page. Error handling catches potential exceptions during the process, and response.raise_for_status() ensures the request was successful; otherwise it raises an HTTPError.
def follow_links(url, visited):
    if url in visited:
        return
    print(f'Scraping: {url}')
    visited.add(url)
    try:
        response = requests.get(url)
        response.raise_for_status()  # Raise HTTPError for bad responses (4xx or 5xx)
        soup = BeautifulSoup(response.content, 'html.parser')
        extract_data(soup)
        # Find the 'Next' button and follow the link
        next_page = soup.find('li', class_='next')
        if next_page:
            next_url = BASE_URL + next_page.find('a')['href']
            follow_links(next_url, visited)
    except requests.exceptions.RequestException as e:
        print(f'Error fetching {url}: {e}')
    except Exception as e:
        print(f'Error processing {url}: {e}')
Main Execution
This section initiates the web crawling process. It creates an empty set, visited_urls, to keep track of visited URLs, then calls the follow_links function with the base URL and that set to start crawling.
if __name__ == '__main__':
    visited_urls = set()
    follow_links(BASE_URL, visited_urls)
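Because follow_links calls itself once per page, a very deep chain of pages could in principle hit Python's recursion limit. Below is a minimal iterative sketch of the same crawl that uses an explicit queue instead of recursion; the name crawl_iterative is ours, and it reuses extract_data from above.
from collections import deque

import requests
from bs4 import BeautifulSoup

def crawl_iterative(start_url):
    """Same crawl as follow_links, but with an explicit queue instead of recursion."""
    visited = set()
    to_visit = deque([start_url])
    while to_visit:
        url = to_visit.popleft()
        if url in visited:
            continue
        visited.add(url)
        print(f'Scraping: {url}')
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')
        extract_data(soup)  # reuse the helper defined earlier in this snippet
        next_page = soup.find('li', class_='next')
        if next_page:
            to_visit.append(BASE_URL + next_page.find('a')['href'])

# Usage:
# crawl_iterative(BASE_URL)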
Concepts Behind the Snippet
Web Scraping: Automating the process of extracting data from websites.
HTML Parsing: Analyzing and structuring HTML code to identify relevant elements.
HTTP Requests: Sending requests to web servers to retrieve web pages.
Link Following: Identifying and navigating hyperlinks to explore multiple pages within a website.
Real-Life Use Case
E-commerce Price Monitoring: Scrape product prices from different online retailers and track price changes.
News Aggregation: Collect articles from various news websites and create a centralized news feed.
Research and Data Analysis: Gather data from online sources for academic or market research.
Best Practices
Respect Website's Terms of Service: Always review and adhere to the website's terms of service and robots.txt file.
Implement Rate Limiting: Avoid overloading the server by adding delays between requests.
Handle Errors Gracefully: Implement error handling to catch and manage potential exceptions.
Use a User-Agent: Set a User-Agent header to identify your scraper to the website (see the sketch after this list).
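A minimal sketch that combines these practices, assuming the same quotes site as above; the User-Agent string, the CRAWL_DELAY value, and the polite_get helper are our own illustrative choices, not requirements of any particular site.
import time
from urllib.robotparser import RobotFileParser

import requests

BASE_URL = 'https://quotes.toscrape.com'
USER_AGENT = 'my-learning-scraper/0.1 (contact: you@example.com)'  # identify your scraper
CRAWL_DELAY = 1.0  # seconds between requests; adjust to the site's tolerance

# Consult robots.txt before crawling (a missing robots.txt allows everything)
robots = RobotFileParser()
robots.set_url(BASE_URL + '/robots.txt')
robots.read()

def polite_get(url):
    """Fetch a URL only if robots.txt allows it, with a delay and a User-Agent header."""
    if not robots.can_fetch(USER_AGENT, url):
        raise PermissionError(f'robots.txt disallows fetching {url}')
    time.sleep(CRAWL_DELAY)  # simple rate limiting
    response = requests.get(url, headers={'User-Agent': USER_AGENT}, timeout=10)
    response.raise_for_status()
    return response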
Interview Tip
Be prepared to discuss the ethical considerations of web scraping, such as respecting a website's terms of service and avoiding excessive requests. Also, be ready to explain how to handle dynamic content and anti-scraping measures.
When to Use Web Scraping
Use web scraping when you need to extract data from websites that don't provide an API, or when the available API doesn't provide the specific data you need. It's useful for data aggregation, price monitoring, research, and other tasks that involve gathering information from the web.
Memory Footprint
The memory footprint of this script depends on the size of the web pages being scraped and the amount of data being stored. Using generators and processing data in chunks can help reduce memory consumption, especially when dealing with large websites.
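For example, a generator-based variant of extract_data (a sketch, not part of the original snippet) yields one quote at a time instead of printing, so the caller can process results incrementally rather than holding everything in memory.
def iter_quotes(soup):
    """Yield (text, author) pairs one at a time instead of collecting them in a list."""
    for quote in soup.find_all('div', class_='quote'):
        text = quote.find('span', class_='text').get_text()
        author = quote.find('small', class_='author').get_text()
        yield text, author

# Usage: process each quote as it is produced
# for text, author in iter_quotes(soup):
#     print(f'{author}: {text}')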
Alternatives
Web APIs: If the website provides an API, using it is generally the preferred method (see the short example after this list).
Data Feeds (RSS/Atom): Some websites offer data feeds that can be used to retrieve content updates.
Commercial Web Scraping Services: Services like Scrapy Cloud or Diffbot provide managed web scraping solutions.
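For comparison, consuming a JSON API with requests skips HTML parsing entirely; the endpoint and parameters below are hypothetical and only illustrate the pattern.
import requests

# Hypothetical endpoint -- substitute the API actually documented by the site you need
response = requests.get('https://api.example.com/v1/quotes', params={'page': 1}, timeout=10)
response.raise_for_status()
data = response.json()  # already-structured data, no HTML parsing required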
Pros
Automation: Automates the process of data extraction, saving time and effort.
Customization: Allows you to extract specific data points tailored to your needs.
Flexibility: Can be adapted to scrape data from various websites.
Cons
Fragility: Web scraping scripts can break if the website's structure changes.
Ethical Concerns: Can be considered unethical if done without permission or if it violates the website's terms of service.
Complexity: Requires technical knowledge of HTML, CSS, and web technologies.
FAQ
How do I handle websites that use JavaScript to render content?
You can use a headless browser, such as Selenium or Playwright from Python (Puppeteer is the Node.js equivalent), to execute the JavaScript and render the page before parsing the resulting HTML with Beautiful Soup (see the sketch after this FAQ).
How do I avoid getting blocked by websites when scraping?
Implement rate limiting, use rotating proxies, set a User-Agent header, and respect the website's robots.txt file.
What is robots.txt?
A text file at the root of a website that tells web crawlers which parts of the site should not be accessed.
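A minimal sketch of the headless-browser approach from the first question, assuming Selenium 4 (which manages the browser driver automatically) with headless Chrome, and using the JavaScript-rendered variant of the quotes site at /js/; it reuses extract_data from earlier in this snippet.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

options = Options()
options.add_argument('--headless=new')  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)
try:
    driver.get('https://quotes.toscrape.com/js/')  # page whose quotes are rendered by JavaScript
    # Wait until the JavaScript has inserted at least one quote into the DOM
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'quote'))
    )
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    extract_data(soup)  # reuse the function defined earlier in this snippet
finally:
    driver.quit()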