Day 26 of 50 Days of Python: APIs and Web Scraping Overview
Part of Week 4: Python for Data Engineering
Welcome back for Day 26! We are officially over the halfway hump of the 50 Days of Python series. I hope you're all enjoying it so far!
Today we will be covering APIs and web scraping using the requests package and Beautiful Soup. Most APIs can be called directly with requests. However, organisations sometimes develop their own package/library when they want developers to easily pull data from their application; Spotify's Web API, for example, has the popular community-maintained spotipy wrapper (a sketch follows below).
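As a quick illustration of such a wrapper, here is a minimal sketch of pulling data through spotipy. It assumes you have already registered a Spotify developer app; the client ID, secret, and search query below are placeholders, not values from this series.
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Authenticate with client credentials (placeholders - substitute your own app's values)
auth_manager = SpotifyClientCredentials(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
)
sp = spotipy.Spotify(auth_manager=auth_manager)

# Search for tracks and print the first few results
results = sp.search(q="daft punk", type="track", limit=5)
for item in results["tracks"]["items"]:
    print(item["name"], "-", item["artists"][0]["name"])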
With that, let’s jump into it!
Working with APIs (requests package)
Firstly, let's cover what API actually stands for: Application Programming Interface. APIs allow us to retrieve data from web services using HTTP requests, and a common format for API responses is JSON.
Now that's out of the way, let's step away from the Spotify business we've been covering and utilise a free API that doesn't require any complicated setup:
import requests
# API endpoint
url = "https://api.open-meteo.com/v1/forecast?latitude=52.52&longitude=13.4050&daily=temperature_2m_max&timezone=Europe%2FBerlin"
response = requests.get(url)
if response.status_code == 200:
    data = response.json()
    print(data["daily"]["temperature_2m_max"])
else:
    print(f"Failed to fetch data: {response.status_code}")
Breaking down the key components of the code block above:
→ requests.get(url): Sends an HTTP GET request to the API.
→ response.json(): Parses the JSON response into a Python dictionary.
→ Status code check: Confirms the response code is 200 (success) before attempting to extract data.
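If you prefer exceptions over manual status checks, requests also provides raise_for_status(). A short sketch of the same request written that way (the 10-second timeout is just an illustrative choice):
import requests

url = "https://api.open-meteo.com/v1/forecast?latitude=52.52&longitude=13.4050&daily=temperature_2m_max&timezone=Europe%2FBerlin"
try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # Raises requests.HTTPError for 4xx/5xx responses
    data = response.json()
    print(data["daily"]["temperature_2m_max"])
except requests.RequestException as exc:
    print(f"Failed to fetch data: {exc}")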
Each API has specific documentation, so always read up on it to see which parameters you can use. Example parameters from the Open-Meteo URL used above (see the sketch after this list) are:
longitude
latitude
timezone
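Rather than hand-building the query string, you can pass these parameters as a dictionary and let requests handle the URL encoding. A minimal sketch using the same Open-Meteo endpoint, with the same values as the example above:
import requests

base_url = "https://api.open-meteo.com/v1/forecast"
params = {
    "latitude": 52.52,
    "longitude": 13.4050,
    "daily": "temperature_2m_max",
    "timezone": "Europe/Berlin",
}
# requests URL-encodes the params dict into the query string for us
response = requests.get(base_url, params=params, timeout=10)
if response.status_code == 200:
    print(response.json()["daily"]["temperature_2m_max"])
else:
    print(f"Failed to fetch data: {response.status_code}")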
Web Scraping with Beautiful Soup
In cases where you can't access an API, or you want to extract data from a particular web page, Beautiful Soup can make this easy.
import requests
from bs4 import BeautifulSoup
# Target URL
url = "https://news.ycombinator.com/"
# Get the content
response = requests.get(url)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, "html.parser")
    # Extract article titles (note: class names depend on the site's current markup and can change)
    titles = soup.find_all("a", class_="titlelink")
    for i, title in enumerate(titles[:5], 1):  # Print first 5 titles
        print(f"{i}. {title.text}")
else:
    print("Failed to retrieve webpage")
Again, let's break this down so it makes more sense:
→ requests.get(url): Fetches the web page content
→ BeautifulSoup(response.text, "html.parser"): Parses HTML into a structured format.
→ soup.find_all("a", class_="titlelink"): Extracts all links with the class titlelink.
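Beyond find_all, Beautiful Soup has a few other lookups worth knowing. Here is a quick, self-contained sketch of some of them; the tag names and classes used are illustrative and depend on the page's actual markup:
import requests
from bs4 import BeautifulSoup

response = requests.get("https://news.ycombinator.com/", timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

first_link = soup.find("a")                 # First matching tag, or None if nothing matches
if first_link is not None:
    print(first_link.get("href"))           # Read an attribute safely
    print(first_link.get_text(strip=True))  # Visible text with surrounding whitespace removed

# CSS selectors are also supported via select()
for row in soup.select("tr.athing")[:5]:    # "athing" is illustrative; check the page's real markup
    print(row.get("id"))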
One thing I should mention is that websites usually have a robots.txt file, which can be accessed by taking the full URL of the homepage and adding "/robots.txt" to the end. This covers the website's rules around what you can and can't scrape.
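If you want to check those rules programmatically before scraping, Python's standard library ships urllib.robotparser. A small sketch of how that check might look for the Hacker News site used above:
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://news.ycombinator.com/robots.txt")
rp.read()  # Downloads and parses the robots.txt file

# can_fetch(user_agent, url) returns True if that agent is allowed to fetch the URL
print(rp.can_fetch("*", "https://news.ycombinator.com/"))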
Next Up: Day 27 - File System Operations and Cloud Storage Integrations (Azure Specific)
Day 27 will go into detail on the os package, which allows us to interact with the file system. Plus, we'll learn how we can interface directly with cloud storage solutions, with Azure as the example.
So as always, join me for the next one and happy coding!