Day 27 of 50 Days of Python: File System Operations and Azure Storage Integration
Part of Week 4: Python for Data Engineering
Welcome back for Day 27! Today, we will cover the os and pathlib packages for working with the file system. Then we’ll learn how to integrate cloud solutions into our code, focusing on Azure Blob Storage. Managing files efficiently is a key aspect of data engineering, whether that happens locally or in the cloud.
Working With the File System
Python’s os and pathlib packages are great for working with the file system, providing the ability to read, create and manage files and directories. Let’s have a look at some basic file and directory handling using these packages:
import os
from pathlib import Path
# Creating a new directory
os.makedirs("test_directory", exist_ok=True)
# Create then write to a file
file_path = Path("test_directory/sample.txt")
with open(file_path, "w") as file:
file.write("Hello World!")
# Reading the file
with open(file_path, "r") as fle:
print("File contents:", file.read())
# Checking that a file exists
if file_path.exists():
print(f"{file_path} exists")
# Delete the file and directory
file_path.unlink()
os.rmdir("test_directory")
print("Cleanup Complete...")
Azure Blob Storage Integration
Microsoft Azure is a great platform for data engineering and data professionals to use as an alternative to AWS or Google Cloud. If you end up using Microsoft Azure you’ll become familiar with Azure Blob Storage, where you’ll land data from various sources, host your data lake and handle logging from services like Azure Databricks. We’ll be using the azure-storage-blob library to interact with this service.
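If you don’t have the library installed yet, you can grab it from PyPI first:
pip install azure-storage-blob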
from azure.storage.blob import BlobServiceClient

# Azure Storage account connection string
connection_string = "<YOUR_AZURE_STORAGE_CONNECTION_STRING>"
container_name = "my-container"
blob_name = "sample.txt"
local_file_path = "sample.txt"
def upload_to_blob():
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)
    container_client = blob_service_client.get_container_client(container_name)
    with open(local_file_path, "rb") as file:
        container_client.upload_blob(blob_name, file, overwrite=True)
    print("File uploaded successfully.")

def download_from_blob():
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)
    container_client = blob_service_client.get_container_client(container_name)
    with open("downloaded_sample.txt", "wb") as file:
        file.write(container_client.download_blob(blob_name).readall())
    print("File downloaded successfully.")
# Run functions
upload_to_blob()
download_from_blob()
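The functions above assume the container and blob already exist. As a minimal sketch (not part of the original walkthrough), you can catch the ResourceNotFoundError raised by azure-core when a blob is missing, so a failed download doesn’t crash your script:
from azure.core.exceptions import ResourceNotFoundError

def safe_download(name, destination):
    # Download a blob, handling the case where it doesn't exist
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)
    container_client = blob_service_client.get_container_client(container_name)
    try:
        data = container_client.download_blob(name).readall()
    except ResourceNotFoundError:
        print(f"{name} was not found in {container_name}.")
        return
    with open(destination, "wb") as file:
        file.write(data)
    print(f"{name} downloaded to {destination}.")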
Listing All Blobs in a Container
def list_blobs():
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)
    container_client = blob_service_client.get_container_client(container_name)
    blobs = container_client.list_blobs()
    for blob in blobs:
        print(blob.name)
list_blobs()
Deleting a Blob
def delete_blob():
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)
    container_client = blob_service_client.get_container_client(container_name)
    container_client.delete_blob(blob_name)
    print(f"🗑️ {blob_name} deleted successfully.")
delete_blob()
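To tie both halves of today’s lesson together, here’s a small sketch that walks a local folder with pathlib and uploads every file it finds to the container. The exports folder name and the choice to overwrite existing blobs are assumptions for illustration, not part of the examples above:
from pathlib import Path

def upload_directory(local_dir):
    # Walk the local directory and upload each file to the container
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)
    container_client = blob_service_client.get_container_client(container_name)
    for path in Path(local_dir).glob("**/*"):
        if path.is_file():
            # Use the path relative to the folder as the blob name
            name = path.relative_to(local_dir).as_posix()
            with open(path, "rb") as file:
                container_client.upload_blob(name, file, overwrite=True)
            print(f"Uploaded {name}")

upload_directory("exports")  # "exports" is a placeholder folder name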
Next Up: Day 28 - Introduction to PySpark for Big Data Processing.
Day 28 is our final day of the Data Engineering week, and what better way to round it off than with an introduction to PySpark for use with Big Data. PySpark is a huge tool in the Data Engineering space and the primary language used in Azure Databricks. So keep your eyes peeled for this one!
See you next time, and happy coding!