GitHub Actions for Automated Web Scraping: Complete Guide
Learn to build automated web scraping pipelines with GitHub Actions. Schedule scrapers, store data, handle errors, and deploy production scraping workflows.
Web scraping is a powerful technique for extracting data from websites. When combined with GitHub Actions, you can build fully automated scraping pipelines that run on a schedule, store data, and trigger notifications without any manual intervention or server costs.
This guide covers everything you need to know about building production-grade web scraping workflows with GitHub Actions, from basic setup to advanced error handling and data storage.
Why GitHub Actions for Web Scraping?
Advantages
- Free tier: 2,000 minutes per month for private repositories on free accounts; standard runners are free for public repositories
- No server management: GitHub handles infrastructure
- Built-in scheduling: Cron-like scheduling with on.schedule
- Version control: Your scraping code is tracked with Git
- Secrets management: Secure storage for API keys and credentials
- Integration: Easy integration with GitHub APIs, notifications, and storage
Limitations
- Timeout: Maximum 6 hours per job
- No persistent storage: Need external storage for scraped data
- IP restrictions: GitHub Actions runners use shared IP ranges that some sites block
- Memory: Limited to 7 GB RAM on standard runners
- No browser GUI: Headless browsers only (no display)
Setting Up Your First Scraping Workflow
Project Structure
web-scraper/
├── .github/
│   └── workflows/
│       └── scrape.yml
├── scraper.py
├── requirements.txt
└── README.md
Basic Workflow
# .github/workflows/scrape.yml
name: Web Scraper
on:
  schedule:
    - cron: "0 6 * * *" # Runs daily at 6:00 AM UTC
  workflow_dispatch: # Allows manual trigger
jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: Run scraper
        run: python scraper.py
        env:
          API_KEY: ${{ secrets.API_KEY }}
Cron Schedule Syntax
┌───────────── minute (0 - 59)
│ ┌───────────── hour (0 - 23)
│ │ ┌───────────── day of month (1 - 31)
│ │ │ ┌───────────── month (1 - 12)
│ │ │ │ ┌───────────── day of week (0 - 6)
│ │ │ │ │
* * * * *
Common schedules:
| Schedule | Cron | Description |
|---|---|---|
| Every hour | 0 * * * * | Runs at minute 0 of every hour |
| Daily at 6 AM UTC | 0 6 * * * | Runs once daily |
| Every 6 hours | 0 */6 * * * | Runs 4 times daily |
| Weekdays at 9 AM | 0 9 * * 1-5 | Runs Monday-Friday |
| First of month | 0 6 1 * * | Runs monthly |
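Note: GitHub Actions evaluates cron schedules in UTC, so convert your local run time to UTC when writing the expression. Scheduled runs can also start late during periods of high load.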
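Converting a local run time to UTC is simple arithmetic, but it is easy to get wrong for half-hour offsets. The sketch below shows one way to do it for a fixed UTC offset. The function name `daily_cron_in_utc` is illustrative, not part of any library, and it deliberately ignores daylight saving time, so treat it as a starting point only.

```python
from datetime import datetime, timedelta, timezone


def daily_cron_in_utc(local_hour, local_minute, utc_offset_hours):
    """Return a daily cron expression in UTC for a local wall-clock time.

    utc_offset_hours is the fixed offset of the local zone, e.g. 5.5 for
    UTC+5:30 or -5 for UTC-5. Daylight saving time is not handled.
    """
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    # The date is arbitrary; only the time of day matters for a daily schedule
    local_time = datetime(2024, 1, 1, local_hour, local_minute, tzinfo=local_tz)
    utc_time = local_time.astimezone(timezone.utc)
    return f"{utc_time.minute} {utc_time.hour} * * *"


print(daily_cron_in_utc(9, 0, 5.5))  # 9:00 AM at UTC+5:30 -> "30 3 * * *"
```

Because the day-of-month and day-of-week fields are left as `*`, a conversion that crosses midnight still produces a valid daily schedule.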
Building the Scraper
Using BeautifulSoup (Static Pages)
# scraper.py
import json
from datetime import datetime, timezone

import requests
from bs4 import BeautifulSoup


def scrape_prices():
    url = "https://example.com/prices"
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; PriceScraper/1.0)"
    }
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()  # Fail fast on 4xx/5xx responses
    soup = BeautifulSoup(response.content, "html.parser")
    # One timestamp per run (datetime.utcnow() is deprecated as of Python 3.12)
    scraped_at = datetime.now(timezone.utc).isoformat()
    items = []
    for item in soup.select(".product"):
        name = item.select_one(".name")
        price = item.select_one(".price")
        if name is None or price is None:
            continue  # Skip products that are missing either field
        items.append({
            "name": name.get_text(strip=True),
            "price": price.get_text(strip=True),
            "scraped_at": scraped_at,
        })
    return items


def save_to_json(data, filename="output.json"):
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    data = scrape_prices()
    save_to_json(data)
    print(f"Scraped {len(data)} items")
Using Playwright (JavaScript-Rendered Pages)
# scraper.py
import json
from datetime import datetime, timezone

from playwright.sync_api import sync_playwright


def scrape_dynamic_content():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Wait until network activity settles so client-side content is rendered
        page.goto("https://example.com/dynamic", wait_until="networkidle")
        # One timestamp per run (datetime.utcnow() is deprecated as of Python 3.12)
        scraped_at = datetime.now(timezone.utc).isoformat()
        items = []
        for element in page.query_selector_all(".data-row"):
            items.append({
                "title": element.query_selector(".title").inner_text(),
                "value": element.query_selector(".value").inner_text(),
                "scraped_at": scraped_at,
            })
        browser.close()
    return items


if __name__ == "__main__":
    data = scrape_dynamic_content()
    with open("output.json", "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2)
    print(f"Scraped {len(data)} items")
Workflow update for Playwright:
- name: Install Playwright
  run: |
    pip install playwright
    # --with-deps also installs the system libraries Chromium needs on the runner
    playwright install --with-deps chromium
- name: Run scraper
  run: python scraper.py
Storing Scraped Data
Option 1: Commit to Repository
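# The job needs "permissions: contents: write" for the push to succeed
- name: Commit and push results
  run: |
    git config user.name "github-actions[bot]"
    git config user.email "github-actions[bot]@users.noreply.github.com"
    git add output.json
    # Commit only when the staged file differs from the last commit
    git diff --staged --quiet || git commit -m "Update scraped data $(date)"
    git push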
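One caveat with committing results: the scrapers in this guide stamp every item with `scraped_at`, so output.json differs on every run and the `git diff --staged --quiet` guard ends up committing each time. A minimal sketch of one way around this is to compare the new data with the previous file while ignoring per-run fields, and only rewrite the file when something meaningful changed. The names `VOLATILE_KEYS`, `strip_volatile`, and `write_if_changed` are hypothetical helpers introduced here for illustration.

```python
import json
from pathlib import Path

VOLATILE_KEYS = {"scraped_at"}  # Assumed set of fields that change on every run


def strip_volatile(items):
    """Return copies of the items without per-run fields such as timestamps."""
    return [
        {key: value for key, value in item.items() if key not in VOLATILE_KEYS}
        for item in items
    ]


def write_if_changed(items, path="output.json"):
    """Write items to path only when the meaningful content differs.

    Returns True if the file was written, False if it was left untouched.
    """
    target = Path(path)
    if target.exists():
        previous = json.loads(target.read_text(encoding="utf-8"))
        if strip_volatile(previous) == strip_volatile(items):
            return False
    target.write_text(
        json.dumps(items, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return True
```

Calling `write_if_changed(data)` in place of a plain write leaves the file byte-for-byte identical when only timestamps moved, so the existing commit guard then skips the commit.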
Option 2: GitHub Gist
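- name: Update Gist
  run: |
    # Pasting the file into a JSON string breaks on quotes and newlines,
    # so let jq read output.json as one raw string and escape it
    jq -Rs '{files: {"data.json": {content: .}}}' output.json |
      curl -X PATCH \
        -H "Authorization: token ${{ secrets.GIST_TOKEN }}" \
        -H "Accept: application/vnd.github.v3+json" \
        "https://api.github.com/gists/${{ secrets.GIST_ID }}" \
        --data-binary @-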
Option 3: Google Sheets
import gspread


def save_to_sheets(data):
    # gspread's built-in service account helper replaces the deprecated
    # oauth2client flow; credentials.json is the service account key file
    client = gspread.service_account(filename="credentials.json")
    sheet = client.open("Scraped Data").sheet1
    sheet.clear()
    rows = [["Name", "Price", "Scraped At"]]
    for item in data:
        rows.append([item["name"], item["price"], item["scraped_at"]])
    # One batched call instead of one API request per row
    sheet.append_rows(rows)
Option 4: Database (Supabase, Firebase, etc.)
import os

from supabase import create_client


def save_to_supabase(data):
    supabase = create_client(
        os.environ["SUPABASE_URL"],
        os.environ["SUPABASE_KEY"]
    )
    supabase.table("scraped_data").insert(data).execute()
Error Handling and Reliability
Retry Logic
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def get_session():
    session = requests.Session()
    # Retry up to 3 times with increasing delays on rate limits and server errors
    retry = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504]
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


def scrape_with_retry():
    session = get_session()
    response = session.get("https://example.com", timeout=30)
    response.raise_for_status()
    return response.content
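The session above retries at the HTTP level only. Other steps, such as parsing a half-loaded page or a browser navigation that times out, can fail for transient reasons too. As a rough sketch, a small generic helper can retry any callable with exponential backoff; `retry_call` and its parameters are hypothetical names introduced here, not part of requests or urllib3.

```python
import time


def retry_call(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() and retry on any exception with exponential backoff.

    Waits base_delay, then 2 * base_delay, then 4 * base_delay, and so on
    between attempts. The last exception is re-raised when attempts run out.
    The sleep parameter is injectable so the helper can be tested quickly.
    """
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))
```

For example, `retry_call(scrape_with_retry)` would re-run the whole fetch if it raises. In a real scraper you would probably narrow the `except` clause to the exceptions you consider transient.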
Notification on Failure
- name: Notify on failure
  if: failure()
  run: |
    curl -X POST "${{ secrets.SLACK_WEBHOOK }}" \
      -H 'Content-type: application/json' \
      -d '{"text": "Scraper failed! Check: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"}'
Health Check Output
import json
from datetime import datetime, timezone


def create_health_report(success, items_count=0, error=""):
    report = {
        "status": "success" if success else "failure",
        # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "items_scraped": items_count,
        "error": error
    }
    with open("health.json", "w", encoding="utf-8") as f:
        json.dump(report, f, indent=2)
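A health report is most useful when it is written on both the success and the failure path, and when a failure still makes the step exit non-zero so that `if: failure()` steps run. The sketch below wraps a scrape function along those lines. `run_with_health_report` is a hypothetical helper, and the report fields simply mirror the ones used above.

```python
import json
from datetime import datetime, timezone


def run_with_health_report(scrape, report_path="health.json"):
    """Run scrape(), write a health report, and return a process exit code.

    scrape is any callable returning a list of items. The return value is
    0 on success and 1 on failure, suitable for passing to sys.exit().
    """
    report = {"status": "success", "items_scraped": 0, "error": ""}
    try:
        items = scrape()
        report["items_scraped"] = len(items)
    except Exception as exc:
        report["status"] = "failure"
        report["error"] = f"{type(exc).__name__}: {exc}"
    report["timestamp"] = datetime.now(timezone.utc).isoformat()
    with open(report_path, "w", encoding="utf-8") as f:
        json.dump(report, f, indent=2)
    return 0 if report["status"] == "success" else 1
```

In scraper.py this could be wired up as `sys.exit(run_with_health_report(scrape_prices))`, so the workflow step fails whenever the scrape raises.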
Handling Anti-Scraping Measures
Rotating User Agents
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

# Pick a different User-Agent for each run
headers = {
    "User-Agent": random.choice(USER_AGENTS),
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
}
Rate Limiting
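import time

import requests


def scrape_with_delay(urls):
    # `headers` is the dict from the previous snippet; `parse` stands for
    # your own function that extracts data from a response
    results = []
    for url in urls:
        response = requests.get(url, headers=headers, timeout=30)
        results.append(parse(response))
        time.sleep(2)  # 2 second delay between requests
    return results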
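A fixed `time.sleep(2)` adds two seconds on top of however long each request took. If the goal is a minimum spacing between requests, one option is a small limiter that sleeps only for the remaining time. This is a sketch under that assumption; `MinIntervalLimiter` is a hypothetical class, and the clock and sleep functions are injectable purely to make it testable.

```python
import time


class MinIntervalLimiter:
    """Ensure at least `interval` seconds pass between calls to wait()."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # Time of the previous call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                # Sleep only for the part of the interval not yet elapsed
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Usage would be to create `limiter = MinIntervalLimiter(2.0)` once and call `limiter.wait()` before each request in the loop.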
Proxy Usage
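- name: Run scraper with proxy
  run: python scraper.py
  env:
    PROXY_URL: ${{ secrets.PROXY_URL }}

Read the proxy URL in the scraper:

import os

import requests

proxy_url = os.environ.get("PROXY_URL")
# Route through the proxy only when PROXY_URL is set
proxies = {"http": proxy_url, "https": proxy_url} if proxy_url else None
response = requests.get("https://example.com", proxies=proxies, timeout=30)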
Complete Production Workflow
name: Production Web Scraper
on:
  schedule:
    - cron: "0 6 * * *"
  workflow_dispatch:
env:
  PYTHON_VERSION: "3.12"
jobs:
  scrape:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    permissions:
      contents: write # Lets the job push the data commit
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
          cache: "pip"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run scraper
        run: python scraper.py
        env:
          API_KEY: ${{ secrets.API_KEY }}
          DATABASE_URL: ${{ secrets.DATABASE_URL }}
      - name: Upload output as artifact
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: scraped-data
          path: output.json
          retention-days: 30
      - name: Commit data if changed
        if: success()
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          # Stage the same file the artifact step uploads
          git add output.json
          git diff --staged --quiet || git commit -m "chore: update scraped data"
          git push
      - name: Notify on failure
        if: failure()
        run: |
          curl -X POST "${{ secrets.NOTIFICATION_WEBHOOK }}" \
            -H 'Content-type: application/json' \
            -d '{"text": "Scraper failed: ${{ github.run_id }}"}'
Data Processing and Transformation
Raw scraped data often needs cleaning and transformation before it is useful.
Data Cleaning
import pandas as pd


def clean_scraped_data(data):
    # Assumes each record has "url", "title", and "price" keys
    df = pd.DataFrame(data)
    # Remove duplicates
    df = df.drop_duplicates(subset=["url"])
    # Clean text fields
    df["title"] = df["title"].str.strip()
    # Strip currency symbols and separators; unparseable values become NaN
    # instead of raising, so they are dropped in the next step
    df["price"] = pd.to_numeric(
        df["price"].str.replace("[^0-9.]", "", regex=True), errors="coerce"
    )
    # Handle missing values
    df = df.dropna(subset=["title", "price"])
    # Filter by criteria
    df = df[df["price"] > 0]
    return df.to_dict("records")
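One thing to watch with the character-stripping approach: a label such as "Rs. 1,299" keeps the dot from "Rs." and collapses to ".1299", which parses as 0.1299. A slightly more defensive sketch extracts the first number-shaped run instead. `parse_price` is a hypothetical helper, and it assumes commas are thousands separators and the dot is the decimal mark, which does not hold for every locale.

```python
import re

# A digit, then digits or commas, then an optional decimal part
_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_price(text):
    """Extract the first number from a price string, or None if there is none."""
    match = _NUMBER.search(text)
    if match is None:
        return None
    # Drop thousands separators before converting
    return float(match.group().replace(",", ""))


print(parse_price("Rs. 1,299.00"))  # 1299.0
print(parse_price("Out of stock"))  # None
```

In the pandas pipeline this could be applied with `df["price"].map(parse_price)` before dropping missing values.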
Data Enrichment
from datetime import datetime, timezone


def enrich_data(data):
    for item in data:
        # Add derived fields (assumes "price" is already numeric)
        item["price_category"] = (
            "low" if item["price"] < 1000
            else "medium" if item["price"] < 5000
            else "high"
        )
        item["processed_at"] = datetime.now(timezone.utc).isoformat()
    return data
Monitoring and Alerting
Slack Notifications
# Assumes an earlier step with "id: count" that sets an output named "items"
- name: Notify success
  if: success()
  run: |
    curl -X POST "${{ secrets.SLACK_WEBHOOK }}" \
      -H 'Content-type: application/json' \
      -d '{"text": "Scraper completed successfully. ${{ steps.count.outputs.items }} items scraped."}'
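The `steps.count.outputs.items` expression only resolves if a step with `id: count` has written an output named `items`. GitHub Actions reads step outputs from the file whose path is in the `GITHUB_OUTPUT` environment variable, one `name=value` line per output. A minimal sketch of writing that from Python follows; `set_step_output` is a hypothetical helper name.

```python
import os


def set_step_output(name, value):
    """Append name=value to the file GitHub Actions reads step outputs from.

    Returns False without writing when GITHUB_OUTPUT is not set, for example
    when the script runs on a local machine.
    """
    output_path = os.environ.get("GITHUB_OUTPUT")
    if not output_path:
        return False
    with open(output_path, "a", encoding="utf-8") as f:
        f.write(f"{name}={value}\n")
    return True
```

Calling `set_step_output("items", len(data))` at the end of the scraper, in a step declared with `id: count`, would make the count available to later steps. This single-line format is only suitable for values without newlines.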
Email Notifications
import os
import smtplib
from email.mime.text import MIMEText


def send_alert(subject, body, to_email):
    msg = MIMEText(body)
    msg["Subject"] = subject
    msg["From"] = "scraper@example.com"
    msg["To"] = to_email
    with smtplib.SMTP("smtp.gmail.com", 587) as server:
        server.starttls()
        # SMTP_USER and SMTP_PASSWORD are assumed names; pass them to the
        # step from GitHub Secrets instead of hardcoding credentials
        server.login(os.environ["SMTP_USER"], os.environ["SMTP_PASSWORD"])
        server.send_message(msg)
Dashboard Integration
Store scraped data in a database and build a simple dashboard to visualize trends over time. Suitable tools include:
- Grafana: For time-series visualization
- Metabase: For SQL-based dashboards
- Looker Studio (formerly Google Data Studio): For free reporting
- Custom HTML dashboard: Serve static files from your repository
Optimizing GitHub Actions Costs
Reduce Runtime
- Cache pip dependencies to speed up installation
- Use minimal Docker images
- Avoid unnecessary steps in workflow
Use Self-Hosted Runners
For heavy scraping workloads, self-hosted runners on a VPS can be more cost-effective than GitHub-provided runners.
jobs:
  scrape:
    runs-on: self-hosted
    steps:
      # Same steps as before
Parallel Scraping
For large-scale scraping, split the workload across multiple jobs:
jobs:
  scrape-part-1:
    runs-on: ubuntu-latest
    steps:
      - run: python scraper.py --range 1-1000
  scrape-part-2:
    runs-on: ubuntu-latest
    steps:
      - run: python scraper.py --range 1001-2000
  scrape-part-3:
    runs-on: ubuntu-latest
    steps:
      - run: python scraper.py --range 2001-3000
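For the jobs above to work, scraper.py has to understand the `--range` argument. The guide does not show that part, so the following is only a sketch of how it might be parsed with argparse. `parse_range` and `build_parser` are hypothetical names, and treating both ends of the range as inclusive is an assumption based on the 1-1000, 1001-2000 split.

```python
import argparse


def parse_range(text):
    """Turn "1001-2000" into range(1001, 2001), inclusive of both ends."""
    start_text, sep, end_text = text.partition("-")
    if not sep:
        raise argparse.ArgumentTypeError(f"expected START-END, got {text!r}")
    try:
        start, end = int(start_text), int(end_text)
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected integers in {text!r}")
    if start > end:
        raise argparse.ArgumentTypeError(f"start is greater than end in {text!r}")
    return range(start, end + 1)


def build_parser():
    parser = argparse.ArgumentParser(description="Scrape a slice of item IDs")
    parser.add_argument(
        "--range",
        dest="id_range",  # Avoid shadowing the built-in name range
        type=parse_range,
        required=True,
        help="inclusive ID range, for example 1-1000",
    )
    return parser
```

In the script, `args = build_parser().parse_args()` followed by `for item_id in args.id_range:` would then process only that job's slice.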
Best Practices
Do
- Respect robots.txt and website terms of service
- Add delays between requests to avoid overwhelming servers
- Use descriptive User-Agent strings identifying your bot
- Implement proper error handling and retry logic
- Store secrets in GitHub Secrets, never in code
- Monitor scraper health with notifications
- Cache responses when possible to reduce load
- Use structured data formats (JSON, CSV)
Do Not
- Scrape personal or sensitive data without consent
- Ignore rate limits or terms of service
- Hardcode credentials or API keys
- Scrape at excessive frequency
- Ignore website server load
- Store scraped data in plain text in public repositories
Legal and Ethical Considerations
Before scraping any website:
- Check robots.txt: https://example.com/robots.txt
- Review Terms of Service: Many sites explicitly prohibit scraping
- Respect rate limits: Do not overwhelm servers
- Data privacy: Do not scrape personal data without consent
- Copyright: Scraped content may be copyrighted
- Commercial use: Using scraped data commercially may have legal implications
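The robots.txt check in the list above can be automated with Python's standard library. The sketch below parses robots.txt content that has already been downloaded and asks whether a given user agent may fetch a URL. `is_allowed` is a hypothetical wrapper, and the sample rules are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules: everything is allowed except paths under /private/
SAMPLE_ROBOTS = "User-agent: *\nDisallow: /private/\n"

print(is_allowed(SAMPLE_ROBOTS, "PriceScraper", "https://example.com/prices"))
print(is_allowed(SAMPLE_ROBOTS, "PriceScraper", "https://example.com/private/data"))
```

To check a live site, `RobotFileParser` can also download the file itself through its `set_url()` and `read()` methods. Passing robots.txt only covers crawling rules; it does not replace reading the site's terms of service.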
Alternatives to GitHub Actions
If GitHub Actions limitations do not meet your needs:
- AWS Lambda + EventBridge: For more complex scraping with custom runtimes
- Google Cloud Functions + Cloud Scheduler: Serverless with free tier
- Cloudflare Workers: For lightweight scraping at edge
- Self-hosted runners: For longer timeouts and custom environments
- Dedicated scraping services: Apify, ScrapingBee, ScraperAPI
Real-World Use Cases
Price Tracking
Monitor product prices across e-commerce platforms and receive alerts when prices drop below your target.
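def track_prices():
    # scrape_price() stands for a site-specific helper you write yourself;
    # send_alert() is the email helper from the Monitoring section
    products = [
        {"url": "https://amazon.in/product1", "target": 5000},
        {"url": "https://flipkart.com/product2", "target": 3000},
    ]
    for product in products:
        price = scrape_price(product["url"])
        if price <= product["target"]:
            send_alert(
                subject="Price alert",
                body=f"{product['url']} is now Rs {price}!",
                to_email="you@example.com",
            )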
Job Board Aggregation
Scrape multiple job boards and consolidate listings into a single database.
def aggregate_jobs():
    # scrape_jobs(), deduplicate(), and save_to_database() are placeholders
    # for helpers you implement for your own sources and storage
    sources = [
        {"url": "https://naukri.com/search", "selector": ".job-card"},
        {"url": "https://linkedin.com/jobs", "selector": ".job-listing"},
        {"url": "https://indeed.co.in/jobs", "selector": ".result"},
    ]
    all_jobs = []
    for source in sources:
        jobs = scrape_jobs(source["url"], source["selector"])
        all_jobs.extend(jobs)
    # Deduplicate by job title + company
    unique_jobs = deduplicate(all_jobs)
    save_to_database(unique_jobs)
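The `deduplicate` step is left undefined above. One possible sketch, assuming each job is a dict with "title" and "company" keys (an assumption, since the guide does not specify the record shape), keeps the first listing seen for each normalized title and company pair:

```python
def deduplicate(jobs):
    """Keep the first listing for each (title, company) pair.

    Comparison ignores case and surrounding whitespace, so the same posting
    scraped from two boards with different capitalization counts as one.
    """
    seen = set()
    unique = []
    for job in jobs:
        key = (
            job["title"].strip().casefold(),
            job["company"].strip().casefold(),
        )
        if key in seen:
            continue
        seen.add(key)
        unique.append(job)
    return unique
```

Exact matching on normalized strings is a deliberate simplification. Listings whose titles differ slightly between boards would need fuzzier matching.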
News and Content Monitoring
Track news websites for specific keywords or topics and compile daily digests.
Real Estate Listing Aggregation
Monitor property listings across multiple real estate portals and track price trends in specific areas.
Stock Market Data Collection
Scrape financial data, earnings reports, and market indicators for analysis and backtesting.
Debugging Failed Workflows
When a scraper fails, debugging in GitHub Actions requires specific techniques:
Enable Debug Logging
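- name: Run scraper with debug
  run: python scraper.py --verbose
# Runner and step debug logs are not enabled from a step's env block.
# Set ACTIONS_RUNNER_DEBUG and ACTIONS_STEP_DEBUG to true as repository
# secrets or variables, or re-run the job with debug logging enabled.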
Save Screenshots for Headless Browser
from playwright.sync_api import sync_playwright


def debug_scrape():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://example.com")
        # Capture what the headless browser actually rendered
        page.screenshot(path="debug-screenshot.png")
        browser.close()
Upload the screenshot as an artifact to visually inspect what went wrong.
Check GitHub Actions Runner Environment
- name: Debug environment
  run: |
    echo "Python version: $(python --version)"
    echo "Installed packages:"
    pip list
    echo "Current directory: $(pwd)"
    echo "Files:"
    ls -la
Final Thoughts
GitHub Actions provides a free, reliable, and well-integrated platform for automated web scraping. With proper error handling, data storage, and monitoring, you can build production-grade scraping pipelines that run without manual intervention.
The key to success is respecting website policies, implementing robust error handling, and choosing the right storage solution for your use case. Start simple, iterate, and scale your scraping infrastructure as your needs grow.
Disclaimer: This article is for educational purposes only. Always respect website terms of service, robots.txt directives, and applicable laws when web scraping. Unauthorized scraping may violate terms of service or applicable laws. Consult legal counsel for your specific use case.