Self-Hosting Airflow at Home: Automating Stock Price Data Collection
Summary
This article walks through self-hosting Apache Airflow in a homelab to automate stock price data collection for MLOps workflows. It first makes Airflow robust by running the scheduler, DAG processor, triggerer, and API server (port 8080) as systemd services, so each component runs continuously and restarts automatically. The author then sets up a PostgreSQL connection in the Airflow UI so that pipelines can interact with a database. A Python DAG uses the yfinance package to extract 5 years of historical price data for the Canadian equities listed in watchlist_cad.csv and loads it into the finance.watchlist_cad_ticker_price table daily. Finally, the article covers deploying DAG code from a Windows development machine to the dags folder on the Airflow server through a GitHub repository, which gives version control and streamlined updates. Planned improvements include Docker for environment isolation and alerting.
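As a rough illustration of the extract step described above, the sketch below reads ticker symbols from watchlist-style CSV text and requests five years of daily history per symbol. The column name `ticker`, the sample symbols, and both helper names are assumptions for illustration; the original article's DAG code is not reproduced here. yfinance is imported lazily so the parsing helper runs without it installed.

```python
import csv
import io


def read_tickers(csv_text, column="ticker"):
    """Return the non-empty, de-duplicated ticker symbols from CSV text.

    The column name "ticker" is an assumption; adjust it to the real header.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = []
    for row in reader:
        symbol = (row.get(column) or "").strip().upper()
        if symbol and symbol not in seen:
            seen.append(symbol)
    return seen


def fetch_history(symbol, period="5y"):
    """Fetch daily price history for one symbol as a pandas DataFrame."""
    import yfinance as yf  # lazy import; requires `pip install yfinance`

    return yf.Ticker(symbol).history(period=period, interval="1d")


# Hypothetical watchlist contents (TSX symbols carry a ".TO" suffix on Yahoo Finance).
sample = "ticker,name\nRY.TO,Royal Bank\nshop.to,Shopify\nRY.TO,Royal Bank\n"
tickers = read_tickers(sample)
print(tickers)
```

In a DAG, a task could call `fetch_history` for each symbol and hand the resulting rows to a load task; network access and the yfinance package would be needed on the Airflow server for that part.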
Key takeaway
MLOps and data engineers building automated data foundations should prioritize a robust Airflow deployment: daemonize its components with systemd for continuous operation, and use Git for version-controlled DAG deployment. This keeps data pipelines, such as one that collects stock prices via yfinance into PostgreSQL, resilient and easy to update. For future scalability, consider Docker operators to manage Python environment dependencies, and add alerting for pipeline failures.
Key insights
Self-hosting Airflow enables robust, automated data pipelines for MLOps, even on a homelab.
Principles
- Daemonize critical services for reliability.
- Version control simplifies deployment and updates.
- Isolate Python environments for DAGs.
Method
Daemonize the Airflow components with systemd, configure a PostgreSQL connection in the Airflow UI, develop DAGs using yfinance and pandas, and deploy them via Git to the dags folder, making sure the Python packages the DAGs import are installed on the Airflow server.
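The daemonizing step can be sketched as one systemd unit per Airflow component. The snippet below only renders unit file text. The service user, the path to the `airflow` executable, and the unit file names are hypothetical, and `api-server` is the Airflow 3 subcommand (Airflow 2 uses `webserver` instead). Treat it as a starting point under those assumptions, not the author's exact configuration.

```python
UNIT_TEMPLATE = """\
[Unit]
Description=Airflow {component}
After=network.target

[Service]
User={user}
ExecStart={airflow_bin} {command}
Restart=always
RestartSec=5s

[Install]
WantedBy=multi-user.target
"""

# Component name -> airflow CLI arguments (port 8080 as in the summary).
COMPONENTS = {
    "scheduler": "scheduler",
    "dag-processor": "dag-processor",
    "triggerer": "triggerer",
    "api-server": "api-server --port 8080",
}


def render_unit(component, user="airflow",
                airflow_bin="/opt/airflow/venv/bin/airflow"):
    """Render systemd unit text for one component (user and path are assumed)."""
    return UNIT_TEMPLATE.format(
        component=component,
        user=user,
        airflow_bin=airflow_bin,
        command=COMPONENTS[component],
    )


# One unit file per component, keyed by a hypothetical file name.
units = {f"airflow-{name}.service": render_unit(name) for name in COMPONENTS}
print(units["airflow-api-server.service"])
```

Each rendered file would typically be saved under `/etc/systemd/system/`, followed by `systemctl daemon-reload` and `systemctl enable --now airflow-scheduler.service` (and likewise for the other units). `Restart=always` is the directive that provides the automatic restarts.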
In practice
- Use systemd for Airflow component resilience.
- Use PostgresHook with the configured connection for database access.
- Deploy DAGs via Git clone/pull.
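The clone-or-pull deployment in the last item could look roughly like the sketch below: clone the repository the first time, then fast-forward on later runs. The repository URL, the function names, and the idea of using a checkout as the dags folder are assumptions for illustration. The demo only builds the commands inside a temporary directory and does not contact GitHub.

```python
import subprocess
import tempfile
from pathlib import Path

# Hypothetical repository URL; replace with the real DAG repository.
REPO = "https://github.com/example/airflow-dags.git"


def sync_command(repo_url, dags_dir):
    """Return the git command that brings dags_dir up to date with repo_url."""
    dags_path = Path(dags_dir)
    if (dags_path / ".git").exists():
        # Existing checkout: fast-forward only, so local edits are never merged silently.
        return ["git", "-C", str(dags_path), "pull", "--ff-only"]
    # No checkout yet: clone straight into the dags folder.
    return ["git", "clone", repo_url, str(dags_path)]


def sync_dags(repo_url, dags_dir):
    """Run the sync command and raise CalledProcessError if git fails."""
    subprocess.run(sync_command(repo_url, dags_dir), check=True)


# Dry run: show which command is chosen before and after a checkout exists.
with tempfile.TemporaryDirectory() as tmp:
    dags = Path(tmp) / "dags"
    first = sync_command(REPO, dags)
    (dags / ".git").mkdir(parents=True)  # simulate a completed clone
    later = sync_command(REPO, dags)

print(first[:2], later[-2:])
```

Note that `git clone` refuses to write into a non-empty directory, so an existing dags folder would need to be emptied or the repository cloned elsewhere and linked in.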
Topics
- Apache Airflow
- MLOps
- Data Pipelines
- PostgreSQL
- yfinance
- systemd
- Git Deployment
Best for: MLOps Engineer, Data Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Data Science on Medium.