Cloud arbitrage for Spark pipelines
TL;DR: Spark-based data PaaS solutions are convenient, but they come with their own set of challenges, such as strong vendor lock-in and obscured costs. We show how a dedicated orchestrator (dagster-pipes) can make Databricks an implementation detail, save costs, and improve developer productivity. It allows you to take back control.
Introduction
The big data landscape is changing fast, and Spark-based PaaS solutions are part of this change. They offer both convenience and power for data processing and analytics. But this ease has downsides: lock-in risks, hidden operating costs, and scope creep, all of which reduce developer productivity. It also becomes easy to spend resources without understanding their cost, which often leads to inflated spending. In fact, most commercial platforms try to become all-encompassing, violating the Unix philosophy of doing one thing, but doing it well. This results in further lock-in.
The data we process at ASCII is huge. One of the largest datasets we use is Common Crawl: every couple of months, roughly 400 TiB of uncompressed (about 88 TiB compressed) data is added to it. This makes optimizing for performance and cost crucial.
Inspired by the integration of Dagster, dbt and DuckDB for medium-sized datasets, this post shows how to scale that concept to extremely large datasets while building on the same principles:
- Containerization & testability & taking back control
- Partitions and powerful orchestration & easy backfills
- Developer productivity
- Cost control
We use Dagster's remote execution integration (dagster-pipes), which abstracts away the specific execution engine. We support the following flavours of Apache Spark:
- PySpark
- Databricks
- EMR
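The core idea can be sketched in a few lines: every backend exposes the same launch interface, so the business logic never needs to know where it runs. The launcher names and payloads below are illustrative stand-ins for the real dagster-pipes clients, not their actual APIs.

```python
"""Minimal sketch of the engine abstraction: one shared launch
interface per Spark flavour, selected by configuration rather than
by code changes. All names here are hypothetical."""

from typing import Callable, Dict


def run_local_pyspark(job: str) -> str:
    # In reality: start a local PySpark session and execute `job`.
    return f"pyspark ran {job}"


def run_databricks(job: str) -> str:
    # In reality: submit `job` as a run via the Databricks Jobs API.
    return f"databricks ran {job}"


def run_emr(job: str) -> str:
    # In reality: submit `job` as a step to an EMR cluster.
    return f"emr ran {job}"


# The engine becomes an implementation detail: swapping flavours
# means changing a config value, not the business logic.
LAUNCHERS: Dict[str, Callable[[str], str]] = {
    "pyspark": run_local_pyspark,
    "databricks": run_databricks,
    "emr": run_emr,
}


def launch(flavour: str, job: str) -> str:
    return LAUNCHERS[flavour](job)
```

In the real setup, dagster-pipes plays the role of `LAUNCHERS`: the orchestrator picks a Pipes client per environment, while the Spark job itself stays unchanged.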
This lets us treat Spark-based data PaaS platforms as an implementation detail, which saves cost, boosts developer productivity, and reduces lock-in. In particular, we can mix in Databricks' extra capabilities where needed, while running most workloads on EMR for less money, all without changing the business logic. Following software engineering best practices for non-notebook development also becomes easy again, resulting in a more maintainable codebase. We run this setup in production at ASCII and observe huge cost savings due to:
- ✅ Flexible environment selection: One job processing Common Crawl data in Spark cost over 700€ for a single partition on Databricks, including an approximately 50% Databricks markup for convenience features. The same job now costs us less than 400€.
- ✅ Developer productivity & taking back control: Running PySpark locally on small sample data allows rapid prototyping, with no need to wait 10 minutes for cloud VMs to spin up. This enables a fast development cycle and a tight feedback loop.
- ✅ Flexible orchestration: We can easily add partitions and orchestrate steps, both on-premise and in the cloud.
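The routing decision behind flexible environment selection can be sketched as a small function. The thresholds and the feature flag are illustrative, not our actual production policy:

```python
def choose_backend(data_size_gib: float, needs_databricks_features: bool) -> str:
    """Route each partition to the cheapest adequate engine.
    Thresholds are hypothetical examples."""
    if data_size_gib < 10:
        # Small sample data: run locally for instant feedback, zero cloud cost.
        return "pyspark"
    if needs_databricks_features:
        # Pay the Databricks markup only when its extras are actually required.
        return "databricks"
    # Bulk workloads go to cheaper infrastructure by default.
    return "emr"
```

Because the orchestrator abstracts the engine, this decision is pure configuration: the same business logic runs regardless of what the function returns.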