Making BigData small again (and green)
Many years ago, web companies like Google and Yahoo faced problems scaling their data-processing landscape. In particular, data sizes grew beyond the RAM of the computers of that time: the RAM, the IO, and the reliability of a single machine could no longer cope with the massive size of the data. Scaling out to more than one machine solved these issues but brought additional complexity: any machine could fail, so computing tasks had to be retried, and data might arrive out of order - or not at all - due to unreliable networking. Initially, analysts had to be distributed-systems software engineers to cope with these problems, which was not a scalable approach for mass adoption.
Then Hadoop came around to alleviate parts of these problems and let analysts focus on the business logic. They still needed to translate their analytics into map-reduce tasks, but this was already much simpler than before. Next came the journey of enabling SQL-ish dialects in the Hadoop ecosystem: beginning with Hive (on map-reduce, then on TEZ) and continuing with Spark and massively parallel processing (MPP) systems like Presto, several systems emerged that are almost compatible with traditional SQL-based systems, so analysts could talk to and analyze big data in the way they were used to. Some, like Spark and Flink, even broadened the scope of large-scale data processing to near-realtime streaming data pipelines.
Big RAM is eating big data & the complexity of distributed systems
However, these very scalable tools are not for everyone:
- They come with complexity of their own - and not only at scale.
- They are designed to scale out to 100s if not 1000s of compute nodes. This design choice brings considerable overhead when they run on a single node or on a setup with 10s of compute nodes.
You Are Not Google and not Amazon - but you are not alone.
Think carefully about whether you really need such scalable tooling (and its overhead and complexity)!
Such deconstructed databases and data-intensive applications give great power to the user. But as the developer and user of these systems, you are forced to fine-tune them and to understand intricate details about consistency or indexing. No longer can you simply run a CREATE INDEX SQL statement and hope for the best.
This is certainly not for everyone!
The world outside of FAANG/MAANG has seen stories where simple bash scripts were more than 100x faster than the solutions mentioned above, thanks to CPU- and cache-local data processing:
Command-line Tools can be 235x Faster than your Hadoop Cluster
Furthermore, for many business users the world of big data begins where MS Excel ends. For scientists, a similar story often holds: big data starts where processing the data in pandas becomes unusably slow. There is a huge gap between such small-scale data applications and massive processing needs. However, this gap has started to close in recent years thanks to some very nice tools.
Big RAM is eating big data: In recent years, TBs of RAM have become cheaply accessible to many people! Often the RAM is larger than most of the datasets people want to process. Think twice whether you really want to deal with the complexity of a distributed system, or whether you would rather keep working in your tools of choice (probably R or Python) and have them scale to larger data. With the availability of cheap RAM and the development of tools that fill this sweet spot, developer productivity has changed a lot. You no longer need a second cluster just for Zookeeper to keep all the tools happily in sync: modern distributed tools come with an integrated implementation of a distributed consensus algorithm like Paxos or Raft. They deploy easily to the various clouds, allowing for more efficient and more deterministic long-tail latencies when processing data. But above all, they can easily be set up on a local machine such as your laptop. This can dramatically increase the productivity of your developers.
complexity of distributed systems: Once at a big data conference, I was told that the good thing about the Hadoop ecosystem is that it is written in Java, and that Java is better than Cobol because it lacks goto statements. However, in the discussion, we soon concluded that the Hadoop distribution offered for on-premise installation still caters to complex distributed deployments - in particular due to the missing unit and integration tests for the data pipelines and the very complex state mutation. With all the services that need to run for the system to function (Zookeeper, the storage system, the cluster manager, monitoring), there is quite some overhead before you can even start to run your analytics code. We concluded that the next step forward is cloud-based deployments where compute and storage are truly separated. For software developers who rely heavily on cloud services, LocalStack might be a solution. For data developers, however, progress can only be made with access to the data: they need to be able to touch and feel it. On-premise Hadoop installations in particular have serious shortcomings at scale regarding IO isolation: it can be hard to prevent an analyst from paralyzing an operationally critical production system with a heavy query.
modern data tooling
In the following section, I will showcase some of the tools - and hope they can simplify your data pipelines, increase developer productivity, and help you save energy by allowing you to achieve more with less:
- fewer compute hours due to increased processing efficiency
- less development time due to straightforward APIs and simple local deployments for development
Some of these tools are not meant to process truly big datasets where only distributed IO is a solution. But many companies and people do not have these use cases and can save a vast amount of time and energy by adopting such modern data tools.
Most of these tools have a close relationship to Rust and compile down to a single binary - which allows for a straightforward deployment (even in a multi-node setup) while staying super efficient for local development.
data analytics & science
First and foremost, one has to conclude that classical RDBMS-style databases have caught up a lot:
- processing of semi-structured data
- various addons for GPU, temporal, or geospatial processing
- parallel joins http://blog.cleverelephant.ca/2019/05/parallel-postgis-4.html
- integrations into the world of AI https://cloudblogs.microsoft.com/sqlserver/2018/09/24/sql-server-2019-preview-combines-sql-server-and-apache-spark-to-create-a-unified-data-platform/
These capabilities are now easily accessible from the tools that analysts are used to working with, namely traditional RDBMSs like PostgreSQL or Microsoft SQL Server.
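As a minimal sketch of how semi-structured data can be handled in a classical RDBMS - assuming a locally running PostgreSQL instance reachable via the psycopg2 driver; the connection string, table, and column names are made up for illustration:

```python
# A minimal sketch: querying semi-structured (JSON) data in plain PostgreSQL.
# The connection string, table, and column names are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=analytics user=analyst")
with conn, conn.cursor() as cur:
    # Store raw events as JSONB next to regular relational columns.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events (
            id      bigserial PRIMARY KEY,
            payload jsonb NOT NULL
        )
    """)
    cur.execute(
        "INSERT INTO events (payload) VALUES (%s)",
        ('{"user_id": "alice", "action": "click", "ms": 42}',),
    )
    # JSONB operators (->> and @>) allow filtering and extraction inside SQL.
    cur.execute("""
        SELECT payload->>'user_id' AS user_id, count(*) AS clicks
        FROM events
        WHERE payload @> '{"action": "click"}'
        GROUP BY 1
    """)
    print(cur.fetchall())
conn.close()
```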
When Excel stops working, pandas - perhaps with Mito for a business user - might already solve all your problems quickly. If you prefer working in R, then data.table might already solve all your performance issues. Once your dataset grows further, multi-core processing becomes essential to fully utilize modern many-core CPUs.
The polar bear is much stronger than a koala bear: polars beats pandas by a wide margin by embracing multi-core processing. Like its established distributed-systems counterparts, it offers an API for very efficient lazy computation. And by additionally offering a pandas-like eager API, analysts or data scientists gradually transitioning to larger datasets can scale their data-processing skills without learning too many new APIs.
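A minimal sketch of the lazy API - the CSV file and column names below are made up for illustration:

```python
# A minimal sketch of polars' lazy API; file and column names are made up.
# scan_csv builds a query plan instead of reading the file eagerly, and
# collect() runs the optimized plan in parallel across all available cores.
import polars as pl

lazy = (
    pl.scan_csv("events.csv")                      # nothing is read yet
      .filter(pl.col("ms") > 100)                  # predicate pushdown into the scan
      .select(pl.col("ms").mean().alias("avg_ms")) # aggregate only what is needed
)

df = lazy.collect()  # only now is the file read and the plan executed
print(df)
```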
If you prefer to work with SQL, you have probably come across SQLite. It is meant as an OLTP database - nonetheless, it can be quite fast even for analytics, as the tedious process of inserting data is often faster than in other databases. Still, it is not an OLAP analytical solution. DuckDB, a newer system, offers a similarly easy deployment (simply pip install it). It behaves similarly to polars regarding scalability and offers an API in plain SQL. Given the advent of the modern data stack with DBT (the data build tool), this has the advantage that analysts can directly include very scalable SQL-based processing in their DBT workflows - even without a central infrastructure. You could even run one instance of DuckDB in every lambda function for pretty much straightforward and infinite scalability, without the hassle of setting up a ClickHouse installation (a minimal DuckDB sketch follows the quote below):
If you use @duckdb, you can get full in-process SQL capability right at the data source! At speeds approaching ClickHouse, it can absolutely fly. Lightning fast, with ability to directly query parquet or Arrow. Game changer!
— Alex Monahan (@__AlexMonahan__) March 20, 2021
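As a minimal sketch - assuming DuckDB installed via pip and a local parquet file whose name and columns are made up for illustration:

```python
# A minimal sketch: in-process OLAP SQL with DuckDB, querying a parquet file
# directly. The file name and column names are hypothetical.
import duckdb

con = duckdb.connect()  # in-memory database, no server to set up
result = con.execute(
    """
    SELECT user_id, avg(ms) AS avg_ms
    FROM 'events.parquet'
    GROUP BY user_id
    ORDER BY avg_ms DESC
    """
).fetchdf()  # returns a pandas DataFrame
print(result)
```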
In the domain of machine learning, graph, and AI processing, where iterative algorithms process and learn from the data until they converge, accelerator chips such as GPUs or TPUs can offer dramatic performance gains over traditional Hadoop- or Spark-based setups, where a lot of time is traditionally spent on shuffle operations:
- rapidsai: Nvidia has implemented various data-frame, machine-learning, signal-processing, geospatial- and graph-analytics functions on the GPU with great success. There is an effort under way towards integrating RAPIDS with Spark.
- Researchers in the domain of AI often like the easy numpy API. JAX pretty much follows this API but compiles to optimized native code and to code for accelerators such as GPUs and TPUs.
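A minimal sketch of JAX's numpy-like API - the toy loss function below is made up for illustration:

```python
# A minimal sketch of JAX's numpy-like API; the toy loss function is made up.
# jit compiles the function via XLA for CPU/GPU/TPU, grad differentiates it.
import jax
import jax.numpy as jnp

def loss(w, x, y):
    # ordinary numpy-style code, but traceable, compilable, and differentiable
    pred = jnp.dot(x, w)
    return jnp.mean((pred - y) ** 2)

grad_loss = jax.jit(jax.grad(loss))  # compiled gradient w.r.t. the weights

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (1024, 8))
w = jnp.zeros(8)
y = jnp.ones(1024)

print(grad_loss(w, x, y))  # runs on a GPU/TPU if available, otherwise on the CPU
```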
streaming computation and messaging
Redpanda is built for lower tail latencies and a simple single-binary deployment. With a Kafka-compatible API plus a nice REST API & schema registry included in that single binary, the developer experience is improved even further (a minimal client sketch follows the quote below).
Redpanda’s performance capabilities were another aspect that caught our attention. We did a benchmark connecting Redpanda to our Databricks Spark cluster, and the difference in performance from Redpanda versus Kafka was orders of magnitude. We were able to thoroughly saturate hundred-to-one nodes on the Spark side versus the size of the Redpanda cluster. https://redpanda.com/blog/scaling-predictive-analytics-pipeline-redpanda/
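Because Redpanda speaks the Kafka protocol, an ordinary Kafka client can be pointed at it unchanged. A minimal sketch using the kafka-python library - the broker address and topic name are assumptions:

```python
# A minimal sketch: talking to Redpanda with a plain Kafka client, since the
# broker is Kafka-API compatible. Broker address and topic name are assumptions.
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clickstream", b'{"user_id": "alice", "action": "click"}')
producer.flush()

consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating when no new messages arrive
)
for message in consumer:
    print(message.value)
```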
materialize is built on top of timely-dataflow. By adhering to the API of Postgres, it is familiar to analysts from the first second. Computations are expressed as continuously maintained, materialized dataflows. The deployment is a single binary.
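Because Materialize speaks the Postgres protocol, a regular Postgres driver is enough to define and query a continuously updated view. A minimal sketch with psycopg2 - the host, port, credentials, and the source and view names are assumptions:

```python
# A minimal sketch: since Materialize speaks the Postgres wire protocol, a
# standard Postgres driver works. Connection details and object names are assumptions.
import psycopg2

conn = psycopg2.connect(host="localhost", port=6875,
                        dbname="materialize", user="materialize")
conn.autocommit = True
with conn.cursor() as cur:
    # Define a view that is incrementally kept up to date as new data arrives
    # (on top of a hypothetical, already-created source called 'clicks').
    cur.execute("""
        CREATE MATERIALIZED VIEW clicks_per_user AS
        SELECT user_id, count(*) AS clicks
        FROM clicks
        GROUP BY user_id
    """)
    # Reading the view returns the continuously maintained result.
    cur.execute("SELECT * FROM clicks_per_user ORDER BY clicks DESC LIMIT 10")
    for row in cur.fetchall():
        print(row)
conn.close()
```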
You can find some great examples below, which often combine DBT, Redpanda, and Materialize for simple, SQL-based, yet very scalable near-realtime data processing:
- https://github.com/MaterializeInc/mz-hack-day-2022
- https://github.com/MaterializeInc/ecommerce-demo
- https://devdojo.com/bobbyiliev/how-to-use-dbt-with-materialize-and-redpanda
- https://github.com/joacoc/BlockchainTail
- One of my recent blog posts shows an integration of Materialize with DBT
summary
By using modern hardware efficiently and compiling down to a single binary, many of the tools introduced above are easy to deploy and very fast. In many cases, these tools make a cluster unnecessary or shrink its size drastically. Given how cheap RAM has become in recent years, even a single-node deployment on a beefy compute instance might cover most of your data needs. This saves energy, makes the field of data processing greener, and dramatically improves the developer experience. Given only finite resources and energy, I do not think it can be right to keep using big-data tools on small-data problems.
Furthermore, it sometimes makes a lot of sense not to store all the data in the first place. By thinking about concrete use cases and working with small and wide data, you can be more insightful while also being less exposed to legal problems and to distribution shifts in the features, such as those that happened during the COVID pandemic.
A big thank you to https://illlustrations.co/ (https://twitter.com/realvjy) for the illustration of the featured image.