Parallel aggregation of dataframes
Dec 5, 2019 · 1 min read
Georg Heiler
Spark is known for its great API and for making it fairly easy to build widely scalable data processing jobs.
However, it is also known to be hard to fine-tune.
One trick might be handy for you: parallel aggregations. As the resilient distributed datasets (RDDs) backing DataFrames are immutable, it can make a lot of sense to operate on them in parallel if your cluster has enough resources (free slots).
To do so, start a Spark shell:
spark-shell
and execute:
val df = Seq((1, "A"), (2, "A"), (3, "B")).toDF("value", "group")
df.show
+-----+-----+
|value|group|
+-----+-----+
| 1| A|
| 2| A|
| 3| B|
+-----+-----+
Assuming you have several aggregations you want to perform, they are a good candidate for running in parallel.
val aggregations = Seq(1, 2, 3)
aggregations.par.foreach(a => {
  println(s"starting ${a}")
  df.filter('value === a).groupBy("value").count.write.parquet(s"result.parquet/group=${a}")
  println(s"completing ${a}")
})
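The same fan-out can also be expressed with plain Scala Futures, which avoids the parallel-collections dependency of .par on newer Scala versions. A minimal sketch, assuming the df and aggregations values from above are still in scope:
// Sketch: submit each aggregation as a Future so the Spark jobs run concurrently.
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

val jobs = aggregations.map { a =>
  Future {
    println(s"starting ${a}")
    df.filter('value === a).groupBy("value").count.write.parquet(s"result.parquet/group=${a}")
    println(s"completing ${a}")
  }
}
// Block until all aggregations have been written.
Await.result(Future.sequence(jobs), Duration.Inf)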
The depicted aggregations are a bit simplistic, but I think you can see the usefulness of this method.
If you truly want to process the aggregations in parallel, do not forget to set
--conf spark.scheduler.mode=FAIR
. Otherwise the jobs are submitted in parallel, but are handled in a FIFO queue. This can still make sense, though, if you have enough free slots.
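For example, the scheduler mode can be passed when launching the shell, or set when building the session programmatically. A minimal sketch (the app name is just a placeholder):
// On the command line:
//   spark-shell --conf spark.scheduler.mode=FAIR

// Or when constructing the SparkSession yourself:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parallel-aggregations") // placeholder name
  .config("spark.scheduler.mode", "FAIR")
  .getOrCreate()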
Summary
Using parallel operations, certain Spark jobs can be sped up considerably!