Spark descriptive name for cached dataframes

Jul 23, 2019 · Georg Heiler · 1 min read

Have you ever wondered what the cryptic names of cached DataFrames and RDDs in Spark’s web UI refer to? Usually no explicit name is set: when you call df.cache, Spark auto-generates the name from a snippet of the query plan. This is not very descriptive, especially when several tables are cached or the Spark cluster is shared by multiple users.

However, there is a better way:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK

    def namedCache(name: String, storageLevel: StorageLevel = MEMORY_AND_DISK)(
        df: DataFrame): DataFrame = {
      // cacheQuery is part of Spark's internal CacheManager; the optional
      // table name becomes the label shown in the web UI's Storage tab
      df.sparkSession.sharedState.cacheManager
        .cacheQuery(df, Some(name), storageLevel)
      df
    }

One can simply pass a name explicitly, and it shows up in the web UI. This greatly simplifies debugging for me.
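To illustrate, here is a minimal end-to-end sketch of how the helper might be used. The session setup, the example data, and the name `myDescriptiveName` are all illustrative assumptions, not part of the original post:

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK

    object NamedCacheDemo {
      // same helper as above
      def namedCache(name: String, storageLevel: StorageLevel = MEMORY_AND_DISK)(
          df: DataFrame): DataFrame = {
        df.sparkSession.sharedState.cacheManager
          .cacheQuery(df, Some(name), storageLevel)
        df
      }

      def main(args: Array[String]): Unit = {
        // hypothetical local session for the example
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("named-cache-demo")
          .getOrCreate()
        import spark.implicits._

        // an arbitrary example dataframe
        val df = Seq((1, "a"), (2, "b")).toDF("id", "label")

        // cache it under a human-readable name; an action is still needed
        // to materialize the cache, as with df.cache
        val cached = namedCache("myDescriptiveName")(df)
        cached.count()

        spark.stop()
      }
    }

After the count, the Storage tab of the web UI lists the cached entry as `myDescriptiveName` instead of a query-plan snippet. Note that cacheQuery is an internal API, so its signature may differ between Spark versions.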
