Spark descriptive name for cached dataframes

Jul 23, 2019 · Dr. Georg Heiler · 1 min read

Have you ever wondered what the cryptic names of cached DataFrames and RDDs in Spark’s web UI refer to? Usually no explicit name is set: when you call df.cache, Spark auto-generates the name from a snippet of the query plan. This is not very descriptive, especially when there are several cached tables or the Spark cluster is shared by multiple users.

However, there is a better way:

import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel
import org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK

// Cache the dataframe under an explicit name so it is
// identifiable in the Storage tab of the web UI.
def namedCache(name: String, storageLevel: StorageLevel = MEMORY_AND_DISK)(
    df: DataFrame): DataFrame = {
  df.sparkSession.sharedState.cacheManager
    .cacheQuery(df, Some(name), storageLevel)
  df
}

One can simply pass an explicit name, which then shows up in the web UI. This greatly simplifies debugging for me.
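For example, usage looks like the following minimal sketch (the dataframe and the chosen name are illustrative; a SparkSession `spark` is assumed to be in scope):

// Illustrative example: cache a small dataframe under a readable name.
val df = spark.range(100).toDF("id")
val cached = namedCache("my-range-table")(df)
// An action materializes the cache; "my-range-table" then
// appears in the Storage tab of the web UI.
cached.count()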

Dr. Georg Heiler
Senior data expert
Georg is a senior data expert at Magenta and an ML-ops engineer at ASCII. He solves challenges with data; his interests include geospatial graphs and time series. Georg is transitioning the data platform of Magenta to the cloud and handles large-scale multi-modal ML-ops challenges at ASCII.