Spark descriptive name for cached dataframes

Jul 23, 2019 · Dr. Georg Heiler · 1 min read

Have you ever wondered what the cryptic names of cached DataFrames and RDDs in Spark’s web UI refer to? Usually no explicit name is set: when you call df.cache, Spark auto-generates the name from a snippet of the query plan. This is not very descriptive, especially when there are several cached tables or the Spark cluster is shared by multiple users.

However, there is a better way:

import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel
import org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK

// Cache the dataframe under an explicit name so it is
// identifiable in the Storage tab of the web UI.
def namedCache(name: String, storageLevel: StorageLevel = MEMORY_AND_DISK)(
    df: DataFrame): DataFrame = {
  df.sparkSession.sharedState.cacheManager
    .cacheQuery(df, Some(name), storageLevel)
  df
}

One can simply pass an explicit name, which then shows up in the web UI. This greatly simplifies debugging for me.
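For example, usage looks like the following minimal sketch (the dataframe and the chosen name are illustrative; a SparkSession `spark` is assumed to be in scope):

// Illustrative example: cache a small dataframe under a readable name.
val df = spark.range(100).toDF("id")
val cached = namedCache("my-range-table")(df)
// An action materializes the cache; "my-range-table" then
// appears in the Storage tab of the web UI.
cached.count()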

Dr. Georg Heiler
Senior data expert
Georg is a senior data expert at Magenta and an ML-ops engineer at ASCII. He solves challenges with data; his interests include geospatial graphs and time series. Georg is transitioning the data platform of Magenta to the cloud and handles large-scale multi-modal ML-ops challenges at ASCII.