Spark descriptive name for cached dataframes
Jul 23, 2019 · 1 min read
Dr. Georg Heiler
Have you ever wondered what the cryptic names of cached DataFrames and RDDs in Spark’s web UI refer to?
Usually no explicit name is set.
When you call df.cache, Spark auto-generates the name from a snippet of the query plan.
This is not very descriptive, especially when several tables are cached or the Spark cluster is shared by multiple users.
However, there is a better way:
import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel
import org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK

// Cache a DataFrame under an explicit, human-readable name that
// appears in the Storage tab of Spark's web UI.
def namedCache(name: String, storageLevel: StorageLevel = MEMORY_AND_DISK)(
    df: DataFrame): DataFrame = {
  df.sparkSession.sharedState.cacheManager
    .cacheQuery(df, Some(name), storageLevel)
  df
}
With this helper one can explicitly pass a name, and that name then shows up in the web UI. This greatly simplifies debugging for me.
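For illustration, a minimal usage sketch of the helper above. The DataFrame contents and the cache name "users" are made up for the example; everything else is standard Spark API:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("named-cache-demo").getOrCreate()
import spark.implicits._

// A small, hypothetical example DataFrame.
val users = Seq(("alice", 1), ("bob", 2)).toDF("name", "id")

// Cache it under a readable name; "users" now appears in the
// Storage tab of the web UI instead of a query-plan snippet.
val cached = namedCache("users")(users)

// An action materializes the cache.
cached.count()

// Release the cached data when it is no longer needed.
cached.unpersist()
```

Note that cacheQuery is part of sharedState.cacheManager, an internal Spark API, so it may change between Spark versions.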

Authors
Senior data expert
Georg is a senior data expert at Magenta and an ML-ops engineer at ASCII.
He solves challenges with data; his interests include geospatial graphs
and time series. Georg is transitioning Magenta's data platform to the cloud
and handles large-scale multi-modal ML-ops challenges at ASCII.