Dynamically select columns by type

Feb 22, 2019·
Dr. Georg Heiler
· 1 min read

In pandas it is really easy to select only the columns matching a certain data type:

df.select_dtypes(include=['float64'])
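`select_dtypes` also accepts several types at once, or an `exclude` list — a quick sketch with a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"id": [1], "count": [2], "name": ["hello"]})

# select several numeric types at once
numeric = df.select_dtypes(include=["int64", "float64"])

# or keep everything except object (string) columns
non_string = df.select_dtypes(exclude=["object"])

print(list(numeric.columns))     # → ['id', 'count']
print(list(non_string.columns))  # → ['id', 'count']
```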

Spark does not ship such a function out of the box, but it is easy to write one by hand:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, IntegerType}

// assumes a SparkSession available as `spark` (e.g. in spark-shell)
import spark.implicits._

val df = Seq(
  (1, 2, "hello")
).toDF("id", "count", "name")

// Select all columns whose data type matches colType.
def selectByType(colType: DataType, df: DataFrame): DataFrame = {
  val cols = df.schema
    .filter(field => field.dataType == colType)
    .map(field => col(field.name))
  df.select(cols: _*)
}

val res = selectByType(IntegerType, df) // keeps id and count
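The same idea generalizes to selecting several types at once — a sketch under the same assumptions (a running SparkSession; `selectByTypes` is a hypothetical helper, not a Spark API):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, IntegerType, StringType}

// Keep every column whose data type is in the given set.
def selectByTypes(colTypes: Set[DataType], df: DataFrame): DataFrame = {
  val cols = df.schema
    .filter(field => colTypes.contains(field.dataType))
    .map(field => col(field.name))
  df.select(cols: _*)
}

// e.g. selectByTypes(Set(IntegerType, StringType), df)
// would keep id, count and name from the frame above
```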