Dynamically select columns by type

Feb 22, 2019 · Georg Heiler · 1 min read

In pandas it is really easy to select only columns matching a certain data type:

df.select_dtypes(include=['float64'])

In Spark, such a function is not included by default. However, it can easily be written by hand:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, IntegerType}

// assumes a SparkSession's implicits are in scope (e.g. in spark-shell) for toDF
val df = Seq(
  (1, 2, "hello")
).toDF("id", "count", "name")

// keep only the columns whose data type matches colType
def selectByType(colType: DataType, df: DataFrame): DataFrame = {
  val cols = df.schema.toList
    .filter(x => x.dataType == colType)
    .map(c => col(c.name))
  df.select(cols: _*)
}

val res = selectByType(IntegerType, df)
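
For the example DataFrame above, only the two integer columns survive, so calling show on the result should print something along these lines (a sketch of the expected output, not captured from a run):

res.show()
// +---+-----+
// | id|count|
// +---+-----+
// |  1|    2|
// +---+-----+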