Dynamically select columns by type

Feb 22, 2019 · Georg Heiler · 1 min read

In pandas it is really easy to select only columns matching a certain data type:

df.select_dtypes(include=['float64'])

In Spark, such a function is not included by default. However, it can easily be written by hand:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, IntegerType}

// assumes a SparkSession's implicits are in scope (e.g. in spark-shell) for toDF
val df = Seq(
  (1, 2, "hello")
).toDF("id", "count", "name")

// keep only the columns whose data type matches colType
def selectByType(colType: DataType, df: DataFrame): DataFrame = {
  val cols = df.schema.toList
    .filter(x => x.dataType == colType)
    .map(c => col(c.name))
  df.select(cols: _*)
}

val res = selectByType(IntegerType, df)
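
For the example DataFrame above, only the two integer columns survive, so calling show on the result should print something along these lines (a sketch of the expected output, not captured from a run):

res.show()
// +---+-----+
// | id|count|
// +---+-----+
// |  1|    2|
// +---+-----+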