Scala _* to select a list of dataframe columns

I have a dataframe and a list of columns like this:
import spark.implicits._
import org.apache.spark.sql.functions._
val df = spark.createDataFrame(Seq(("Java", "20000"), ("Python", "100000"))).toDF("language","users_count")
val data_columns = List("language","users_count").map(x=>col(s"$x"))
Why does this work:
df.select(data_columns: _*).show()
But not this?
df.select($"language", data_columns: _*).show()
Gives the error:
error: no `: _*' annotation allowed here
(such annotations are only allowed in arguments to *-parameters)
And how do I get it to work so I can use _* to select all the columns in a list, while also specifying some additional columns in the same select?
Thanks!
Update:
Based on @chinayangyangyong's answer below, this is how I solved it:
df.select($"language" +: data_columns: _*)
Prepending the single Column to the list yields one Seq[Column], which can then be expanded as a single vararg.

It is because there is no method on DataFrame with the signature select(col: Column, cols: Column*): DataFrame, but there is one with the signature select(cols: Column*): DataFrame, which is why your first example works.
Interestingly, your second example would work if you were using String to select the columns since there is a method select(col: String, cols: String*): DataFrame.

df.select(data_columns.head, data_columns.tail: _*).show()
(Note this only compiles when data_columns holds Strings, via select(col: String, cols: String*); with a List[Column] it hits the same vararg restriction as above.)
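For comparison, a minimal runnable sketch of that String-based route, reusing df from the question (name_columns is a hypothetical list of plain column names):
val name_columns = List("language", "users_count") // Strings, not Columns
df.select(name_columns.head, name_columns.tail: _*).show()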

Related

How to repartition a dataframe based on more than one column?

I have a dataframe: yearDF with the following columns: name, id_number, location, source_system_name, period_year.
If I want to repartition the dataframe based on a column, I'd do:
yearDF.repartition(col("source_system_name"))
I have a variable: val partition_columns = "source_system_name,period_year"
I tried to do it this way:
val dataDFPart = yearDF.repartition(col(${partition_columns}))
but I get a compilation error: cannot resolve the symbol $
Is there any way I can repartition the dataframe yearDF based on the columns named in partition_columns?
There are three overloads of the repartition function in Spark's Scala API:
def repartition(partitionExprs: Column*): Dataset[T]
def repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T]
def repartition(numPartitions: Int): Dataset[T]
So in order to repartition on multiple columns, you can split your string on the comma and use Scala's vararg expansion on the result, like this:
val columns = partition_columns.split(",").map(x => col(x))
yearDF.repartition(columns: _*)
Another way to do it is to pass each column explicitly:
yearDF.repartition(col("source_system_name"), col("period_year"))
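If you also want to control the partition count, the numPartitions overload listed above combines with the same expanded columns. A sketch (200 is an arbitrary example value, not from the original question):
// Repartition into 200 partitions by the same columns.
yearDF.repartition(200, columns: _*)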

How to change column type for a list of dataframe columns

I'm trying to change the type of a list of columns for a Dataframe in Spark 1.6.0.
However, all the examples I have found so far only show how to cast a single column (df.withColumn) or all the columns in the dataframe:
val castedDF = filteredDf.columns.foldLeft(filteredDf)((filteredDf, c) => filteredDf.withColumn(c, col(c).cast("String")))
Is there any efficient, batch way of doing this for a list of columns in the dataframe?
There is nothing wrong with withColumn* but you can use select if you prefer:
import org.apache.spark.sql.functions.col
val columnsToCast: Set[String]
val outputType: String = "string"
df.select(df.columns map (
  c => if (columnsToCast.contains(c)) col(c).cast(outputType) else col(c)
): _*)
* The execution plan is the same for a single select as for the chained withColumn version.
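A self-contained sketch of the select approach (the data, column names, and cast set are made up for illustration; spark is an existing SparkSession, so on the question's Spark 1.6 you would build the DataFrame from a SQLContext instead):
import org.apache.spark.sql.functions.col
import spark.implicits._

// Hypothetical input and column choice, for illustration only.
val df = Seq((1, 2.0, "a"), (3, 4.0, "b")).toDF("id", "score", "label")
val columnsToCast: Set[String] = Set("id", "score")
val outputType: String = "string"

val casted = df.select(df.columns.map(c =>
  if (columnsToCast.contains(c)) col(c).cast(outputType) else col(c)
): _*)
casted.printSchema() // id and score now come out as string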

spark select expression available only in R but not in Scala

What is the name of Spark's selectExpr (https://docs.databricks.com/spark/latest/sparkr/functions/selectExpr.html) when using Spark with Scala?
Edit:
How can I use it in a withColumn statement?
val scalarInput = 123
df.withColumn("foo", selectExpr("""someHiveUDF( ${scalarInput}, column)"""))
fails with
selectExpr not Found
The name is exactly the same, selectExpr:
def selectExpr(exprs: String*): DataFrame
For the withColumn case, expr from org.apache.spark.sql.functions._ is what you want!
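A minimal sketch of the withColumn version using expr (this assumes someHiveUDF is already registered and that df has a column named column, both carried over from the question):
import org.apache.spark.sql.functions.expr

val scalarInput = 123
// Note the s interpolator: without it, ${scalarInput} would be passed
// to the SQL parser literally instead of being substituted.
df.withColumn("foo", expr(s"someHiveUDF($scalarInput, column)"))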

Spark Select with a List of Columns Scala

I am trying to find a good way of doing a spark select with a List[Column]. I am exploding a column, then passing back all the columns I am interested in along with my exploded column.
var columns = getColumns(x) // Returns a List[Column]
tempDf.select(columns) //trying to get
Trying to find a good way of doing this, I know that if it were a list of strings I could do something like:
val result = dataframe.select(columnNames.head, columnNames.tail: _*)
For Spark 2.0 it seems that you have two options. Both depend on how you manage your columns (Strings or Columns).
Spark code (spark-sql_2.11/org/apache/spark/sql/Dataset.scala):
def select(cols: Column*): DataFrame = withPlan {
  Project(cols.map(_.named), logicalPlan)
}
def select(col: String, cols: String*): DataFrame = select((col +: cols).map(Column(_)) : _*)
You can see how, internally, Spark converts the head & tail into a list of Columns and calls select again.
So in that case, if you want clear code, I would recommend:
If columns: List[String]:
import org.apache.spark.sql.functions.col
df.select(columns.map(col): _*)
Otherwise, if columns: List[Column]:
df.select(columns: _*)
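Applied to the explode scenario from the question, a sketch (here "items" is a hypothetical array column; columns is the asker's List[Column]):
import org.apache.spark.sql.functions.{col, explode}

// Keep the columns of interest and add the exploded column alongside them.
tempDf.select(columns :+ explode(col("items")).as("item"): _*)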

How to set column names to toDF() function in spark dataframe using a string array?

For example,
val columns = Array("column1", "column2", "column3")
val df = sc.parallelize(Seq(
  (1, "example1", Seq(0,2,5)),
  (2, "example2", Seq(1,20,5)))).toDF(columns)
How can I set the column names using a string array?
Is it possible to specify data types inside toDF()?
toDF() takes a repeated parameter of type String, so you can use the _* type annotation to pass a sequence:
val df = sc.parallelize(Seq(
  (1, "example1", Seq(0,2,5)),
  (2, "example2", Seq(1,20,5)))).toDF(columns: _*)
For more on repeated parameters - see section 4.6.2 in the Scala Language Specification.
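As a tiny plain-Scala illustration of a repeated parameter (nothing Spark-specific; greet is a made-up example):
def greet(names: String*): String = names.mkString("Hello ", ", ", "!")

greet("Ann", "Bob") // pass arguments directly
greet(Array("Ann", "Bob"): _*) // or expand a whole sequence with : _*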
val df = sc.parallelize(Seq(
  (1, "example1", Seq(0,2,5)),
  (2, "example2", Seq(1,20,5)))).toDF("column1", "column2", "column3")
toDF() takes comma-separated strings.
toDF() is defined in Spark documentation as:
def toDF(colNames: String*): DataFrame
And so you need to turn your array into varargs, as also described above. That means you need to do the following:
val columns = Array("column1", "column2", "column3")
val df = sc.parallelize(Seq(
  (1, "example1", Seq(0,2,5)),
  (2, "example2", Seq(1,20,5)))).toDF(columns: _*)
(Add : _* to columns in toDF.)
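As for the second part of the question: toDF(colNames: String*) only renames columns and takes no data types. A common workaround is to cast after renaming; a sketch (the target type here is an arbitrary example):
import org.apache.spark.sql.functions.col

// toDF only renames; cast afterwards for any column whose type should change.
val typedDF = df.withColumn("column1", col("column1").cast("long"))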