How to apply Spark's col() to an Array[String] containing column names? - scala

If I have an Array[String] that contains the columns I need to use in the select() function, how can I apply them in the cleanest way?
.select(from_json(col("value").cast("string"), schema).as("data"), col("oneColumn"))
I'd like to put several columns, with names taken from the array, in place of col("oneColumn").
Answers from elsewhere can't help me, as they deal with Lists of Strings, while I already have a Column object and can't pass a collection of columns as a parameter of select().

Prepare the list of columns first (col and from_json come from org.apache.spark.sql.functions):
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, from_json}

val cols: List[Column] = headers.toList.map(name => col(name))
val cols1 = cols :+ from_json(col("value").cast("string"), schema).as("data")
and then expand the list as varargs:
.select(cols1: _*)
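For context, here is a minimal, self-contained sketch of the same technique. The DataFrame, headers, and schema below are hypothetical stand-ins for whatever you actually have:

import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical Kafka-style input: a JSON payload plus metadata columns.
val df = Seq(("""{"id":"1"}""", "topicA", "k1")).toDF("value", "topic", "key")
val headers = Array("topic", "key") // column names known only at runtime
val schema = StructType(Seq(StructField("id", StringType)))

val cols: List[Column] = headers.toList.map(name => col(name))
val cols1 = cols :+ from_json(col("value").cast("string"), schema).as("data")
df.select(cols1: _*).show() // columns: topic, key, data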

Related

List of columns for orderBy in spark dataframe

I have a list of variables that contains column names. I am trying to use that to call orderBy on a dataframe.
val l = List("COL1", "COL2")
df.orderBy(l.mkString(","))
But mkString combines the column names into one string, leading to this error -
org.apache.spark.sql.AnalysisException: cannot resolve '`COL1,COL2`' given input columns: [COL1, COL2, COL3, COL4];
How can I pass this list of strings as separate arguments so it looks for "COL1", "COL2" instead of "COL1,COL2"?
Thanks,
You can call orderBy for a specific column:
import org.apache.spark.sql.functions._
df.orderBy(asc("COL1")) // df.orderBy(asc(l.headOption.getOrElse("COL1")))
// OR
df.orderBy(desc("COL1"))
If you want to sort by multiple columns you can write something like this:
val l = List($"COL1", $"COL2".desc)
df.sort(l: _*)
Passing a single String argument tells Spark to sort the data frame using one column with the given name. There is an overload that accepts multiple column names, and you can use it this way:
val l = List("COL1", "COL2")
df.orderBy(l.head, l.tail: _*)
If you care about the sort direction (ascending vs. descending), use the Column version of orderBy instead:
val l = List($"COL1", $"COL2".desc)
df.orderBy(l: _*)
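To make the two overloads concrete, a small sketch with invented data (assumes a SparkSession named spark is in scope):

import spark.implicits._

val df = Seq(("a", 2), ("a", 1), ("b", 3)).toDF("COL1", "COL2")

// String overload: every column sorts ascending
df.orderBy("COL1", "COL2").show()

// Column overload: direction can be chosen per column
val l = List($"COL1".asc, $"COL2".desc)
df.orderBy(l: _*).show() // COL1 ascending, COL2 descending within ties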

Difference b/w val df = List(("amit","porwal")) and val df = List("amit","porwal")

What is the difference between
val df = List(("amit","porwal"))
and
val df = List("amit","porwal")
My question is how the two parentheses make a difference, because on doing
scala> val df = List(("amit","porwal")).toDF("fname","lname")
it works, but on doing
scala> val df = List("amit","porwal").toDF("fname","lname")
Scala throws an error:
java.lang.IllegalArgumentException: requirement failed:
The number of columns doesn't match. Old column names (1): value
New column names (2): fname,lname
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.sql.Dataset.toDF(Dataset.scala:393)
at org.apache.spark.sql.DatasetHolder.toDF(DatasetHolder.scala:44)
... 48 elided
Yes, they are different. The inner parentheses are treated as a tuple by the Scala compiler. Since there are two string values inside the nested parentheses in your first example, it is treated as a Tuple2[String, String]. In the second example, the string values inside the List are treated as separate String elements.
The first one, val df = List(("amit","porwal")), is a List[(String, String)]. There is only one element in df, and to get "porwal" you have to do df(0)._2.
And,
the second one, val df = List("amit","porwal"), is a List[String]. There are two elements in df, and to get "porwal" you have to do df(1).
Even though the question is not really related to Spark:
val df = List(("amit","porwal"))
Here df is a list of Tuple2, i.e. List[(String, String)]. To get the value "amit" you should use df(0)._1, and for "porwal", df(0)._2.
val df = List("amit","porwal")
Here df is simply a list of Strings, i.e. List[String].
In the case of a List[String] you can simply access the elements as df(0) and df(1).
Hope this helps!
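For illustration, a quick REPL session (names reused from the question) makes the type difference visible:

scala> val t = List(("amit", "porwal"))
t: List[(String, String)] = List((amit,porwal))

scala> val s = List("amit", "porwal")
s: List[String] = List(amit, porwal)

scala> t(0)._2
res0: String = porwal

scala> s(1)
res1: String = porwal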

How to change column type for a list of dataframe columns

I'm trying to change the type of a list of columns for a Dataframe in Spark 1.6.0.
All the examples found so far, however, only allow casting a single column (df.withColumn) or all the columns in the dataframe:
val castedDF = filteredDf.columns.foldLeft(filteredDf) {
  (df, c) => df.withColumn(c, col(c).cast("String"))
}
Is there any efficient, batch way of doing this for a list of columns in the dataframe?
There is nothing wrong with withColumn* but you can use select if you prefer:
import org.apache.spark.sql.functions.col

val columnsToCast: Set[String] = Set("colA", "colB") // placeholder names
val outputType: String = "string"

df.select(df.columns.map(c =>
  if (columnsToCast.contains(c)) col(c).cast(outputType) else col(c)
): _*)
* Execution plan will be the same for a single select as with chained withColumn.
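As a hedged, self-contained sketch of the select-based batch cast (column names invented for the demo):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("ann", 30, 9.5)).toDF("name", "age", "score")
val columnsToCast = Set("age", "score") // only these get cast
val outputType = "string"

val casted = df.select(df.columns.map(c =>
  if (columnsToCast.contains(c)) col(c).cast(outputType) else col(c)
): _*)

casted.printSchema() // name stays string; age and score become string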

Spark Select with a List of Columns Scala

I am trying to find a good way of doing a Spark select with a List[Column]. I am exploding a column, then passing back all the columns I am interested in along with my exploded column.
var columns = getColumns(x) // Returns a List[Column]
tempDf.select(columns) // trying to get a DataFrame with just these columns
Trying to find a good way of doing this. I know that if it were a list of strings I could do something like:
val result = dataframe.select(columnNames.head, columnNames.tail: _*)
For Spark 2.0 it seems that you have two options. Both depend on how you manage your columns (Strings or Columns).
Spark code (spark-sql_2.11/org/apache/spark/sql/Dataset.scala):
def select(cols: Column*): DataFrame = withPlan {
  Project(cols.map(_.named), logicalPlan)
}
def select(col: String, cols: String*): DataFrame = select((col +: cols).map(Column(_)) : _*)
You can see how Spark internally converts your head & tail into a list of Columns in order to call select again.
So, in that case, if you want clear code I would recommend:
If columns: List[String]:
import org.apache.spark.sql.functions.col
df.select(columns.map(col): _*)
Otherwise, if columns: List[Column]:
df.select(columns: _*)
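Both variants side by side in a hypothetical snippet (assumes a SparkSession named spark is in scope):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import spark.implicits._

val df = Seq((1, "x", true)).toDF("a", "b", "c")

val asStrings: List[String] = List("a", "b")
df.select(asStrings.map(col): _*).show() // convert names to Columns first

val asColumns: List[Column] = List(col("a"), col("c"))
df.select(asColumns: _*).show() // already Columns, just expand as varargs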

Scala Spark DataFrame : dataFrame.select multiple columns given a Sequence of column names

val columnName = Seq("col1", "col2", ..., "coln")
Is there a way to do a dataframe.select operation to get a dataframe containing only the columns specified?
I know I can do dataframe.select("col1","col2"...)
but columnName is generated at runtime.
I could do dataframe.select() repeatedly for each column name in a loop. Will that have any performance overhead? Is there any simpler way to accomplish this?
val columnNames = Seq("col1","col2",....."coln")
// using the string column names:
val result = dataframe.select(columnNames.head, columnNames.tail: _*)
// or, equivalently, using Column objects (col is org.apache.spark.sql.functions.col):
val result = dataframe.select(columnNames.map(c => col(c)): _*)
Since dataFrame.select() expects a sequence of columns and we have a sequence of strings, we need to map the strings to Columns and then expand the result as varargs. columnName.map(name => col(name)) gives a sequence of columns from a sequence of strings, and appending : _* lets it be passed as the parameters of select():
import org.apache.spark.sql.functions.col

val columnName = Seq("col1", "col2")
val DFFiltered = DF.select(columnName.map(name => col(name)): _*)
Alternatively, you can also write it like this, using the DataFrame's apply method to look up each column:
val columnName = Seq("col1", "col2")
val DFFiltered = DF.select(columnName.map(DF(_)): _*)
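As a sanity check, a hypothetical end-to-end run of both forms (data invented; assumes a SparkSession named spark is in scope):

import org.apache.spark.sql.functions.col
import spark.implicits._

val DF = Seq((1, 2, 3)).toDF("col1", "col2", "col3")
val columnName = Seq("col1", "col2")

DF.select(columnName.map(name => col(name)): _*).show() // via functions.col
DF.select(columnName.map(DF(_)): _*).show()             // via DataFrame.apply; same result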