Why does a UDF call in a DataFrame select not work? - scala

I have a sample dataframe as follows:
val df = Seq((Seq("abc", "cde"), 19, "red, abc"), (Seq("eefg", "efa", "efb"), 192, "efg, efz efz")).toDF("names", "age", "color")
And a user-defined function, which I want to use to replace the "color" column in df with the string's length:
def strLength(inputString: String): Long = inputString.size.toLong
I am saving the udf reference for performance as follows:
val strLengthUdf = udf(strLength _)
When I apply the UDF in a select, it works as long as I don't select any other columns:
val x = df.select(strLengthUdf(df("color")))
scala> x.show
+----------+
|UDF(color)|
+----------+
| 8|
| 12|
+----------+
But when I want to pick other columns along with the udf processed column, I get the following error:
scala> val x = df.select("age", strLengthUdf(df("color")))
<console>:27: error: overloaded method value select with alternatives:
[U1, U2](c1: org.apache.spark.sql.TypedColumn[org.apache.spark.sql.Row,U1], c2: org.apache.spark.sql.TypedColumn[org.apache.spark.sql.Row,U2])org.apache.spark.sql.Dataset[(U1, U2)] <and>
(col: String,cols: String*)org.apache.spark.sql.DataFrame <and>
(cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
cannot be applied to (String, org.apache.spark.sql.Column)
val x = df.select("age", strLengthUdf(df("color")))
^
What am I missing here?

You cannot mix Strings and Columns in a select statement.
This will work:
df.select(df("age"), strLengthUdf(df("color")))
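If you prefer not to repeat df(...), any other expression that yields a Column works as well; here is a small sketch of the usual alternatives (imports shown explicitly in case they are not already in scope, and assuming a SparkSession named spark for the implicits, as in spark-shell):
import org.apache.spark.sql.functions.col
import spark.implicits._

// Both lines are equivalent to the df("...") version above
df.select(col("age"), strLengthUdf(col("color")))
df.select($"age", strLengthUdf($"color"))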

Related

spark scala sameElements doesn't work as expected

I have a DataFrame of dates:
val df = Seq(Date.valueOf("2020-01-01"), Date.valueOf("2020-11-11"), Date.valueOf("1992-04-10")).toDF("dt")
df.show
+----------+
| dt|
+----------+
|2020-01-01|
|2020-11-11|
|1992-04-10|
+----------+
Using Spark I add two months to the dates in that DataFrame:
df.select(add_months(df("dt"), 2)).show
+-----------------+
|add_months(dt, 2)|
+-----------------+
| 2020-03-01|
| 2021-01-11|
| 1992-06-10|
+-----------------+
Then I collected the result and tried to check whether it equals the expected value (which it normally should):
val expected = Array(Row("2020-03-01"), Row("2021-01-11"), Row("1992-06-10"))
val actue = df.select(add_months(df("dt"), 2)).collect()
actue.sameElements(expected)
However, it returns false.
I also tried with just a single value; it always returns false.
scala> actue.sameElements(expected)
false
Can anyone spot what the problem is?
Method sameElements compares the elements of the two arrays with ==.
Here both expected and actue are of type Array[org.apache.spark.sql.Row], but the collected rows wrap java.sql.Date values (dt is a DateType column), while the expected rows wrap plain Strings. Row("2020-03-01") is therefore never equal to the corresponding collected row, so the comparison returns false.
Convert the values to String on both sides and then check whether they are the same. Check the code below.
scala> val expected = Array("2020-03-01", "2021-01-11", "1992-06-10")
expected: Array[String] = Array(2020-03-01, 2021-01-11, 1992-06-10)
scala> val actue = df.select(add_months(df("dt"), 2).as("dt")).as[String].collect
actue: Array[String] = Array(2020-03-01, 2021-01-11, 1992-06-10)
scala> actue.sameElements(expected)
res8: Boolean = true
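If you would rather keep comparing Rows directly, another small sketch is to build the expected rows from java.sql.Date values, since dt is a DateType column and the collected rows wrap Dates rather than Strings:
import java.sql.Date
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.add_months

val expectedRows = Array(
  Row(Date.valueOf("2020-03-01")),
  Row(Date.valueOf("2021-01-11")),
  Row(Date.valueOf("1992-06-10")))
df.select(add_months(df("dt"), 2)).collect().sameElements(expectedRows) // true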

Spark scala dataframe get value for each row and assign to variables

I have a dataframe like below:
val df=spark.sql("select * from table")
row1|row2|row3
A1,B1,C1
A2,B2,C2
A3,B3,C3
I want to iterate over it with a for loop to get the values like this:
val value1="A1"
val value2="B1"
val value3="C1"
function(value1,value2,value3)
Please help me.
You have 2 options:
Solution 1 - Your data is big, so you must stick with DataFrames. To apply a function on every row, define a UDF.
Solution 2 - Your data is small, so you can collect the data to the driver machine and then iterate with a map.
Example:
val df = Seq((1,2,3), (4,5,6)).toDF("a", "b", "c")
def sum(a: Int, b: Int, c: Int) = a+b+c
// Solution 1
import org.apache.spark.sql.Row
val myUDF = udf((r: Row) => sum(r.getAs[Int](0), r.getAs[Int](1), r.getAs[Int](2)))
df.select(myUDF(struct($"a", $"b", $"c")).as("sum")).show
//Solution 2
df.collect.map(r=> sum(r.getAs[Int](0), r.getAs[Int](1), r.getAs[Int](2)))
Output of Solution 1 (Solution 2 returns Array(6, 15)):
+---+
|sum|
+---+
| 6|
| 15|
+---+
EDIT:
val myUDF = udf((r: Row) => {
  val value1 = r.getAs[Int](0)
  val value2 = r.getAs[Int](1)
  val value3 = r.getAs[Int](2)
  myFunction(value1, value2, value3)
})
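If the data is small enough for Solution 2, the same per-row naming can be done after collect; a rough sketch using the example df above, where myFunction again stands for whatever the caller wants to run on the three values:
df.collect().foreach { r =>
  val value1 = r.getAs[Int](0)
  val value2 = r.getAs[Int](1)
  val value3 = r.getAs[Int](2)
  myFunction(value1, value2, value3) // placeholder, as in the EDIT above
}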

Selecting several columns from spark dataframe with a list of columns as a start

Assuming that I have a list of spark columns and a spark dataframe df, what is the appropriate snippet of code in order to select a subdataframe containing only the columns in the list?
Something similar to maybe:
var needed_columns: List[Column] = List[Column](new Column("a"), new Column("b"))
df(needed_columns)
I wanted to get the column names and then select them using the following line of code.
Unfortunately, I could not get it to work:
df.select(needed_columns.head.as(String),needed_columns.tail: _*)
Your needed_columns is of type List[Column], hence you can simply use needed_columns: _* as the arguments for select:
val df = Seq((1, "x", 10.0), (2, "y", 20.0)).toDF("a", "b", "c")
import org.apache.spark.sql.Column
val needed_columns: List[Column] = List(new Column("a"), new Column("b"))
df.select(needed_columns: _*)
// +---+---+
// | a| b|
// +---+---+
// | 1| x|
// | 2| y|
// +---+---+
Note that select takes two types of arguments:
def select(cols: Column*): DataFrame
def select(col: String, cols: String*): DataFrame
If you have a list of column names of String type, you can use the latter select:
val needed_col_names: List[String] = List("a", "b")
df.select(needed_col_names.head, needed_col_names.tail: _*)
Or, you can map the list of Strings to Columns to use the former select:
df.select(needed_col_names.map(col): _*)
I understand that you want to select only the columns given in a separate list (A), rather than all of the dataframe's columns. Below is an example where I select the first name and last name using a separate list. Check this out:
scala> val df = Seq((101,"Jack", "wright" , 27, "01976", "US")).toDF("id","fname","lname","age","zip","country")
df: org.apache.spark.sql.DataFrame = [id: int, fname: string ... 4 more fields]
scala> df.columns
res20: Array[String] = Array(id, fname, lname, age, zip, country)
scala> val needed =Seq("fname","lname")
needed: Seq[String] = List(fname, lname)
scala> val needed_df = needed.map( x=> col(x) )
needed_df: Seq[org.apache.spark.sql.Column] = List(fname, lname)
scala> df.select(needed_df:_*).show(false)
+-----+------+
|fname|lname |
+-----+------+
|Jack |wright|
+-----+------+
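If the list can contain names that do not exist in the DataFrame, a small additional sketch (reusing df and needed from above) is to intersect it with df.columns first, so that select does not fail on a missing column:
import org.apache.spark.sql.functions.col

// Keep only the names that are actually present in df before selecting
val safeCols = needed.filter(df.columns.contains(_)).map(col)
df.select(safeCols: _*).show(false)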

Value and Column operations in Scala Spark: how do I use a value on the left of an operator with a Spark Column?

I am trying to do some basic operations with Columns and Doubles and I can't figure out how to do it without creating a UDF.
scala> import org.apache.spark.sql.functions.col
scala> import spark.implicits._
scala> val df = Seq(("A", 1), ("B", 2), ("C", 3)).toDF("col1", "col2")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: int]
I want to find the reciprocal of col2; I would think doing that would look something like:
scala> df.withColumn("col3", 1/col("col2")).show
But that gives this error:
<console>:30: error: overloaded method value / with alternatives:
(x: Double)Double <and>
(x: Float)Float <and>
(x: Long)Long <and>
(x: Int)Int <and>
(x: Char)Int <and>
(x: Short)Int <and>
(x: Byte)Int
cannot be applied to (org.apache.spark.sql.Column)
df.withColumn("col3", 1/col("col2")).show
Basically it is saying that you can't perform division (or use any other operator) with a numeric value on the left-hand side and a Column on the right. The only way I have figured out how to do this is to create a UDF and apply it like this:
scala> def reciprocal(x: Double) : Double = {1/x}
reciprocal: (x: Double)Double
scala> val reciprocalUDF = spark.sqlContext.udf.register(
"reciprocalUDF", reciprocal _)
scala> df.withColumn("col3", reciprocalUDF(col("col2"))).show
+----+----+------------------+
|col1|col2| col3|
+----+----+------------------+
| A| 1| 1.0|
| B| 2| 0.5|
| C| 3|0.3333333333333333|
+----+----+------------------+
But really? Are UDFs the only way to do this sort of thing? I don't want to create a UDF every time I have to do some simple operation like division.
Use a literal Column:
import org.apache.spark.sql.functions.lit
lit(1) / col("col2")
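A minimal sketch of the full expression, reusing the df from the question: lit(1) wraps the constant in a Column, so the division resolves to Column's / operator and no UDF is needed.
import org.apache.spark.sql.functions.{col, lit}

df.withColumn("col3", lit(1) / col("col2")).show
// produces the same col3 values as the reciprocalUDF version above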

Spark: replace all NaNs with null in the DataFrame API

I have a dataframe with many double (and/or float) columns, which do contain NaNs. I want to replace all NaNs (i.e. Float.NaN and Double.NaN) with null.
I can do this for a single column x, e.g.:
val newDf = df.withColumn("x", when($"x".isNaN,lit(null)).otherwise($"x"))
This works, but I'd like to do it for all columns at once. I recently discovered DataFrameNaFunctions (df.na) and its fill method, which sounds like exactly what I need. Unfortunately I failed to do the above with it. fill should replace all NaNs and nulls with a given value, so I do:
df.na.fill(null.asInstanceOf[java.lang.Double]).show
which gives me a NullPointerException.
There is also a promising replace method, but I can't even compile the code:
df.na.replace("x", Map(java.lang.Double.NaN -> null.asInstanceOf[java.lang.Double])).show
Strangely, this gives me:
Error:(57, 34) type mismatch;
found : scala.collection.immutable.Map[scala.Double,java.lang.Double]
required: Map[Any,Any]
Note: Double <: Any, but trait Map is invariant in type A.
You may wish to investigate a wildcard type such as `_ <: Any`. (SLS 3.2.10)
df.na.replace("x", Map(java.lang.Double.NaN -> null.asInstanceOf[java.lang.Double])).show
To replace all NaNs with null in Spark, you just have to create a Map of replacement values for every column, like this:
val map = df.columns.map((_, "null")).toMap
Then you can use fill to replace NaN(s) with null values:
df.na.fill(map)
For Example:
scala> val df = List((Float.NaN, Double.NaN), (1f, 0d)).toDF("x", "y")
df: org.apache.spark.sql.DataFrame = [x: float, y: double]
scala> df.show
+---+---+
| x| y|
+---+---+
|NaN|NaN|
|1.0|0.0|
+---+---+
scala> val map = df.columns.map((_, "null")).toMap
map: scala.collection.immutable.Map[String,String] = Map(x -> null, y -> null)
scala> df.na.fill(map).printSchema
root
|-- x: float (nullable = true)
|-- y: double (nullable = true)
scala> df.na.fill(map).show
+----+----+
| x| y|
+----+----+
|null|null|
| 1.0| 0.0|
+----+----+
I hope this helps!
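If you prefer to avoid the string-valued fill map, another sketch is to fold the single-column when/isNaN pattern from the question over all columns (this assumes every column is of float or double type, as in the example, since isNaN only applies to those types):
import org.apache.spark.sql.functions.{col, lit, when}

val noNaNs = df.columns.foldLeft(df) { (acc, c) =>
  acc.withColumn(c, when(col(c).isNaN, lit(null)).otherwise(col(c)))
}
noNaNs.show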
To replace all NaNs with a given value in a Spark DataFrame using the PySpark API, you can do the following (replace_by_value and the column names are placeholders):
col_list = ["column1", "column2"]
df = df.na.fill(replace_by_value, col_list)