spark scala sameElements doesn't work as expected - scala

I have a DataFrame of dates:
val df = Seq(Date.valueOf("2020-01-01"), Date.valueOf("2020-11-11"), Date.valueOf("1992-04-10")).toDF("dt")
df.show
+----------+
| dt|
+----------+
|2020-01-01|
|2020-11-11|
|1992-04-10|
+----------+
Using Spark I add two months to that DataFrame:
df.select(add_months(df("dt"), 2)).show
+-----------------+
|add_months(dt, 2)|
+-----------------+
| 2020-03-01|
| 2021-01-11|
| 1992-06-10|
+-----------------+
Then I collect the result and check whether it equals the expected value (which it normally should):
val expected = Array(Row("2020-03-01"), Row("2021-01-11"), Row("1992-06-10"))
val actue = df.select(add_months(df("dt"), 2)).collect()
actue.sameElements(expected)
However, it returns false.
I also tried with just a single value; it always returns false.
scala> actue.sameElements(expected)
false
Can anyone spot what the problem is?

Method sameElements
def sameElements[B >: A](that: scala.collection.GenIterable[B]): Boolean
sameElements compares the two collections element by element with ==.
expected and actue are both of type Array[org.apache.spark.sql.Row], but the Rows collected from the DataFrame wrap java.sql.Date values (that is what collect() returns for a date column), while the Rows in expected wrap plain Strings, so no pair of elements compares equal and the result is false.
Convert the values to String on both sides and then compare. Check the code below.
scala> val expected = Array("2020-03-01", "2021-01-11", "1992-06-10")
expected: Array[String] = Array(2020-03-01, 2021-01-11, 1992-06-10)
scala> val actue = df.select(add_months(df("dt"), 2).as("dt")).as[String].collect
actue: Array[String] = Array(2020-03-01, 2021-01-11, 1992-06-10)
scala> actue.sameElements(expected)
res8: Boolean = true
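Alternatively, if you want to keep comparing Row objects, the expected array has to wrap java.sql.Date values rather than Strings, because that is what collect() returns for a date column. A minimal sketch, assuming the same df as in the question:
import java.sql.Date
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.add_months

// Build expected Rows holding java.sql.Date values, matching what collect() returns
val expectedRows = Array(
  Row(Date.valueOf("2020-03-01")),
  Row(Date.valueOf("2021-01-11")),
  Row(Date.valueOf("1992-06-10")))

val actualRows = df.select(add_months(df("dt"), 2)).collect()

// Row equality is element-wise, so this should now be true
actualRows.sameElements(expectedRows)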

Related

Convert column containing values as List to Array

I have a Spark DataFrame as below:
+------------------------------------------------------------------------+
| domains |
+------------------------------------------------------------------------+
|["0b3642ab5be98c852890aff03b3f83d8","4d7a5a24426749f3f17dee69e13194a9", |
| "9d0f74269019ad82ae82cc7a7f2b5d1b","0b113db8e20b2985d879a7aaa43cecf6", |
| "d095db19bd909c1deb26e0a902d5ad92","f038deb6ade0f800dfcd3138d82ae9a9", |
| "ab192f73b9db26ec2aca2b776c4398d2","ff9cf0599ae553d227e3f1078957a5d3", |
| "aa717380213450746a656fe4ff4e4072","f3346928db1c6be0682eb9307e2edf38", |
| "806a006b5e0d220c2cf714789828ecf7","9f6f8502e71c325f2a6f332a76d4bebf", |
| "c0cb38016fb603e89b160e921eced896","56ad547c6292c92773963d6e6e7d5e39"] |
+------------------------------------------------------------------------+
It contains a column whose values are arrays. I want to convert it into an Array[String], e.g.:
Array("0b3642ab5be98c852890aff03b3f83d8","4d7a5a24426749f3f17dee69e13194a9", "9d0f74269019ad82ae82cc7a7f2b5d1b","0b113db8e20b2985d879a7aaa43cecf6", "d095db19bd909c1deb26e0a902d5ad92","f038deb6ade0f800dfcd3138d82ae9a9",
"ab192f73b9db26ec2aca2b776c4398d2","ff9cf0599ae553d227e3f1078957a5d3",
"aa717380213450746a656fe4ff4e4072","f3346928db1c6be0682eb9307e2edf38",
"806a006b5e0d220c2cf714789828ecf7","9f6f8502e71c325f2a6f332a76d4bebf",
"c0cb38016fb603e89b160e921eced896","56ad547c6292c92773963d6e6e7d5e39")
I tried the following code but I am not getting the intended results:
DF.select("domains").as[String].collect()
Instead I get this:
[Ljava.lang.String;@7535f28 ...
Any ideas how I can achieve this?
You can first explode your domains column before collecting it, as follows:
import org.apache.spark.sql.functions.{col, explode}
val result: Array[String] = DF.select(explode(col("domains"))).as[String].collect()
You can then print your result array using the mkString method:
println(result.mkString("[", ", ", "]"))
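If you prefer not to explode, another option (a sketch, assuming the domains column is of type array<string>) is to read the column back as a Dataset[Seq[String]] and flatten the collected rows:
import spark.implicits._

// Collect each row as a Seq[String] and flatten into a single Array[String]
val result: Array[String] = DF.select("domains").as[Seq[String]].collect().flatten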
You are in fact getting the Array[String] as expected.
[Ljava.lang.String;@7535f28 --> this is the JVM's internal type descriptor used in bytecode: [ represents an array and Ljava.lang.String represents the class java.lang.String.
If you want to print the array values as a string, you can use the .mkString() function.
import spark.implicits._
val data = Seq((Seq("0b3642ab5be98c852890aff03b3f83d8","4d7a5a24426749f3f17dee69e13194a9", "9d0f74269019ad82ae82cc7a7f2b5d1b","0b113db8e20b2985d879a7aaa43cecf6", "d095db19bd909c1deb26e0a902d5ad92","f038deb6ade0f800dfcd3138d82ae9a9")))
val df = spark.sparkContext.parallelize(data).toDF("domains")
// df: org.apache.spark.sql.DataFrame = [domains: array<string>]
val array_values = df.select("domains").as[String].collect()
// array_values: Array[String] = Array([0b3642ab5be98c852890aff03b3f83d8, 4d7a5a24426749f3f17dee69e13194a9, 9d0f74269019ad82ae82cc7a7f2b5d1b, 0b113db8e20b2985d879a7aaa43cecf6, d095db19bd909c1deb26e0a902d5ad92, f038deb6ade0f800dfcd3138d82ae9a9])
val string_value = array_values.mkString(",")
print(string_value)
// [0b3642ab5be98c852890aff03b3f83d8, 4d7a5a24426749f3f17dee69e13194a9, 9d0f74269019ad82ae82cc7a7f2b5d1b, 0b113db8e20b2985d879a7aaa43cecf6, d095db19bd909c1deb26e0a902d5ad92, f038deb6ade0f800dfcd3138d82ae9a9]
You can see the same behavior if you create a plain array:
scala> val array_values : Array[String] = Array("value1", "value2")
array_values: Array[String] = Array(value1, value2)
scala> print(array_values)
[Ljava.lang.String;@70bf2681
scala> array_values.foreach(println)
value1
value2

Pass arguments to a udf from columns present in a list of strings

I have a list of strings which represent column names inside a DataFrame.
I want to pass the values of these columns as arguments to a UDF. How can I do it in Spark Scala?
val actualDF = Seq(
("beatles", "help|hey jude","sad",4),
("romeo", "eres mia","old school",56)
).toDF("name", "hit_songs","genre","xyz")
val column_list: List[String] = List("hit_songs","name","genre")
// example udf
val testudf = org.apache.spark.sql.functions.udf((s1: String, s2: String) => {
// let's say I want to concatenate all the values
s1 + s2
})
val finalDF = actualDF.withColumn("test_res",testudf(col(column_list(0))))
From the above example, I want to pass my list column_list to a UDF. I am not sure how to pass a complete list of strings representing column names, though for a single element I saw I can do it with col(column_list(0)). Please advise.
Replace
testudf(col(column_list(0)))
with
testudf(column_list.map(col): _*)
This expands the list into individual Column arguments (the number of columns must match the number of parameters the UDF declares).
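If the number of columns can vary, a common alternative (a sketch, not part of the original answer) is to pack the columns into a single array column and have the UDF accept a Seq[String]:
import org.apache.spark.sql.functions.{array, col, udf}

// UDF that concatenates an arbitrary number of string values
val concatUdf = udf((values: Seq[String]) => values.mkString)

// Wrap every column named in column_list into one array column and pass it to the UDF
val finalDF = actualDF.withColumn(
  "test_res",
  concatUdf(array(column_list.map(col): _*)))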
hit_songs is of type Seq[String], so you need to change the first parameter of your UDF to Seq[String].
scala> singersDF.show(false)
+-------+-------------+----------+
|name |hit_songs |genre |
+-------+-------------+----------+
|beatles|help|hey jude|sad |
|romeo |eres mia |old school|
+-------+-------------+----------+
scala> actualDF.show(false)
+-------+----------------+----------+
|name |hit_songs |genre |
+-------+----------------+----------+
|beatles|[help, hey jude]|sad |
|romeo |[eres mia] |old school|
+-------+----------------+----------+
scala> column_list
res27: List[String] = List(hit_songs, name)
Change your UDF as shown below.
// s1 is of type Seq[String]
val testudf = udf((s1:Seq[String],s2:String) => {
s1.mkString.concat(s2)
})
Applying the UDF:
scala> actualDF
.withColumn("test_res",testudf(col(column_list.head),col(column_list.last)))
.show(false)
+-------+----------------+----------+-------------------+
|name |hit_songs |genre |test_res |
+-------+----------------+----------+-------------------+
|beatles|[help, hey jude]|sad |helphey judebeatles|
|romeo |[eres mia] |old school|eres miaromeo |
+-------+----------------+----------+-------------------+
Without UDF
scala> actualDF.withColumn("test_res",concat_ws("",$"name",$"hit_songs")).show(false) // Without UDF.
+-------+----------------+----------+-------------------+
|name |hit_songs |genre |test_res |
+-------+----------------+----------+-------------------+
|beatles|[help, hey jude]|sad |beatleshelphey jude|
|romeo |[eres mia] |old school|romeoeres mia |
+-------+----------------+----------+-------------------+
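The concat_ws variant also generalizes to the whole list of column names (a sketch, assuming every column in column_list is a string or an array of strings):
import org.apache.spark.sql.functions.{col, concat_ws}

// Concatenate all columns named in column_list without a UDF
actualDF.withColumn("test_res", concat_ws("", column_list.map(col): _*)).show(false)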

Iterate boolean comparison over two DataFrames?

I have created two lists, one with generic numbers and another with specific numbers. I want to iterate over the first list and compare it to the second: if genericList(x) is equal to any number in the specific list, I want True returned, and if not, False.
I have tried to utilize an if loop, something similar to for (num <- list) print(list) if ...
scala> val genericList = List(5,6,7,8,9,10)
scala> val df = genericList.toDF
scala> val specificList = List(5,-3,8)
Try the .exists and .contains functions to check whether the number is present:
scala> val genericList = List(5,6,7,8,9,10)
scala> val specificList = List(5,-3,8)
scala> genericList.exists(specificList.contains)
res1: Boolean = true
In the DataFrame API:
scala> val genericList = List(5,6,7,8,9,10)
scala> val df = genericList.toDF
scala> val specificList = List(5,-3,8)
scala> df.withColumn("check",'value.isin(specificList:_*)).show()
+-----+-----+
|value|check|
+-----+-----+
| 5| true|
| 6|false|
| 7|false|
| 8| true|
| 9|false|
| 10|false|
+-----+-----+
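If you only need a single Boolean (does any value in the DataFrame appear in specificList), one possible sketch is to filter with isin and check the count:
import spark.implicits._

// true if at least one value of df is contained in specificList
val anyMatch: Boolean = df.filter($"value".isin(specificList: _*)).count() > 0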

Why udf call in a dataframe select does not work?

I have a sample DataFrame as follows:
val df = Seq((Seq("abc", "cde"), 19, "red, abc"), (Seq("eefg", "efa", "efb"), 192, "efg, efz efz")).toDF("names", "age", "color")
And a user-defined function as follows, which replaces the "color" column of df with its string length:
def strLength(inputString: String): Long = inputString.size.toLong
I am saving the UDF reference for performance as follows:
val strLengthUdf = udf(strLength _)
When I apply the UDF in a select, it works if I don't include any other column names:
val x = df.select(strLengthUdf(df("color")))
scala> x.show
+----------+
|UDF(color)|
+----------+
| 8|
| 12|
+----------+
But when I want to pick other columns along with the UDF-processed column, I get the following error:
scala> val x = df.select("age", strLengthUdf(df("color")))
<console>:27: error: overloaded method value select with alternatives:
[U1, U2](c1: org.apache.spark.sql.TypedColumn[org.apache.spark.sql.Row,U1], c2: org.apache.spark.sql.TypedColumn[org.apache.spark.sql.Row,U2])org.apache.spark.sql.Dataset[(U1, U2)] <and>
(col: String,cols: String*)org.apache.spark.sql.DataFrame <and>
(cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
cannot be applied to (String, org.apache.spark.sql.Column)
val x = df.select("age", strLengthUdf(df("color")))
^
What am I missing here in val x = df.select("age", strLengthUdf(df("color")))?
You cannot mix Strings and Columns in a select statement.
This will work:
df.select(df("age"), strLengthUdf(df("color")))
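Equivalently, you can reference the other columns with col or the $ shorthand so that every argument is a Column and the Column* overload of select applies (a sketch, assuming spark.implicits._ is in scope for $):
import org.apache.spark.sql.functions.col

// All arguments are Columns, so select(cols: Column*) is chosen
val x1 = df.select(col("age"), strLengthUdf(col("color")))
val x2 = df.select($"age", strLengthUdf($"color"))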

Spark replace all NaNs to null in DataFrame API

I have a DataFrame with many double (and/or float) columns which contain NaNs. I want to replace all NaNs (i.e. Float.NaN and Double.NaN) with null.
I can do this for a single column x, e.g.:
val newDf = df.withColumn("x", when($"x".isNaN,lit(null)).otherwise($"x"))
This works, but I'd like to do it for all columns at once. I recently discovered DataFrameNaFunctions (df.na) and its fill method, which sounds like exactly what I need. Unfortunately I failed to achieve the above. fill should replace all NaNs and nulls with a given value, so I do:
df.na.fill(null.asInstanceOf[java.lang.Double]).show
which gives me a NullPointerException.
There is also a promising replace method, but I can't even compile the code:
df.na.replace("x", Map(java.lang.Double.NaN -> null.asInstanceOf[java.lang.Double])).show
Strangely, this gives me:
Error:(57, 34) type mismatch;
found : scala.collection.immutable.Map[scala.Double,java.lang.Double]
required: Map[Any,Any]
Note: Double <: Any, but trait Map is invariant in type A.
You may wish to investigate a wildcard type such as `_ <: Any`. (SLS 3.2.10)
df.na.replace("x", Map(java.lang.Double.NaN -> null.asInstanceOf[java.lang.Double])).show
To replace all NaNs with null in Spark you just have to create a Map of replacement values for every column, like this:
val map = df.columns.map((_, "null")).toMap
Then you can use fill to replace NaN(s) with null values:
df.na.fill(map)
For Example:
scala> val df = List((Float.NaN, Double.NaN), (1f, 0d)).toDF("x", "y")
df: org.apache.spark.sql.DataFrame = [x: float, y: double]
scala> df.show
+---+---+
| x| y|
+---+---+
|NaN|NaN|
|1.0|0.0|
+---+---+
scala> val map = df.columns.map((_, "null")).toMap
map: scala.collection.immutable.Map[String,String] = Map(x -> null, y -> null)
scala> df.na.fill(map).printSchema
root
|-- x: float (nullable = true)
|-- y: double (nullable = true)
scala> df.na.fill(map).show
+----+----+
| x| y|
+----+----+
|null|null|
| 1.0| 0.0|
+----+----+
I hope this helps!
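Another option, as a sketch, is to generalize the single-column when/isNaN approach from the question over all float and double columns:
import org.apache.spark.sql.functions.{col, lit, when}
import org.apache.spark.sql.types.{DoubleType, FloatType}

// Names of all float/double columns in the DataFrame
val numericCols = df.schema.fields
  .filter(f => f.dataType == DoubleType || f.dataType == FloatType)
  .map(_.name)

// Replace NaN with null in each of those columns, leaving other values untouched
val newDf = numericCols.foldLeft(df) { (acc, c) =>
  acc.withColumn(c, when(col(c).isNaN, lit(null)).otherwise(col(c)))
}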
To replace all NaNs with a given value in a Spark DataFrame using the PySpark API, you can do the following:
col_list = ["column1", "column2"]
df = df.na.fill(replace_by_value, col_list)