Spark UDF input with a List of columns - scala

I have a dataframe with N columns, and I want to create a new column containing the number of columns that have a NULL value. I tried to create a UDF, but it's not working because I can't pass it an array of parameters.
val simpleData = Seq(
("row1", "NULL" , "NULL" , "NULL" , "NULL" , "NULL", "1"),
("row2", "1", "NULL", "2023", "NULL", "01", "NULL"))
val myDs = simpleData.toDF("row", "field1", "field2", "field3", "field4", "field5", "field6")
myDs.show()
val windowcols = myDs.columns.filterNot(List("row").contains(_))
def countNullsUDF: UserDefinedFunction = udf { (values: List[String]) =>
values.filter( value => value == "NULL").length
}
myDs.withColumn("columnsWithNull", countNullsUDF(windowcols)).show(10, false)
Is it possible to pass it an Array of columns or something similar? I couldn't get it to work.

Actually, what you did is almost correct. You can't pass a list of columns to a UDF; however, you can group all the columns into one array column and then pass that array column to your UDF:
import org.apache.spark.sql.functions.{array, col}
// ...
myDs.withColumn(
"columnsWithNull",
countNullsUDF(array(windowcols.map(col): _*))
).show(10, false)
+----+------+------+------+------+------+------+---------------+
|row |field1|field2|field3|field4|field5|field6|columnsWithNull|
+----+------+------+------+------+------+------+---------------+
|row1|NULL  |NULL  |NULL  |NULL  |NULL  |1     |5              |
|row2|1     |NULL  |2023  |NULL  |01    |NULL  |3              |
+----+------+------+------+------+------+------+---------------+
I only needed to transform your list of column names into a list of columns with .map(col), and to use : _* to expand that list, since array takes a variable number of parameters.
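As a side note (not from the original answer), the same count can be computed without a UDF at all by summing per-column flags built with when. A minimal sketch, assuming the markers really are the literal string "NULL" as in the sample data:

import org.apache.spark.sql.functions.{col, when}

// Build a 0/1 flag per column and add the flags up; no UDF needed.
val nullFlags = windowcols.map(c => when(col(c) === "NULL", 1).otherwise(0))
myDs.withColumn("columnsWithNull", nullFlags.reduce(_ + _)).show(10, false)

Keeping the logic in built-in expressions lets Catalyst optimise it, whereas a UDF is a black box to the optimiser.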

Related

Spark Scala UDF to count number of array elements contained in another string column

I have a Spark dataframe df with 2 columns, say A and B, where A is of array-of-string type and B is a string.
For each row, I am trying to count how many elements of A are contained in B. The UDF I have written is as follows. I thought it would be easy, but it breaks down in the subsequent action step.
val hasAddressInUDF = udf{(s: String, t: Array[String]) => t.filter(word => s.contains(word)).size}
Could anyone help? Thanks.
You should be able to use the Seq.count method for this one: for each row, you can count how many elements of the sequence are substrings of your B column.
import org.apache.spark.sql.functions.udf
import spark.implicits._
val df = Seq(
(Seq("potato", "meat", "car"), "potatoes with meat"),
(Seq("ice", "cream", "candy"), "pasta with cream"),
(Seq("crackers", "with", "hummus"), "tasty food")
).toDF("A", "B")
df.show(false)
+------------------------+------------------+
|A                       |B                 |
+------------------------+------------------+
|[potato, meat, car]     |potatoes with meat|
|[ice, cream, candy]     |pasta with cream  |
|[crackers, with, hummus]|tasty food        |
+------------------------+------------------+
def count_in_seq = udf((A: Seq[String], B: String) => A.count(elem => B.contains(elem)))
df.withColumn("counts", count_in_seq($"A", $"B")).show(false)
+------------------------+------------------+------+
|A                       |B                 |counts|
+------------------------+------------------+------+
|[potato, meat, car]     |potatoes with meat|2     |
|[ice, cream, candy]     |pasta with cream  |1     |
|[crackers, with, hummus]|tasty food        |0     |
+------------------------+------------------+------+
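As for why the original UDF broke at the action step: a likely cause (an assumption, since the error message isn't shown) is that Spark passes ArrayType columns into Scala UDFs as a Seq (a WrappedArray), not as an Array. A minimal sketch of that fix, keeping your original logic:

// Declare the array parameter as Seq[String] instead of Array[String].
val hasAddressInUDF = udf { (s: String, t: Seq[String]) => t.count(word => s.contains(word)) }
df.withColumn("counts", hasAddressInUDF($"B", $"A")).show(false)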
Hope this helps!

How to count a field with a condition in Spark

I have a dataframe with an enum field named A (values are 0 or 1) and another field B. I would like to implement the scenario below:
if `B` is null:
  count(when `A` is 0) and set a column name `xx`
  count(when `A` is 1) and set a column name `yy`
if `B` is not null:
  count(when `A` is 0) and set a column name `zz`
  count(when `A` is 1) and set a column name `mm`
How can I do this with Spark Scala?
It's possible to conditionally populate columns in this way; however, the final output DataFrame still needs a fixed, expected schema.
Assuming all of the scenarios you detailed are possible in one DataFrame, I would suggest creating each of the four columns: "xx", "yy", "zz" and "mm" and conditionally populating them.
In the below example I've populated the values with either "found" or "", primarily to make it easy to see where the values are populated. Using true and false here, or another enum, would likely make more sense in the real world.
Starting with a DataFrame (since you didn't specify the type of "B", I have gone for an Option[String], i.e. nullable, for this example):
import spark.implicits._

val df = List(
(0, None),
(1, None),
(0, Some("hello")),
(1, Some("world"))
).toDF("A", "B")
df.show(false)
gives:
+---+-----+
|A  |B    |
+---+-----+
|0  |null |
|1  |null |
|0  |hello|
|1  |world|
+---+-----+
and to create the columns:
import org.apache.spark.sql.functions.{col, when}

df
.withColumn("xx", when(col("B").isNull && col("A") === 0, "found").otherwise(""))
.withColumn("yy", when(col("B").isNull && col("A") === 1, "found").otherwise(""))
.withColumn("zz", when(col("B").isNotNull && col("A") === 0, "found").otherwise(""))
.withColumn("mm", when(col("B").isNotNull && col("A") === 1, "found").otherwise(""))
.show(false)
gives:
+---+-----+-----+-----+-----+-----+
|A  |B    |xx   |yy   |zz   |mm   |
+---+-----+-----+-----+-----+-----+
|0  |null |found|     |     |     |
|1  |null |     |found|     |     |
|0  |hello|     |     |found|     |
|1  |world|     |     |     |found|
+---+-----+-----+-----+-----+-----+
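If what you actually need are aggregated counts rather than per-row flags, the same conditions can be fed into an aggregation. A sketch (not part of the answer above), producing a single row with the four counts:

import org.apache.spark.sql.functions.{col, sum, when}

df.agg(
  sum(when(col("B").isNull && col("A") === 0, 1).otherwise(0)).as("xx"),
  sum(when(col("B").isNull && col("A") === 1, 1).otherwise(0)).as("yy"),
  sum(when(col("B").isNotNull && col("A") === 0, 1).otherwise(0)).as("zz"),
  sum(when(col("B").isNotNull && col("A") === 1, 1).otherwise(0)).as("mm")
).show(false)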

Replace the value of one column from another column in spark dataframe

I have a dataframe like below
+---+------------+----------------------------------------------------------------------+
|id |indexes |arrayString |
+---+------------+----------------------------------------------------------------------+
|2 |1,3 |[WrappedArray(3, Str3), WrappedArray(1, Str1)] |
|1 |2,4,3 |[WrappedArray(2, Str2), WrappedArray(3, Str3), WrappedArray(4, Str4)] |
|0 |1,2,3 |[WrappedArray(1, Str1), WrappedArray(2, Str2), WrappedArray(3, Str3)] |
+---+------------+----------------------------------------------------------------------+
I want to loop through arrayString, taking the first element as the index and the second element as the string, and then replace each index in indexes with the string corresponding to that index in arrayString. I want an output like below.
+---+--------------+
|id |replacedString|
+---+--------------+
|2  |Str1,Str3     |
|1  |Str2,Str4,Str3|
|0  |Str1,Str2,Str3|
+---+--------------+
I tried using the UDF below.
val replaceIndex = udf((itemIndex: String, arrayString: Seq[Seq[String]]) => {
val itemIndexArray = itemIndex.split("\\,")
arrayString.map(i => {
itemIndexArray.updated(i(0).toInt,i(1))
})
itemIndexArray
})
This is giving me an error, and I am not getting my desired output. Is there any other way to achieve this? I can't use explode and join, as I want the indexes replaced with strings without losing the order.
You can create a UDF as below to get the required result: convert the array of arrays to a Map and look up each index as a key in that map.
val replaceIndex = udf((itemIndex: String, arrayString: Seq[Seq[String]]) => {
val indexList = itemIndex.split("\\,")
val array = arrayString.map(x => (x(0) -> x(1))).toMap
indexList map array mkString ","
})
dataframe.withColumn("arrayString", replaceIndex($"indexes", $"arrayString"))
.show(false)
Output:
+---+-------+--------------+
|id |indexes|arrayString   |
+---+-------+--------------+
|2  |1,3    |Str1,Str3     |
|1  |2,4,3  |Str2,Str4,Str3|
|0  |1,2,3  |Str1,Str2,Str3|
+---+-------+--------------+
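One caveat (not covered in the original answer): indexList map array throws a NoSuchElementException if an index has no matching entry in arrayString. A hedged sketch of a more defensive variant that falls back to the raw index:

val replaceIndexSafe = udf((itemIndex: String, arrayString: Seq[Seq[String]]) => {
  // Build the index -> string lookup, then fall back to the raw index when a key is missing.
  val lookup = arrayString.map(x => x(0) -> x(1)).toMap
  itemIndex.split(",").map(i => lookup.getOrElse(i, i)).mkString(",")
})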
Hope this helps!

Convert Array of String column to multiple columns in spark scala

I have a dataframe with following schema:
id : int,
emp_details: Array(String)
Some sample data:
1, Array(empname=xxx,city=yyy,zip=12345)
2, Array(empname=bbb,city=bbb,zip=22345)
This data is in a dataframe, and I need to read emp_details from the array and assign its values to new columns as below, or alternatively split this array into multiple columns with the column names empname, city and zip:
.withColumn("empname", xxx)
.withColumn("city", yyy)
.withColumn("zip", 12345)
Could you please guide me on how to achieve this using Spark 1.6 with Scala?
I really appreciate your help.
Thanks a lot.
You can use withColumn and split to get the required data:
df1.withColumn("empname", split($"emp_details" (0), "=")(1))
.withColumn("city", split($"emp_details" (1), "=")(1))
.withColumn("zip", split($"emp_details" (2), "=")(1))
Output:
+---+----------------------------------+-------+----+-----+
|id |emp_details                       |empname|city|zip  |
+---+----------------------------------+-------+----+-----+
|1  |[empname=xxx, city=yyy, zip=12345]|xxx    |yyy |12345|
|2  |[empname=bbb, city=bbb, zip=22345]|bbb    |bbb |22345|
+---+----------------------------------+-------+----+-----+
UPDATE:
If you don't have a fixed order of the data in the array, then you can use a UDF to convert it to a map and use it as follows:
val getColumnsUDF = udf((details: Seq[String]) => {
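// Split each "key=value" entry and collect the pairs into a key -> value lookup map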
val detailsMap = details.map(_.split("=")).map(x => (x(0), x(1))).toMap
(detailsMap("empname"), detailsMap("city"),detailsMap("zip"))
})
Now use the UDF:
df1.withColumn("emp",getColumnsUDF($"emp_details"))
.select($"id", $"emp._1".as("empname"), $"emp._2".as("city"), $"emp._3".as("zip"))
.show(false)
Output:
+---+-------+----+-----+
|id |empname|city|zip  |
+---+-------+----+-----+
|1  |xxx    |yyy |12345|
|2  |bbb    |bbb |22345|
+---+-------+----+-----+
Hope this helps!

How to selectively return multiple rows from one row in Scala

There is a DataFrame, "rawDF", and its columns are:
time     |id1|id2|...|avg_value|max_value|min_value|std_value|range_value|...
10/1/2015|1  |3  |...|0.0      |0.2      |null     |null     |null       |...
10/2/2015|2  |3  |...|null     |null     |0.3      |0.4      |null       |...
10/3/2015|3  |5  |...|null     |null     |null     |0.4      |0.5        |...
For each row, I'd like to return multiple rows based on these five "values" (avg, max, min, std, range). But if a value is null, I'd like to skip it.
So, the output should be:
10/1/2015|1 |3 |...|0.0
10/1/2015|1 |3 |...|0.2
10/2/2015|2 |3 |...|0.3
10/2/2015|2 |3 |...|0.4
10/3/2015|3 |5 |...|0.4
10/3/2015|3 |5 |...|0.5
I'm not very familiar with Scala, so I'm struggling with this.
val procRDD = rawDF.flatMap( x => for(valInd <-10 to 14) yield {
if(x.get(valInd) != null) { ...)) }
} )
This code still includes the null entries in the result.
So, can you give me some ideas?
It is a slightly strange requirement, but as long as you don't need information about the source column and all values are of the same type, you can simply explode and drop nulls:
import org.apache.spark.sql.functions.{array, col, explode}
val toExpand = Seq(
"avg_value", "max_value", "min_value", "std_value", "range_value"
)
// Collect *_value columns into a single Array and explode
val expanded = df.withColumn("value", explode(array(toExpand.map(col): _*)))
val result = toExpand
.foldLeft(expanded)((df, c) => df.drop(c)) // Drop obsolete columns
.na.drop(Seq("value")) // Drop rows with null value
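If you do need to keep track of which source column each value came from (an extension not covered in the answer above), you can explode an array of (source, value) structs instead of plain values. A sketch:

import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

// Wrap each value together with its column name before exploding.
val expandedWithSource = df.withColumn(
  "exploded",
  explode(array(toExpand.map(c => struct(lit(c).as("source"), col(c).as("value"))): _*))
)
val resultWithSource = toExpand
  .foldLeft(expandedWithSource)((df, c) => df.drop(c)) // Drop the obsolete *_value columns
  .select(col("*"), col("exploded.source"), col("exploded.value"))
  .drop("exploded")
  .na.drop(Seq("value")) // Drop rows where the exploded value is null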
Here is my solution. If you have a better one, let me know.
val procRDD = rawDF.flatMap( x =>
for(valInd <-10 to 14) yield { // valInd represents column number
if(x.get(valInd) != null) {
try { Some( ..) }
catch { case e: Exception => None}
}else None
})
.filter({case Some(y) => true; case None=> false})
.map(_.get)
Actually, I was looking for filter and map, and for how to put the commands inside them.
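As a side note (assuming the same pre-2.0 DataFrame.flatMap used in the original code), the explicit Some/None filtering followed by .get can usually be folded into Option plus flatMap. A sketch, with the row construction left elided as in the original and the try/catch error handling omitted:

val procRDD = rawDF.flatMap { x =>
  (10 to 14).flatMap { valInd =>
    // Option(...) turns a null cell into None, which flatMap simply drops.
    Option(x.get(valInd)).map(value => /* build the output row from x and value here */ value)
  }
}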