Element position from nested DataFrame array (Spark 2.2) - scala

I'm trying to explode a nested DataFrame in Spark Scala. I have a DataFrame df which contains the following information:
root
|-- id: integer (nullable = false)
|-- features: array (nullable = true)
| |-- element: float (containsNull = false)
I've exploded the array information into a flat DataFrame with:
df.selectExpr("id","explode(features) as features")
and got the following DataFrame:
id features
0 0.0629885
0 0.15931357
0 0.08922347
My end goal is to pivot the data and calculate some similarities with that information. To do that, it would be very cool to get the actual position of the feature for every ID into the DataFrame, like this:
id features feature_pos
0 0.0629885 0
0 0.15931357 1
0 0.08922347 2

Use posexplode in place of explode. From the documentation:
posexplode: Creates a new row for each element with position in the given array or map column.
posexplode_outer: Unlike posexplode, if the array/map is null or empty then the row (null, null) is produced.

Here is the example with posexplode.
scala> val df = Seq((0, Seq(0.1f, 0.2f, 0.3f)),(1, Seq(0.4f, 0.5f, 0.6f))).toDF("id", "features")
df: org.apache.spark.sql.DataFrame = [id: int, features: array<float>]
scala> df.show(false)
+---+---------------+
|id |features |
+---+---------------+
|0 |[0.1, 0.2, 0.3]|
|1 |[0.4, 0.5, 0.6]|
+---+---------------+
Note that df.withColumn("pos cols", posexplode('features)).show(false) will throw an error, because posexplode produces two columns (pos and col) while withColumn expects a single one, so use df.select():
scala> df.select(posexplode('features)).show(false)
+---+---+
|pos|col|
+---+---+
|0 |0.1|
|1 |0.2|
|2 |0.3|
|0 |0.4|
|1 |0.5|
|2 |0.6|
+---+---+
scala>
The default column names are "pos" and "col". You can rename them as follows:
scala> df.select(posexplode('features).as(Seq("a","b"))).show(false)
+---+---+
|a |b |
+---+---+
|0 |0.1|
|1 |0.2|
|2 |0.3|
|0 |0.4|
|1 |0.5|
|2 |0.6|
+---+---+
scala>
When you want to explode and select all columns, use
scala> df.select(col("*"), posexplode('features).as( Seq("a","b")) ).show(false)
+---+---------------+---+---+
|id |features |a |b |
+---+---------------+---+---+
|0 |[0.1, 0.2, 0.3]|0 |0.1|
|0 |[0.1, 0.2, 0.3]|1 |0.2|
|0 |[0.1, 0.2, 0.3]|2 |0.3|
|1 |[0.4, 0.5, 0.6]|0 |0.4|
|1 |[0.4, 0.5, 0.6]|1 |0.5|
|1 |[0.4, 0.5, 0.6]|2 |0.6|
+---+---------------+---+---+
scala>
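Applied back to the question's DataFrame, a minimal sketch (assuming the original df with id and features, and that the exploded value should keep the name features) could look like this:
import org.apache.spark.sql.functions.{col, posexplode}

// posexplode yields (pos, col); rename them to match the desired output
val withPos = df
  .select(col("id"), posexplode(col("features")))
  .toDF("id", "feature_pos", "features")
  .select("id", "features", "feature_pos")

withPos.show(false)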

You can also apply Scala's zipWithIndex via a UDF as follows:
import org.apache.spark.sql.functions.{explode, udf}
import spark.implicits._

val df = Seq(
  (0, Seq(0.1f, 0.2f, 0.3f)),
  (1, Seq(0.4f, 0.5f, 0.6f))
).toDF("id", "features")

def addIndex = udf(
  (s: Seq[Float]) => s.zipWithIndex
)

val df2 = df.withColumn("features_idx", explode(addIndex($"features")))

df2.select($"id", $"features_idx._1".as("features"), $"features_idx._2".as("features_pos")).show
+---+--------+------------+
| id|features|features_pos|
+---+--------+------------+
| 0| 0.1| 0|
| 0| 0.2| 1|
| 0| 0.3| 2|
| 1| 0.4| 0|
| 1| 0.5| 1|
| 1| 0.6| 2|
+---+--------+------------+
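Since the end goal mentioned in the question is to pivot the data, here is a minimal follow-up sketch (assuming the withPos DataFrame from the earlier sketch, and a single value per (id, feature_pos) pair):
import org.apache.spark.sql.functions.first

// one column per feature position, one row per id
val pivoted = withPos
  .groupBy("id")
  .pivot("feature_pos")
  .agg(first("features"))

pivoted.show(false)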

Related

Add values to a dataframe against some particular ID in Spark Scala

I have the following dataframe:
ID Name City
1 Ali swl
2 Sana lhr
3 Ahad khi
4 ABC fsd
And a list of values like (1,2,1):
val nums: List[Int] = List(1, 2, 1)
I want to add these values against ID == 3, so that the DataFrame looks like:
ID Name City newCol newCol2 newCol3
1 Ali swl null null null
2 Sana lhr null null null
3 Ahad khi 1 2 1
4 ABC fsd null null null
I wonder if this is possible. Any help will be appreciated. Thanks
Yes, it's possible.
Use when to populate matched values and otherwise for unmatched values.
I have used zipWithIndex to make the column names unique.
Please check the code below.
scala> import org.apache.spark.sql.functions._
scala> val df = Seq((1,"Ali","swl"),(2,"Sana","lhr"),(3,"Ahad","khi"),(4,"ABC","fsd")).toDF("id","name","city") // Creating DataFrame with given sample data.
df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]
scala> val nums = List(1,2,1) // List values.
nums: List[Int] = List(1, 2, 1)
scala> val filterData = List(3)
scala> spark.time{ nums.zipWithIndex.foldLeft(df)((df,c) => df.withColumn(s"newCol${c._2}",when($"id".isin(filterData:_*),c._1).otherwise(null))).show(false) } // Used zipWithIndex to make column names unique.
+---+----+----+-------+-------+-------+
|id |name|city|newCol0|newCol1|newCol2|
+---+----+----+-------+-------+-------+
|1 |Ali |swl |null |null |null |
|2 |Sana|lhr |null |null |null |
|3 |Ahad|khi |1 |2 |1 |
|4 |ABC |fsd |null |null |null |
+---+----+----+-------+-------+-------+
Time taken: 43 ms
scala>
First, you can convert the list to a DataFrame with a single array column and then "decompose" the array column into columns as follows:
import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._
val numsDf =
  Seq(nums)
    .toDF("nums")
    .select(nums.indices.map(i => col("nums")(i).alias(s"newCol$i")): _*)
After that, you can use an outer join to join the data to numsDf on the ID == 3 condition as follows:
val resultDf = data.join(numsDf, data.col("ID") === lit(3), "outer")
resultDf.show() will print:
+---+----+----+-------+-------+-------+
| ID|Name|City|newCol0|newCol1|newCol2|
+---+----+----+-------+-------+-------+
| 1| Ali| swl| null| null| null|
| 2|Sana| lhr| null| null| null|
| 3|Ahad| khi| 1| 2| 1|
| 4| ABC| fsd| null| null| null|
+---+----+----+-------+-------+-------+
Make sure you have added the spark.sql.crossJoin.enabled = true option to the Spark session:
val spark = SparkSession.builder()
...
.config("spark.sql.crossJoin.enabled", value = true)
.getOrCreate()

isin throws stackoverflow error in withcolumn function in spark

I am using Spark 2.3 in my Scala application. I have a DataFrame, created from Spark SQL, which is named sqlDF in the sample code I shared. I have a string list that has the items below:
List[String] stringList items
-9,-8,-7,-6
I want to replace, in every column of the DataFrame, all values that match an item in this list with 0.
Initial dataframe
column1 | column2 | column3
1 |1 |1
2 |-5 |1
6 |-6 |1
-7 |-8 |-7
It must become:
column1 | column2 | column3
1 |1 |1
2 |-5 |1
6 |0 |1
0 |0 |0
For this I am iterating the query below over all columns (more than 500) in sqlDF.
sqlDF = sqlDF.withColumn(currColumnName, when(col(currColumnName).isin(stringList:_*), 0).otherwise(col(currColumnName)))
But I am getting the error below. If I choose only one column to iterate over it works, but if I run the code above over all 500 columns it fails:
Exception in thread "streaming-job-executor-0" java.lang.StackOverflowError
    at scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:57)
    at scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:52)
    at scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:229)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
    at scala.collection.immutable.List.map(List.scala:285)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:333)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
What am I missing?
Here is a different approach, applying a left anti join between each column and X, where X is your list of items transferred into a DataFrame. The left anti join returns all the items not present in X; the results are then concatenated back together through an outer join (which can be replaced with a left join for better performance, although that would exclude records that are all zeros, i.e. id == 3) based on an id assigned with monotonically_increasing_id:
import org.apache.spark.sql.functions.{monotonically_increasing_id, col}
import spark.implicits._

val df = Seq(
    (1, 1, 1),
    (2, -5, 1),
    (6, -6, 1),
    (-7, -8, -7))
  .toDF("c1", "c2", "c3")
  .withColumn("id", monotonically_increasing_id())

val exdf = Seq(-9, -8, -7, -6).toDF("x")

df.columns.map{ c =>
    df.select("id", c).join(exdf, col(c) === $"x", "left_anti")
  }
  .reduce((df1, df2) => df1.join(df2, Seq("id"), "outer"))
  .na.fill(0)
  .show
Output:
+---+---+---+---+
| id| c1| c2| c3|
+---+---+---+---+
| 0| 1| 1| 1|
| 1| 2| -5| 1|
| 3| 0| 0| 0|
| 2| 6| 0| 1|
+---+---+---+---+
foldLeft works perfectly for your case here, as below:
import org.apache.spark.sql.functions.{col, when}
import spark.implicits._

val df = spark.sparkContext.parallelize(Seq(
    (1, 1, 1),
    (2, -5, 1),
    (6, -6, 1),
    (-7, -8, -7)
  )).toDF("a", "b", "c")

val list = Seq(-9, -8, -7, -6)

val resultDF = df.columns.foldLeft(df) { (acc, name) =>
  acc.withColumn(name, when(col(name).isin(list: _*), 0).otherwise(col(name)))
}

resultDF.show(false)
Output:
+---+---+---+
|a |b |c |
+---+---+---+
|1 |1 |1 |
|2 |-5 |1 |
|6 |0 |1 |
|0 |0 |0 |
+---+---+---+
I would suggest broadcasting the list of Strings:
val stringList = sc.broadcast(<your List[String]>)
After that use this:
sqlDF = sqlDF.withColumn(currColumnName, when(col(currColumnName).isin(stringList.value:_*), 0).otherwise(col(currColumnName)))
Make sure your currColumnName column is also of String type; the comparison should be String to String.
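A minimal sketch combining this suggestion with the foldLeft pattern from the earlier answer (assuming sqlDF from the question and a hypothetical List[String] called items), so the replacement is applied to every column:
import org.apache.spark.sql.functions.{col, when}

// hypothetical list of values to replace; broadcast it once
val items: List[String] = List("-9", "-8", "-7", "-6")
val stringList = spark.sparkContext.broadcast(items)

// fold over all columns, replacing any broadcast value with 0
val cleanedDF = sqlDF.columns.foldLeft(sqlDF) { (acc, name) =>
  acc.withColumn(name, when(col(name).isin(stringList.value: _*), 0).otherwise(col(name)))
}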

How to split Comma-separated multiple columns into multiple rows?

I have a data frame with N fields, as mentioned below. The number of columns and the length of the values will vary.
Input Table:
+--------------+-----------+-----------------------+
|Date |Amount |Status |
+--------------+-----------+-----------------------+
|2019,2018,2017|100,200,300|IN,PRE,POST |
|2018 |73 |IN |
|2018,2017 |56,89 |IN,PRE |
+--------------+-----------+-----------------------+
I have to convert it into the below format with one sequence column.
Expected Output Table:
+----+------+------+--------+
|Date|Amount|Status|Sequence|
+----+------+------+--------+
|2019|100   |IN    |1       |
|2018|200   |PRE   |2       |
|2017|300   |POST  |3       |
|2018|73    |IN    |1       |
|2018|56    |IN    |1       |
|2017|89    |PRE   |2       |
+----+------+------+--------+
I have tried using explode, but explode only takes one array at a time.
var df = dataRefined.withColumn("TOT_OVRDUE_TYPE", explode(split($"TOT_OVRDUE_TYPE", "\\"))).toDF
var df1 = df.withColumn("TOT_OD_TYPE_AMT", explode(split($"TOT_OD_TYPE_AMT", "\\"))).show
Does someone know how I can do it? Thank you for your help.
Here is another approach using posexplode for each column and joining all produced dataframes into one:
import org.apache.spark.sql.functions.{posexplode, monotonically_increasing_id, col}
import spark.implicits._

val df = Seq(
    (Seq("2019", "2018", "2017"), Seq("100", "200", "300"), Seq("IN", "PRE", "POST")),
    (Seq("2018"), Seq("73"), Seq("IN")),
    (Seq("2018", "2017"), Seq("56", "89"), Seq("IN", "PRE")))
  .toDF("Date", "Amount", "Status")
  .withColumn("idx", monotonically_increasing_id)

df.columns.filter(_ != "idx").map{
    c => df.select($"idx", posexplode(col(c))).withColumnRenamed("col", c)
  }
  .reduce((ds1, ds2) => ds1.join(ds2, Seq("idx", "pos")))
  .select($"Date", $"Amount", $"Status", $"pos".plus(1).as("Sequence"))
  .show
Output:
+----+------+------+--------+
|Date|Amount|Status|Sequence|
+----+------+------+--------+
|2019| 100| IN| 1|
|2018| 200| PRE| 2|
|2017| 300| POST| 3|
|2018| 73| IN| 1|
|2018| 56| IN| 1|
|2017| 89| PRE| 2|
+----+------+------+--------+
You can achieve this by using the DataFrame built-in functions arrays_zip, split and posexplode (arrays_zip is available from Spark 2.4).
Explanation: split each comma-separated string on ',' to create an array, zip the three arrays together with arrays_zip, then posexplode the zipped array to get the position p and the struct of values colum; finally select fields 0, 1 and 2 of the struct and add 1 to the position to get the Sequence column.
scala> val df = Seq((("2019,2018,2017"),("100,200,300"),("IN,PRE,POST")),(("2018"),("73"),("IN")),(("2018,2017"),("56,89"),("IN,PRE"))).toDF("date","amount","status")
scala> :paste
df.selectExpr("""posexplode(
        arrays_zip(
          split(date,","),
          split(amount,","),
          split(status,","))) as (p, colum)""")
  .selectExpr(
    "colum.`0` as Date",
    "colum.`1` as Amount",
    "colum.`2` as Status",
    "p+1 as Sequence")
  .show()
Result:
+----+------+------+--------+
|Date|Amount|Status|Sequence|
+----+------+------+--------+
|2019| 100| IN| 1|
|2018| 200| PRE| 2|
|2017| 300| POST| 3|
|2018| 73| IN| 1|
|2018| 56| IN| 1|
|2017| 89| PRE| 2|
+----+------+------+--------+
Yes, I personally also find explode a bit annoying and in your case I would probably go with a flatMap instead:
import spark.implicits._
import org.apache.spark.sql.Row

val df = spark.sparkContext.parallelize(Seq(
  (Seq(2019, 2018, 2017), Seq(100, 200, 300), Seq("IN", "PRE", "POST")),
  (Seq(2018), Seq(73), Seq("IN")),
  (Seq(2018, 2017), Seq(56, 89), Seq("IN", "PRE"))
)).toDF()

val transformedDF = df
  .flatMap{ case Row(dates: Seq[Int], amounts: Seq[Int], statuses: Seq[String]) =>
    dates.indices.map(index => (dates(index), amounts(index), statuses(index), index + 1)) }
  .toDF("Date", "Amount", "Status", "Sequence")
Output:
transformedDF.show
+----+------+------+--------+
|Date|Amount|Status|Sequence|
+----+------+------+--------+
|2019| 100| IN| 1|
|2018| 200| PRE| 2|
|2017| 300| POST| 3|
|2018| 73| IN| 1|
|2018| 56| IN| 1|
|2017| 89| PRE| 2|
+----+------+------+--------+
Assuming the number of data elements in each column is the same for each row:
First, I recreated your DataFrame
import org.apache.spark.sql._
import spark.implicits._
import scala.collection.mutable.ListBuffer

val df = Seq(("2019,2018,2017", "100,200,300", "IN,PRE,POST"), ("2018", "73", "IN"),
  ("2018,2017", "56,89", "IN,PRE")).toDF("Date", "Amount", "Status")
Next, I split the rows and added a sequence value, then converted back to a DF:
val exploded = df.rdd.flatMap(row => {
  val buffer = new ListBuffer[(String, String, String, Int)]
  val dateSplit = row(0).toString.split("\\,", -1)
  val amountSplit = row(1).toString.split("\\,", -1)
  val statusSplit = row(2).toString.split("\\,", -1)
  val seqSize = dateSplit.size
  for (i <- 0 to seqSize - 1)
    buffer += Tuple4(dateSplit(i), amountSplit(i), statusSplit(i), i + 1)
  buffer.toList
}).toDF((df.columns :+ "Sequence"): _*)
I'm sure there are other ways to do it without first converting the DF to an RDD, but this still results in a DF with the correct answer.
Let me know if you have any questions.
I took advantage of transpose to zip all the sequences by position and then did a posexplode. The selects on the DataFrames are dynamic, to satisfy the condition from the question that the number of columns and the length of the values will vary.
import org.apache.spark.sql.functions._

val df = Seq(
  ("2019,2018,2017", "100,200,300", "IN,PRE,POST"),
  ("2018", "73", "IN"),
  ("2018,2017", "56,89", "IN,PRE")
).toDF("Date", "Amount", "Status")
df: org.apache.spark.sql.DataFrame = [Date: string, Amount: string ... 1 more field]
scala> df.show(false)
+--------------+-----------+-----------+
|Date |Amount |Status |
+--------------+-----------+-----------+
|2019,2018,2017|100,200,300|IN,PRE,POST|
|2018 |73 |IN |
|2018,2017 |56,89 |IN,PRE |
+--------------+-----------+-----------+
scala> def transposeSeqOfSeq[S](x:Seq[Seq[S]]): Seq[Seq[S]] = { x.transpose }
transposeSeqOfSeq: [S](x: Seq[Seq[S]])Seq[Seq[S]]
scala> val myUdf = udf { transposeSeqOfSeq[String] _}
myUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(ArrayType(StringType,true),true),Some(List(ArrayType(ArrayType(StringType,true),true))))
scala> val df2 = df.select(df.columns.map(c => split(col(c), ",") as c): _*)
df2: org.apache.spark.sql.DataFrame = [Date: array<string>, Amount: array<string> ... 1 more field]
scala> df2.show(false)
+------------------+---------------+---------------+
|Date |Amount |Status |
+------------------+---------------+---------------+
|[2019, 2018, 2017]|[100, 200, 300]|[IN, PRE, POST]|
|[2018] |[73] |[IN] |
|[2018, 2017] |[56, 89] |[IN, PRE] |
+------------------+---------------+---------------+
scala> val df3 = df2.withColumn("allcols", array(df.columns.map(c => col(c)): _*))
df3: org.apache.spark.sql.DataFrame = [Date: array<string>, Amount: array<string> ... 2 more fields]
scala> df3.show(false)
+------------------+---------------+---------------+------------------------------------------------------+
|Date |Amount |Status |allcols |
+------------------+---------------+---------------+------------------------------------------------------+
|[2019, 2018, 2017]|[100, 200, 300]|[IN, PRE, POST]|[[2019, 2018, 2017], [100, 200, 300], [IN, PRE, POST]]|
|[2018] |[73] |[IN] |[[2018], [73], [IN]] |
|[2018, 2017] |[56, 89] |[IN, PRE] |[[2018, 2017], [56, 89], [IN, PRE]] |
+------------------+---------------+---------------+------------------------------------------------------+
scala> val df4 = df3.withColumn("ab", myUdf($"allcols")).select($"ab", posexplode($"ab"))
df4: org.apache.spark.sql.DataFrame = [ab: array<array<string>>, pos: int ... 1 more field]
scala> df4.show(false)
+------------------------------------------------------+---+-----------------+
|ab |pos|col |
+------------------------------------------------------+---+-----------------+
|[[2019, 100, IN], [2018, 200, PRE], [2017, 300, POST]]|0 |[2019, 100, IN] |
|[[2019, 100, IN], [2018, 200, PRE], [2017, 300, POST]]|1 |[2018, 200, PRE] |
|[[2019, 100, IN], [2018, 200, PRE], [2017, 300, POST]]|2 |[2017, 300, POST]|
|[[2018, 73, IN]] |0 |[2018, 73, IN] |
|[[2018, 56, IN], [2017, 89, PRE]] |0 |[2018, 56, IN] |
|[[2018, 56, IN], [2017, 89, PRE]] |1 |[2017, 89, PRE] |
+------------------------------------------------------+---+-----------------+
scala> val selCols = (0 until df.columns.length).map(i => $"col".getItem(i).as(df.columns(i))) :+ ($"pos"+1).as("Sequence")
selCols: scala.collection.immutable.IndexedSeq[org.apache.spark.sql.Column] = Vector(col[0] AS `Date`, col[1] AS `Amount`, col[2] AS `Status`, (pos + 1) AS `Sequence`)
scala> df4.select(selCols:_*).show(false)
+----+------+------+--------+
|Date|Amount|Status|Sequence|
+----+------+------+--------+
|2019|100 |IN |1 |
|2018|200 |PRE |2 |
|2017|300 |POST |3 |
|2018|73 |IN |1 |
|2018|56 |IN |1 |
|2017|89 |PRE |2 |
+----+------+------+--------+
This is why I love the spark-core APIs. Just with the help of map and flatMap you can handle many problems. Just pass your df and an instance of SQLContext to the method below and it will give the desired result:
import org.apache.spark.sql.{DataFrame, Row, SQLContext}

def reShapeDf(df: DataFrame, sqlContext: SQLContext): DataFrame = {
  // pull the three comma-separated string columns out of each Row
  val rdd = df.rdd.map(m => (m.getAs[String](0), m.getAs[String](1), m.getAs[String](2)))
  // split each string and zip the pieces together position by position
  val rdd1 = rdd.flatMap(a => a._1.split(",").zip(a._2.split(",")).zip(a._3.split(",")))
  // flatten ((a, b), c) into (a, b, c)
  val rdd2 = rdd1.map {
    case ((a, b), c) => (a, b, c)
  }
  sqlContext.createDataFrame(rdd2.map(m => Row.fromTuple(m)), df.schema)
}
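A hypothetical usage, assuming the df with the Date, Amount and Status string columns built earlier in this thread (note that, as written, the method keeps the original three columns but does not add the Sequence column):
val reshaped = reShapeDf(df, spark.sqlContext)
reshaped.show(false)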

How to find the max String length of a column in Spark using dataframe?

I have a dataframe. I need to calculate the Max length of the String value in a column and print both the value and its length.
I have written the code below, but the output is only the max length, not its corresponding value.
This question, How to get max length of string column from dataframe using scala?, did help me get the query below.
df.agg(max(length(col("city")))).show()
Use the row_number() window function, ordered by length('city) desc.
Then filter to keep only the row where row_number is 1, and add a length('city) column to the dataframe.
Ex:
val df=Seq(("A",1,"US"),("AB",1,"US"),("ABC",1,"US"))
.toDF("city","num","country")
val win=Window.orderBy(length('city).desc)
df.withColumn("str_len",length('city))
.withColumn("rn", row_number().over(win))
.filter('rn===1)
.show(false)
+----+---+-------+-------+---+
|city|num|country|str_len|rn |
+----+---+-------+-------+---+
|ABC |1 |US |3 |1 |
+----+---+-------+-------+---+
(or)
In spark-sql:
df.createOrReplaceTempView("lpl")
spark.sql("select * from (select *,length(city)str_len,row_number() over (order by length(city) desc)rn from lpl)q where q.rn=1")
.show(false)
+----+---+-------+-------+---+
|city|num|country|str_len| rn|
+----+---+-------+-------+---+
| ABC| 1| US| 3| 1|
+----+---+-------+-------+---+
Update:
Find min,max values:
val win_desc=Window.orderBy(length('city).desc)
val win_asc=Window.orderBy(length('city).asc)
df.withColumn("str_len",length('city))
.withColumn("rn", row_number().over(win_desc))
.withColumn("rn1",row_number().over(win_asc))
.filter('rn===1 || 'rn1 === 1)
.show(false)
Result:
+----+---+-------+-------+---+---+
|city|num|country|str_len|rn |rn1|
+----+---+-------+-------+---+---+
|A |1 |US |1 |3 |1 | //min value of string
|ABC |1 |US |3 |1 |3 | //max value of string
+----+---+-------+-------+---+---+
In case you have multiple rows which share the same maximum length, the solution with the window function won't work, since it keeps only the first row after ordering.
Another way would be to create a new column with the length of the string, find its max element, and filter the data frame on the obtained maximum value.
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import spark.implicits._
val df=Seq(("A",1,"US"),("AB",1,"US"),("ABC",1,"US"), ("DEF", 2, "US"))
.toDF("city","num","country")
val dfWithLength = df.withColumn("city_length", length($"city")).cache()
dfWithLength.show()
+----+---+-------+-----------+
|city|num|country|city_length|
+----+---+-------+-----------+
| A| 1| US| 1|
| AB| 1| US| 2|
| ABC| 1| US| 3|
| DEF| 2| US| 3|
+----+---+-------+-----------+
val Row(maxValue: Int) = dfWithLength.agg(max("city_length")).head()
dfWithLength.filter($"city_length" === maxValue).show()
+----+---+-------+-----------+
|city|num|country|city_length|
+----+---+-------+-----------+
| ABC| 1| US| 3|
| DEF| 2| US| 3|
+----+---+-------+-----------+
Find a maximum string length on a string column with pyspark
from pyspark.sql.functions import length, col, max
df2 = df.withColumn("len_Description",length(col("Description"))).groupBy().max("len_Description")

How to randomly choose element in array column of different size?

Given a dataframe with a column of arrays of integers with different sizes:
scala> sampleDf.show()
+------------+
| arrays|
+------------+
|[15, 16, 17]|
|[15, 16, 17]|
| [14]|
| [11]|
| [11]|
+------------+
scala> sampleDf.printSchema()
root
|-- arrays: array (nullable = true)
| |-- element: integer (containsNull = true)
I would like to generate a new column with a randomly chosen item from each array.
I've tried two solutions:
1. Using UDF:
import scala.util.Random
def getRandomElement(arr: Array[Int]): Int = {
  arr(Random.nextInt(arr.size))
}

val getRandomElementUdf = udf{ arr: Array[Int] => getRandomElement(arr) }

sampleDf.withColumn("randomItem", getRandomElementUdf('arrays)).show
crashes on the last line with a long error message: (extracts)
...
Caused by: org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (array<int>) => int)
...
Caused by: java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [I
I've tried with the alternative udf definition:
val getRandomElementUdf = udf[Int, Array[Int]] (getRandomElement)
but I get the same error.
2. Second method by creating intermediary columns with a random index in the range of the corresponding array:
// Return a dataframe with a column with random index from column of Arrays with different sizes
def choice(df: DataFrame, colName: String): DataFrame = {
  df.withColumn("array_size", size(col(colName)))
    .withColumn("random_idx", least('array_size, floor(rand * 'array_size)))
}
choice(sampleDf, "arrays").show
outputs:
+------------+----------+----------+
| arrays|array_size|random_idx|
+------------+----------+----------+
|[15, 16, 17]| 3| 2|
|[15, 16, 17]| 3| 1|
| [14]| 1| 0|
| [11]| 1| 0|
| [11]| 1| 0|
+------------+----------+----------+
and ideally we would like to use the column random_idx to choose an item in column arrays, kind of:
sampleDf.withColumn("choosen_item", 'arrays.getItem('random_idx))
Unfortunately, getItem cannot take a column as an argument.
Any suggestion is welcome.
You can use the udf below to select a random element from the array:
import scala.util.Random
import org.apache.spark.sql.functions.udf

val getRandomElement = udf((array: Seq[Integer]) => {
  array(Random.nextInt(array.size))
})

df.withColumn("c1", getRandomElement($"arrays"))
  .withColumn("c2", getRandomElement($"arrays"))
  .withColumn("c3", getRandomElement($"arrays"))
  .withColumn("c4", getRandomElement($"arrays"))
  .withColumn("c5", getRandomElement($"arrays"))
  .show(false)
You can see the random element selected in each use as a new column.
+------------+---+---+---+---+---+
|arrays |c1 |c2 |c3 |c4 |c5 |
+------------+---+---+---+---+---+
|[15, 16, 17]|15 |16 |16 |17 |16 |
|[15, 16, 17]|16 |16 |17 |15 |15 |
|[14] |14 |14 |14 |14 |14 |
|[11] |11 |11 |11 |11 |11 |
|[11] |11 |11 |11 |11 |11 |
+------------+---+---+---+---+---+
If you want to remain udf-free, here is a possibility:
First add a key to the dataframe produced by choice (assume its name is choiceDf):
val myDf = choiceDf.withColumn("key", monotonically_increasing_id())
Then create an intermediary dataframe that explodes the arrays column and keeps the index of the values:
val tmp = myDf.select('key, posexplode('arrays))
Finally join using key and random_idx:
myDf.join(tmp.withColumnRenamed("pos", "random_idx"), Seq("key", "random_idx"), "left")
The item you are looking for is stored in the column col:
+---+----------+------------+----------+---+
|key|random_idx| arrays|array_size|col|
+---+----------+------------+----------+---+
| 0| 2|[15, 16, 17]| 3| 17|
| 1| 1|[15, 16, 17]| 3| 16|
| 2| 0| [14]| 1| 14|
+---+----------+------------+----------+---+
You can extract a random element from an array in one line using Spark SQL:
sampleDF.createOrReplaceTempView("sampleDF")
spark.sql("select arrays[Cast((FLOOR(RAND() * FLOOR(size(arrays)))) as INT)] as random from sampleDF")