Sample values from a list column in a Spark DataFrame - Scala

I have a Spark Scala DataFrame, df1, as shown below. I would like to sample with replacement from the scores column (a List), based on the value in the counts column of df1.
val df1 = sc.parallelize(Seq(("a1",2,List(20,10)),("a2",1,List(30,10)),
("a3",3,List(10)),("a4",2,List(10,20,40)))).toDF("colA","counts","scores")
df1.show()
+----+------+------------+
|colA|counts| scores|
+----+------+------------+
| a1| 2| [20, 10]|
| a2| 1| [30, 10]|
| a3| 3| [10]|
| a4| 2|[10, 20, 40]|
+----+------+------------+
The expected output is shown in df2: from row 1, sample 2 values from the list [20, 10]; from row 2, sample 1 value from [30, 10]; from row 3, sample 3 values from [10] with repetition; and so on.
df2.show() //expected output
+----+------+------------+-------------+
|colA|counts| scores|sampledScores|
+----+------+------------+-------------+
| a1| 2| [20, 10]| [20, 10]|
| a2| 1| [30, 10]| [30]|
| a3| 3| [10]| [10, 10, 10]|
| a4| 2|[10, 20, 40]| [10, 40]|
+----+------+------------+-------------+
I wrote a UDF takeSample and applied it to df1, but it did not work as intended.
val takeSample = udf((a: Array[Int], count1: Int) => {
  Array.fill(count1)(a(new Random(System.currentTimeMillis).nextInt(a.size)))
})
val df2 = df1.withColumn("SampledScores", takeSample(df1("scores"), df1("counts")))
I got the following run-time error when executing:
df2.printSchema()
root
|-- colA: string (nullable = true)
|-- counts: integer (nullable = true)
|-- scores: array (nullable = true)
| |-- element: integer (containsNull = false)
|-- SampledScores: array (nullable = true)
| |-- element: integer (containsNull = false)
df2.show()
org.apache.spark.SparkException: Failed to execute user defined
function($anonfun$1: (array<int>, int) => array<int>)
Caused by: java.lang.ClassCastException:
scala.collection.mutable.WrappedArray$ofRef cannot be cast to [I
at $anonfun$1.apply(<console>:47)
Any solution is greatly appreciated.

Changing the data type from Array[Int] to Seq[Int] in the UDF will resolve the issue:
import scala.util.Random
import org.apache.spark.sql.functions.udf

val takeSample = udf((a: Seq[Int], count1: Int) => {
  Array.fill(count1)(a(new Random(System.currentTimeMillis).nextInt(a.size)))
})
val df2 = df1.withColumn("SampledScores", takeSample(df1("scores"), df1("counts")))
It will give us the expected output:
df2.printSchema()
root
|-- colA: string (nullable = true)
|-- counts: integer (nullable = true)
|-- scores: array (nullable = true)
| |-- element: integer (containsNull = false)
|-- SampledScores: array (nullable = true)
| |-- element: integer (containsNull = false)
df2.show
+----+------+------------+-------------+
|colA|counts| scores|SampledScores|
+----+------+------------+-------------+
| a1| 2| [20, 10]| [20, 20]|
| a2| 1| [30, 10]| [30]|
| a3| 3| [10]| [10, 10, 10]|
| a4| 2|[10, 20, 40]| [20, 20]|
+----+------+------------+-------------+
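One thing worth noting about this fix: because the UDF creates a new Random seeded with System.currentTimeMillis for every element, all count1 draws within a row are likely to use the same seed and therefore pick the same index (the repeated [20, 20] rows above are consistent with that, although sampling with replacement can of course also repeat values legitimately). A minimal variant that reuses a single Random per row, so the draws are independent (a sketch, not part of the original answer):
import scala.util.Random
import org.apache.spark.sql.functions.udf

val takeSampleIndependent = udf { (a: Seq[Int], count1: Int) =>
  val rnd = new Random()                      // seeded once per row
  Array.fill(count1)(a(rnd.nextInt(a.size)))  // count1 draws with replacement
}

val df3 = df1.withColumn("sampledScores", takeSampleIndependent(df1("scores"), df1("counts")))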

Related

Flatten Scala array data type column to multiple columns

Is there any possible way to flatten an array column in a Scala DataFrame?
I know that withColumn and select("filed.a") work, but I don't want to specify the fields manually.
df.printSchema()
|-- client_version: string (nullable = true)
|-- filed: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: string (nullable = true)
| | |-- b: string (nullable = true)
| | |-- c: string (nullable = true)
| | |-- d: string (nullable = true)
The final df should look like this:
df.printSchema()
|-- client_version: string (nullable = true)
|-- filed_a: string (nullable = true)
|-- filed_b: string (nullable = true)
|-- filed_c: string (nullable = true)
|-- filed_d: string (nullable = true)
You can flatten your ArrayType column with explode and map the nested struct element names to the wanted top-level column names, as shown below:
import org.apache.spark.sql.functions._
import spark.implicits._

case class S(a: String, b: String, c: String, d: String)

val df = Seq(
  ("1.0", Seq(S("a1", "b1", "c1", "d1"))),
  ("2.0", Seq(S("a2", "b2", "c2", "d2"), S("a3", "b3", "c3", "d3")))
).toDF("client_version", "filed")
df.printSchema
// root
// |-- client_version: string (nullable = true)
// |-- filed: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- a: string (nullable = true)
// | | |-- b: string (nullable = true)
// | | |-- c: string (nullable = true)
// | | |-- d: string (nullable = true)
val dfFlattened = df.withColumn("filed_element", explode($"filed"))

val structElements = dfFlattened.select($"filed_element.*").columns

val dfResult = dfFlattened.select(
  col("client_version") +: structElements.map(c => col(s"filed_element.$c").as(s"filed_$c")): _*
)
dfResult.show
// +--------------+-------+-------+-------+-------+
// |client_version|filed_a|filed_b|filed_c|filed_d|
// +--------------+-------+-------+-------+-------+
// | 1.0| a1| b1| c1| d1|
// | 2.0| a2| b2| c2| d2|
// | 2.0| a3| b3| c3| d3|
// +--------------+-------+-------+-------+-------+
dfResult.printSchema
// root
// |-- client_version: string (nullable = true)
// |-- filed_a: string (nullable = true)
// |-- filed_b: string (nullable = true)
// |-- filed_c: string (nullable = true)
// |-- filed_d: string (nullable = true)
Use explode to flatten the arrays by adding more rows and then select with the * notation to bring the struct columns back to the top.
import org.apache.spark.sql.functions.{collect_list, explode, struct}
import spark.implicits._
val df = Seq(("1", "a", "a", "a"),
("1", "b", "b", "b"),
("2", "a", "a", "a"),
("2", "b", "b", "b"),
("2", "c", "c", "c"),
("3", "a", "a","a")).toDF("idx", "A", "B", "C")
.groupBy(("idx"))
.agg(collect_list(struct("A", "B", "C")).as("nested_col"))
df.printSchema()
// root
// |-- idx: string (nullable = true)
// |-- nested_col: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- A: string (nullable = true)
// | | |-- B: string (nullable = true)
// | | |-- C: string (nullable = true)
df.show
// +---+--------------------+
// |idx| nested_col|
// +---+--------------------+
// | 3| [[a, a, a]]|
// | 1|[[a, a, a], [b, b...|
// | 2|[[a, a, a], [b, b...|
// +---+--------------------+
val dfExploded = df.withColumn("exploded", explode($"nested_col")).drop("nested_col")
dfExploded.show
// +---+---------+
// |idx| exploded|
// +---+---------+
// | 3|[a, a, a]|
// | 1|[a, a, a]|
// | 1|[b, b, b]|
// | 2|[a, a, a]|
// | 2|[b, b, b]|
// | 2|[c, c, c]|
// +---+---------+
val finalDF = dfExploded.select("idx", "exploded.*")
finalDF.show
// +---+---+---+---+
// |idx| A| B| C|
// +---+---+---+---+
// | 3| a| a| a|
// | 1| a| a| a|
// | 1| b| b| b|
// | 2| a| a| a|
// | 2| b| b| b|
// | 2| c| c| c|
// +---+---+---+---+
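One caveat that applies to both answers: explode drops rows whose array column is null or empty. If such rows must be kept, Spark 2.2+ offers explode_outer, which can be swapped in directly. A sketch against the dfExploded step above:
import org.apache.spark.sql.functions.explode_outer

// Same as dfExploded above, but rows with a null or empty nested_col survive (with nulls).
val dfExplodedOuter = df.withColumn("exploded", explode_outer($"nested_col")).drop("nested_col")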

Read CSV with last column as array of values (and the values are inside parenthesis and separated by comma) in Spark

I have a CSV file where the last column is inside parentheses and its values are separated by commas. The number of values in the last column is variable. When I read it as a DataFrame with column names as shown below, I get Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match. My CSV file looks like this:
a1,b1,true,2017-05-16T07:00:41.0000000,2.5,(c1,d1,e1)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2,d2,e2,f2,g2)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2,d2)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2,d2,e2)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2,d2,e2,k2,f2)
What I finally want is something like this:
root
|-- MId: string (nullable = true)
|-- PId: string (nullable = true)
|-- IsTeacher: boolean(nullable = true)
|-- STime: datetype(nullable = true)
|-- TotalMinutes: double(nullable = true)
|-- SomeArrayHeader: array<string>(nullable = true)
I have written the following code so far:
val infoDF =
sqlContext.read.format("csv")
.option("header", "false")
.load(inputPath)
.toDF(
"MId",
"PId",
"IsTeacher",
"STime",
"TotalMinutes",
"SomeArrayHeader")
I thought of reading the file without giving column names and then casting the columns after the fifth one to an array type, but then I have problems with the parentheses. Is there a way to tell the reader, while loading, that the fields inside the parentheses are actually one field of type array?
OK. The solution below is tactical and specific to your case, but it worked for me:
val df = spark.read.option("quote", "(").csv("in/staff.csv").toDF(
"MId",
"PId",
"IsTeacher",
"STime",
"TotalMinutes",
"arr")
df.show()
val df2 = df.withColumn("arr",split(regexp_replace('arr,"[)]",""),","))
df2.printSchema()
df2.show()
Output:
+---+---+---------+--------------------+------------+---------------+
|MId|PId|IsTeacher| STime|TotalMinutes| arr|
+---+---+---------+--------------------+------------+---------------+
| a1| b1| true|2017-05-16T07:00:...| 2.5| c1,d1,e1)|
| a2| b2| true|2017-05-26T07:00:...| 0.5|c2,d2,e2,f2,g2)|
| a2| b2| true|2017-05-26T07:00:...| 0.5| c2)|
| a2| b2| true|2017-05-26T07:00:...| 0.5| c2,d2)|
| a2| b2| true|2017-05-26T07:00:...| 0.5| c2,d2,e2)|
| a2| b2| true|2017-05-26T07:00:...| 0.5|c2,d2,e2,k2,f2)|
+---+---+---------+--------------------+------------+---------------+
root
|-- MId: string (nullable = true)
|-- PId: string (nullable = true)
|-- IsTeacher: string (nullable = true)
|-- STime: string (nullable = true)
|-- TotalMinutes: string (nullable = true)
|-- arr: array (nullable = true)
| |-- element: string (containsNull = true)
+---+---+---------+--------------------+------------+--------------------+
|MId|PId|IsTeacher| STime|TotalMinutes| arr|
+---+---+---------+--------------------+------------+--------------------+
| a1| b1| true|2017-05-16T07:00:...| 2.5| [c1, d1, e1]|
| a2| b2| true|2017-05-26T07:00:...| 0.5|[c2, d2, e2, f2, g2]|
| a2| b2| true|2017-05-26T07:00:...| 0.5| [c2]|
| a2| b2| true|2017-05-26T07:00:...| 0.5| [c2, d2]|
| a2| b2| true|2017-05-26T07:00:...| 0.5| [c2, d2, e2]|
| a2| b2| true|2017-05-26T07:00:...| 0.5|[c2, d2, e2, k2, f2]|
+---+---+---------+--------------------+------------+--------------------+
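The tactical read above leaves every column as a string except arr. If you also want the boolean, timestamp and double types from the desired schema, a follow-up cast along these lines should work (a sketch; depending on your Spark version, the timestamp strings with seven fractional digits may need to_timestamp with an explicit format instead of a plain cast):
import org.apache.spark.sql.functions.col

val dfTyped = df2
  .withColumn("IsTeacher", col("IsTeacher").cast("boolean"))
  .withColumn("STime", col("STime").cast("timestamp"))       // use .cast("date") if only the date part is needed
  .withColumn("TotalMinutes", col("TotalMinutes").cast("double"))

dfTyped.printSchema()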

Spark: How to split struct type into multiple columns?

I know this question has been asked many times on Stack Overflow and has been satisfactorily answered in most posts, but I'm not sure if this is the best way in my case.
I have a Dataset that has several struct types embedded in it:
root
|-- STRUCT1: struct (nullable = true)
| |-- FIELD_1: string (nullable = true)
| |-- FIELD_2: long (nullable = true)
| |-- FIELD_3: integer (nullable = true)
|-- STRUCT2: struct (nullable = true)
| |-- FIELD_4: string (nullable = true)
| |-- FIELD_5: long (nullable = true)
| |-- FIELD_6: integer (nullable = true)
|-- STRUCT3: struct (nullable = true)
| |-- FIELD_7: string (nullable = true)
| |-- FIELD_8: long (nullable = true)
| |-- FIELD_9: integer (nullable = true)
|-- ARRAYSTRUCT4: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- FIELD_10: integer (nullable = true)
| | |-- FIELD_11: integer (nullable = true)
+-------+------------+------------+------------------+
|STRUCT1| STRUCT2 | STRUCT3 | ARRAYSTRUCT4 |
+-------+------------+------------+------------------+
|[1,2,3]|[aa, xx, yy]|[p1, q2, r3]|[[1a, 2b],[3c,4d]]|
+-------+------------+------------+------------------+
I want to convert this into:
1. A dataset where the structs are expanded into columns.
2. A dataset where the array (ARRAYSTRUCT4) is exploded into rows.
root
|-- FIELD_1: string (nullable = true)
|-- FIELD_2: long (nullable = true)
|-- FIELD_3: integer (nullable = true)
|-- FIELD_4: string (nullable = true)
|-- FIELD_5: long (nullable = true)
|-- FIELD_6: integer (nullable = true)
|-- FIELD_7: string (nullable = true)
|-- FIELD_8: long (nullable = true)
|-- FIELD_9: integer (nullable = true)
|-- FIELD_10: integer (nullable = true)
|-- FIELD_11: integer (nullable = true)
+-------+------------+------------+---------+ ---------+----------+
|FIELD_1| FIELD_2 | FIELD_3 | FIELD_4 | |FIELD_10| FIELD_11 |
+-------+------------+------------+---------+ ... ---------+----------+
|1 |2 |3 | aa | | 1a | 2b |
+-------+------------+------------+-----------------------------------+
To achieve this, I could use:
val expanded = df.select("STRUCT1.*", "STRUCT2.*", "STRUCT3.*", "STRUCT4")
followed by an explode:
val exploded = expanded.select(explode(expanded("ARRAYSTRUCT4")))
However, I was wondering if there's a more functional way to do this, especially the select. I could use withColumn as below:
data.withColumn("FIELD_1", $"STRUCT1".getItem(0))
.withColumn("FIELD_2", $"STRUCT1".getItem(1))
.....
But I have 80+ columns. Is there a better way to achieve this?
You can first make all columns struct-type by exploding any array-of-struct columns into struct columns via foldLeft, then use map to expand each struct column name into col("colName.*"), as shown below:
import org.apache.spark.sql.functions._
case class S1(FIELD_1: String, FIELD_2: Long, FIELD_3: Int)
case class S2(FIELD_4: String, FIELD_5: Long, FIELD_6: Int)
case class S3(FIELD_7: String, FIELD_8: Long, FIELD_9: Int)
case class S4(FIELD_10: Int, FIELD_11: Int)
val df = Seq(
(S1("a1", 101, 11), S2("a2", 102, 12), S3("a3", 103, 13), Array(S4(1, 1), S4(3, 3))),
(S1("b1", 201, 21), S2("b2", 202, 22), S3("b3", 203, 23), Array(S4(2, 2), S4(4, 4)))
).toDF("STRUCT1", "STRUCT2", "STRUCT3", "ARRAYSTRUCT4")
// +-----------+-----------+-----------+--------------+
// | STRUCT1| STRUCT2| STRUCT3| ARRAYSTRUCT4|
// +-----------+-----------+-----------+--------------+
// |[a1,101,11]|[a2,102,12]|[a3,103,13]|[[1,1], [3,3]]|
// |[b1,201,21]|[b2,202,22]|[b3,203,23]|[[2,2], [4,4]]|
// +-----------+-----------+-----------+--------------+
val arrayCols = df.dtypes.filter( t => t._2.startsWith("ArrayType(StructType") ).
map(_._1)
// arrayCols: Array[String] = Array(ARRAYSTRUCT4)
val expandedDF = arrayCols.foldLeft(df)((accDF, c) =>
accDF.withColumn(c.replace("ARRAY", ""), explode(col(c))).drop(c)
)
val structCols = expandedDF.columns
expandedDF.select(structCols.map(c => col(s"$c.*")): _*).
show
// +-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+
// |FIELD_1|FIELD_2|FIELD_3|FIELD_4|FIELD_5|FIELD_6|FIELD_7|FIELD_8|FIELD_9|FIELD_10|FIELD_11|
// +-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+
// | a1| 101| 11| a2| 102| 12| a3| 103| 13| 1| 1|
// | a1| 101| 11| a2| 102| 12| a3| 103| 13| 3| 3|
// | b1| 201| 21| b2| 202| 22| b3| 203| 23| 2| 2|
// | b1| 201| 21| b2| 202| 22| b3| 203| 23| 4| 4|
// +-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+
Note that for simplicity it's assumed that your DataFrame has only struct and Array(struct)-type columns. If there are other data types, just apply filtering conditions to arrayCols and structCols accordingly.
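A minimal sketch of that filtering (hypothetical names, not part of the original answer): expand only the struct-typed columns and carry any remaining columns through unchanged.
// keep only struct-typed columns for the ".*" expansion
val structTypedCols = expandedDF.dtypes.collect { case (name, tpe) if tpe.startsWith("StructType") => name }
val otherCols = expandedDF.columns.filterNot(structTypedCols.contains)

expandedDF.select(otherCols.map(col) ++ structTypedCols.map(c => col(s"$c.*")): _*)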

Spark GroupBy agg collect_list multiple columns

I have a question similar to this one, but the columns to be aggregated with collect_list are given by a list of names. For example:
scala> w.show
+---+-----+----+-----+
|iid|event|date|place|
+---+-----+----+-----+
| A| D1| T0| P1|
| A| D0| T1| P2|
| B| Y1| T0| P3|
| B| Y2| T2| P3|
| C| H1| T0| P5|
| C| H0| T9| P5|
| B| Y0| T1| P2|
| B| H1| T3| P6|
| D| H1| T2| P4|
+---+-----+----+-----+
scala> val combList = List("event", "date", "place")
combList: List[String] = List(event, date, place)
scala> val v = w.groupBy("iid").agg(collect_list(combList(0)), collect_list(combList(1)), collect_list(combList(2)))
v: org.apache.spark.sql.DataFrame = [iid: string, collect_list(event): array<string> ... 2 more fields]
scala> v.show
+---+-------------------+------------------+-------------------+
|iid|collect_list(event)|collect_list(date)|collect_list(place)|
+---+-------------------+------------------+-------------------+
| B| [Y1, Y2, Y0, H1]| [T0, T2, T1, T3]| [P3, P3, P2, P6]|
| D| [H1]| [T2]| [P4]|
| C| [H1, H0]| [T0, T9]| [P5, P5]|
| A| [D1, D0]| [T0, T1]| [P1, P2]|
+---+-------------------+------------------+-------------------+
Is there any way I can apply collect_list to multiple columns inside agg without knowing the number of elements in combList beforehand?
You can use collect_list(struct(col1, col2)) AS elements.
Example:
df.select("cd_issuer", "cd_doc", "cd_item", "nm_item").printSchema
val outputDf = spark.sql(s"SELECT cd_issuer, cd_doc, collect_list(struct(cd_item, nm_item)) AS item FROM teste GROUP BY cd_issuer, cd_doc")
outputDf.printSchema
df
|-- cd_issuer: string (nullable = true)
|-- cd_doc: string (nullable = true)
|-- cd_item: string (nullable = true)
|-- nm_item: string (nullable = true)
outputDf
|-- cd_issuer: string (nullable = true)
|-- cd_doc: string (nullable = true)
|-- item: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- cd_item: string (nullable = true)
| | |-- nm_item: string (nullable = true)
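To address the dynamic part of the question directly (a combList whose length is not known in advance), the same idea can be written with the DataFrame API by building one collect_list expression per name and splatting them into agg. A sketch, reusing w and combList from the question:
import org.apache.spark.sql.functions.{col, collect_list, struct}

// one collect_list(...) per column name in combList
val aggExprs = combList.map(c => collect_list(col(c)).as(s"collect_list($c)"))
val v2 = w.groupBy("iid").agg(aggExprs.head, aggExprs.tail: _*)
v2.show()

// or, following the struct approach above, collect everything into one array of structs:
val v3 = w.groupBy("iid").agg(collect_list(struct(combList.map(col): _*)).as("elements"))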

Is this a valid Spark Schema?

I was in the process of flattening a Spark schema using the method suggested here when I came across an edge case:
import org.apache.spark.sql.types._

val writerSchema = StructType(Seq(
  StructField("f1", ArrayType(ArrayType(
    StructType(Seq(
      StructField("f2", ArrayType(LongType))
    ))
  )))
))
writerSchema.printTreeString()
root
|-- f1: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: struct (containsNull = true)
| | | |-- f2: array (nullable = true)
| | | | |-- element: long (containsNull = true)
Flattening this schema produces only f1, and not
f1
f1.f2
as I expected.
Questions:
Is writerSchema a valid Spark schema?
How do I handle ArrayType objects when flattening the schema?
If you want to handle data like this:
val json = """{"f1": [{"f2": [1, 2, 3] }, {"f2": [4,5,6]}, {"f2": [7,8,9]}, {"f2": [10,11,12]}]}"""
then the valid schema is:
val writerSchema = StructType(Seq(
  StructField("f1", ArrayType(
    StructType(Seq(
      StructField("f2", ArrayType(LongType))
    ))
  ))
))
root
|-- f1: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- f2: array (nullable = true)
| | | |-- element: long (containsNull = true)
You shouldn't be putting an ArrayType inside another ArrayType.
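For reference, one way to build a DataFrame with that schema from the JSON string above (a sketch, assuming Spark 2.2+ and a SparkSession named spark):
import spark.implicits._

// parse the single JSON line with the proposed schema
val inputDF = spark.read.schema(writerSchema).json(Seq(json).toDS)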
So let's suppose you have a DataFrame inputDF:
inputDF.printSchema
root
|-- f1: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- f2: array (nullable = true)
| | | |-- element: long (containsNull = true)
inputDF.show(false)
+-------------------------------------------------------------------------------------------------------+
|f1 |
+-------------------------------------------------------------------------------------------------------+
|[[WrappedArray(1, 2, 3)], [WrappedArray(4, 5, 6)], [WrappedArray(7, 8, 9)], [WrappedArray(10, 11, 12)]]|
+-------------------------------------------------------------------------------------------------------+
To flatten this DataFrame, we can explode the array columns (f1 and f2).
First, flatten column f1:
val semiFlattenDF = inputDF.select(explode(col("f1"))).select(col("col.*"))
semiFlattenDF.printSchema
root
|-- f2: array (nullable = true)
| |-- element: long (containsNull = true)
semiFlattenDF.show
+------------+
| f2|
+------------+
| [1, 2, 3]|
| [4, 5, 6]|
| [7, 8, 9]|
|[10, 11, 12]|
+------------+
Now flatten column f2, aliasing the resulting column as value:
val fullyFlattenDF = semiFlattenDF.select(explode(col("f2")).as("value"))
So now the DataFrame is flattened:
fullyFlattenDF.printSchema
root
|-- value: long (nullable = true)
fullyFlattenDF.show
+-----+
|value|
+-----+
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
| 10|
| 11|
| 12|
+-----+
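The steps above flatten this particular schema by hand. If you need the same for arbitrary nesting (the second question), a generic recursive helper along the following lines can work. This is only a sketch, not part of the original answer: it renames expanded struct fields to parent_field, uses explode_outer, and multiplies the row count by the array lengths.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode_outer}
import org.apache.spark.sql.types.{ArrayType, StructType}

def flattenDF(df: DataFrame): DataFrame = {
  val arrayCol  = df.schema.fields.collectFirst { case f if f.dataType.isInstanceOf[ArrayType]  => f.name }
  val structCol = df.schema.fields.collectFirst { case f if f.dataType.isInstanceOf[StructType] => f.name }

  (arrayCol, structCol) match {
    case (Some(a), _) =>
      // replace the array column by its exploded elements, then keep flattening
      flattenDF(df.withColumn(a, explode_outer(col(a))))
    case (None, Some(s)) =>
      // pull the struct fields up as <struct>_<field> columns, then keep flattening
      val structFields = df.schema(s).dataType.asInstanceOf[StructType].fieldNames
        .map(f => col(s"$s.$f").as(s"${s}_$f"))
      val otherCols = df.columns.filterNot(_ == s).map(col)
      flattenDF(df.select(otherCols ++ structFields: _*))
    case _ => df
  }
}

// e.g. flattenDF(inputDF) ends up with a single column f1_f2 of type long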