Create a complex dataframe on (py)spark - pyspark

I want to know how I can create the following “complex” json structure in (py)Spark (2.3.2):
Test data set:
df = sc.parallelize([
    [1, 'a', 'x1'],
    [1, 'a', 'x2'],
    [2, 'b', 'y1'],
    [2, 'b', 'y2'],
    [3, 'c', 'z'],
]).toDF('id: integer, field1: string, field2: string').cache()
Code:
import pyspark.sql.functions as F
out = df.groupBy('id').agg(F.to_json(F.create_map(
    F.lit('key1'),
    F.col('field1'),
    F.lit('info'),
    F.collect_list(F.create_map(
        F.lit('number'),
        F.col('id'),
        F.lit('key2'),
        F.col('field2')
    ))
)).alias('info'))
My target structure is a data set like this:
[
  (id) 1, (info) {"key1": "a", "info": [{"number": 1, "key2": "x1"}, {"number": 1, "key2": "x2"}]},
  (id) 2, (info) {"key1": "b", "info": [{"number": 2, "key2": "y1"}, {"number": 2, "key2": "y2"}]},
  (id) 3, (info) {"key1": "c", "info": [{"number": 3, "key2": "z"}]}
]
How could I achieve this? (Can I achieve this?)
As I'm always getting the following error:
org.apache.spark.sql.AnalysisException:
cannot resolve 'map('key1', `field1`, 'info', collect_list(map('number',
CAST(`id` AS STRING), 'key2', CAST(`field2` AS STRING))))'
due to data type mismatch: The given values of function map should all be the same type,
but they are [string, array<map<string,string>>]
What I understand from this error is that field1 is a string, and the value of 'info' is not. But that's the way I want it to be...
So, could I achieve this another way?
Thanks!

I found one (hackish) way to do this... I don't like it very much, but seeing that no one in this community posted an answer, I'm starting to think it's not that easy.
So, first of all, I split the "big" aggregation in two:
out = df.groupBy('id', 'field1').agg(
    F.to_json(F.create_map(
        F.lit('key1'),
        F.col('field1'),
        F.lit('info'),
        F.lit('%%replace%%')
    )).alias('first'),
    F.to_json(F.collect_list(F.create_map(
        F.lit('number'),
        F.col('id'),
        F.lit('key2'),
        F.col('field2')
    ))).alias('second')
)
This will generate the following table:
+---+------+---------------------------------+-------------------------------------------------------+
|id |field1|first |second |
+---+------+---------------------------------+-------------------------------------------------------+
|3 |c |{"key1":"c","info":"%%replace%%"}|[{"number":"3","key2":"z"}] |
|2 |b |{"key1":"b","info":"%%replace%%"}|[{"number":"2","key2":"y1"},{"number":"2","key2":"y2"}]|
|1 |a |{"key1":"a","info":"%%replace%%"}|[{"number":"1","key2":"x1"},{"number":"1","key2":"x2"}]|
+---+------+---------------------------------+-------------------------------------------------------+
And now you combine the two:
df2 = out.withColumn('final', F.expr('REPLACE(first, \'"%%replace%%"\', second)')).drop('first').drop('second')
df2.show(10, False)
+---+------+---------------------------------------------------------------------------+
|id |field1|final |
+---+------+---------------------------------------------------------------------------+
|3 |c |{"key1":"c","info":[{"number":"3","key2":"z"}]} |
|2 |b |{"key1":"b","info":[{"number":"2","key2":"y1"},{"number":"2","key2":"y2"}]}|
|1 |a |{"key1":"a","info":[{"number":"1","key2":"x1"},{"number":"1","key2":"x2"}]}|
+---+------+---------------------------------------------------------------------------+
A bit unorthodox, but no complaints from Spark :)
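An alternative worth considering: unlike map values, the fields of a struct may each have their own type, so building the JSON from F.struct instead of F.create_map should sidestep the AnalysisException entirely. This is a sketch only, not verified against 2.3.2 (out2 is just an illustrative name; as with the workaround above, the numbers come out as JSON strings because they share the inner map with field2):

import pyspark.sql.functions as F

# struct fields, unlike map values, do not have to share a type
out2 = df.groupBy('id', 'field1').agg(
    F.to_json(F.struct(
        F.col('field1').alias('key1'),
        F.collect_list(F.create_map(
            F.lit('number'), F.col('id'),
            F.lit('key2'), F.col('field2')
        )).alias('info')
    )).alias('info')
)
out2.show(10, False)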

Related

Spark: convert array of arrays to array of strings?

I have a column of type array of arrays, and I need to get a column of array of strings.
+--------------------------+
|field |
+--------------------------+
|[[1, 2, 3], [1, 2, 3], []]|
+--------------------------+
I need to get:
+--------------------------+
|field |
+--------------------------+
|["123", "123", ""] |
+--------------------------+
Can I do this in Spark without using UDF?
You can use the transform higher-order function (Spark 2.4+):
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(Seq(Seq(1, 2, 3), Seq(1, 2, 3), Seq())).toDF("field")

df.withColumn("field", expr("transform(field, v -> concat_ws('', v))"))
  .show
+------------+
| field|
+------------+
|[123, 123, ]|
+------------+
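Since this page started from PySpark: the expression is plain Spark SQL, so the same thing can be written through expr there as well (a sketch, assuming Spark 2.4+ where transform is available):

from pyspark.sql.functions import expr

# an array<array<int>> column, as in the question
df = spark.createDataFrame([([[1, 2, 3], [1, 2, 3], []],)], ["field"])
df.withColumn("field", expr("transform(field, v -> concat_ws('', v))")).show()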

How to get all combinations of an array column in Spark?

Suppose I have an array column group_ids
+-------+----------+
|user_id|group_ids |
+-------+----------+
|1 |[5, 8] |
|3 |[1, 2, 3] |
|2 |[1, 4] |
+-------+----------+
Schema:
root
|-- user_id: integer (nullable = false)
|-- group_ids: array (nullable = false)
| |-- element: integer (containsNull = false)
I want to get all combinations of pairs:
+-------+------------------------+
|user_id|group_ids |
+-------+------------------------+
|1 |[[5, 8]] |
|3 |[[1, 2], [1, 3], [2, 3]]|
|2 |[[1, 4]] |
+-------+------------------------+
So far I've created the simplest solution, with a UDF:
spark.udf.register("permutate", udf((xs: Seq[Int]) => xs.combinations(2).toSeq))
dataset.withColumn("group_ids", expr("permutate(group_ids)"))
What I'm looking for is something implemented via Spark built-in functions. Is there a way to implement the same code without a UDF?
Some higher-order functions can do the trick. This requires Spark >= 2.4.
val df2 = df.withColumn(
  "group_ids",
  expr("""
    filter(
      transform(
        flatten(
          transform(
            group_ids,
            x -> arrays_zip(
              array_repeat(x, size(group_ids)),
              group_ids
            )
          )
        ),
        x -> array(x['0'], x['group_ids'])
      ),
      x -> x[0] < x[1]
    )
  """)
)
df2.show(false)
+-------+------------------------+
|user_id|group_ids |
+-------+------------------------+
|1 |[[5, 8]] |
|3 |[[1, 2], [1, 3], [2, 3]]|
|2 |[[1, 4]] |
+-------+------------------------+
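Because the whole thing is a Spark SQL expression, it can be reused unchanged from PySpark (a sketch, assuming the same df with a group_ids array column and Spark >= 2.4):

from pyspark.sql.functions import expr

df2 = df.withColumn("group_ids", expr("""
    filter(
      transform(
        flatten(
          transform(
            group_ids,
            x -> arrays_zip(array_repeat(x, size(group_ids)), group_ids)
          )
        ),
        x -> array(x['0'], x['group_ids'])
      ),
      x -> x[0] < x[1]
    )
"""))
df2.show(truncate=False)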
You can get the max size of the column group_ids. Then, use combinations on the range (1 to maxSize) together with when expressions to create the sub-array combinations from the original array, and finally filter out the null elements from the resulting array:
val maxSize = df.select(max(size($"group_ids"))).first.getAs[Int](0)

val newCol = (1 to maxSize).combinations(2)
  .map(c =>
    when(
      size($"group_ids") >= c(1),
      array(element_at($"group_ids", c(0)), element_at($"group_ids", c(1)))
    )
  ).toSeq

df.withColumn("group_ids", array(newCol: _*))
  .withColumn("group_ids", expr("filter(group_ids, x -> x is not null)"))
  .show(false)
//+-------+------------------------+
//|user_id|group_ids |
//+-------+------------------------+
//|1 |[[5, 8]] |
//|3 |[[1, 2], [1, 3], [2, 3]]|
//|2 |[[1, 4]] |
//+-------+------------------------+
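A rough PySpark equivalent of this approach, with itertools.combinations generating the index pairs (max_size and pair_cols are just illustrative names; assumes the same df as above):

from itertools import combinations
from pyspark.sql import functions as F

max_size = df.select(F.max(F.size("group_ids"))).first()[0]

# one candidate pair per combination of positions; null when the array is too short
pair_cols = [
    F.when(F.size("group_ids") >= j,
           F.array(F.element_at("group_ids", i), F.element_at("group_ids", j)))
    for i, j in combinations(range(1, max_size + 1), 2)
]

(df.withColumn("group_ids", F.array(*pair_cols))
   .withColumn("group_ids", F.expr("filter(group_ids, x -> x is not null)"))
   .show(truncate=False))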
A solution based on explode and joins:
val exploded = df.select(col("user_id"), explode(col("group_ids")).as("e"))

// to have combinations
val joined1 = exploded.as("t1")
  .join(exploded.as("t2"), Seq("user_id"), "outer")
  .select(col("user_id"), col("t1.e").as("e1"), col("t2.e").as("e2"))

// to filter out redundant combinations
val joined2 = joined1.as("t1")
  .join(joined1.as("t2"), $"t1.user_id" === $"t2.user_id" && $"t1.e1" === $"t2.e2" && $"t1.e2" === $"t2.e1")
  .where("t1.e1 < t2.e1")
  .select("t1.*")

// group into array
val result = joined2.groupBy("user_id")
  .agg(collect_set(struct("e1", "e2")).as("group_ids"))
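A PySpark sketch of the same explode-and-join idea; it takes a shortcut and keeps only pairs with e1 < e2 straight after the first join, which assumes the ids inside each array are distinct (as they are in the example):

from pyspark.sql import functions as F

exploded = df.select("user_id", F.explode("group_ids").alias("e"))

# self-join per user, keep each unordered pair exactly once
pairs = (exploded.alias("t1")
    .join(exploded.alias("t2"), "user_id")
    .where(F.col("t1.e") < F.col("t2.e"))
    .select("user_id", F.col("t1.e").alias("e1"), F.col("t2.e").alias("e2")))

result = pairs.groupBy("user_id").agg(
    F.collect_list(F.array("e1", "e2")).alias("group_ids")
)
result.show(truncate=False)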

Spark get all rows with same values in array in column

I have a Spark Dataframe with columns id and hashes, where the column hashes contains a Seq of integer values of length n. Example:
+---+---------------+
|id |hashes         |
+---+---------------+
|0  |[1, 2, 3, 4, 5]|
|1  |[1, 5, 3, 7, 9]|
|2  |[9, 3, 6, 8, 0]|
+---+---------------+
I want to get a dataframe with all the rows for which the arrays in hashes match in at least one position. More formally, I want a dataframe with an additional column matches that, for each row r, contains a Seq of ids of rows k where hashes[r][i] == hashes[k][i] for at least one value of i.
For my example data, the result would be:
+---+---------------+-------+
|id |hashes |matches|
+---+---------------+-------+
|0 |[1, 2, 3, 4, 5]|[1] |
|1 |[1, 5, 3, 7, 9]|[0] |
|2 |[9, 3, 6, 8, 0]|[] |
+---+---------------+-------+
In Spark 3, the following code compares arrays between rows, keeping only rows where the two arrays share at least one element at the same position. df is your input dataframe:
df.join(
    df.withColumnRenamed("id", "id2").withColumnRenamed("hashes", "hashes2"),
    exists(arrays_zip(col("hashes"), col("hashes2")), x => x("hashes") === x("hashes2"))
  )
  .groupBy("id")
  .agg(first(col("hashes")).as("hashes"), collect_list("id2").as("matches"))
  .withColumn("matches", filter(col("matches"), x => x.notEqual(col("id"))))
Detailed description
First, we perform a self cross join, filtered by your condition of at least one equal element at the same position in the two hashes arrays.
To build the condition, we zip the two hashes arrays: one from the first dataframe and one from the second, joined dataframe, which is just the first dataframe with its columns renamed. By zipping, we get an array of {"hashes": x, "hashes2": y} structs, and we then just need to check whether this array contains an element where x = y. The complete condition is written as follows:
exists(arrays_zip(col("hashes"), col("hashes2")), x => x("hashes") === x("hashes2"))
Then, we aggregate by column id to collect all the id2 values of the rows that were kept, i.e. the rows matching your condition. To keep the "hashes" column (for two rows with the same "id", the "hashes" columns are equal), we take the first occurrence of "hashes" for each "id", and we collect all the "id2" values using collect_list:
.agg(first(col("hashes")).as("hashes"), collect_list("id2").as("matches"))
And finally, we filter the id of the current row out of the "matches" column:
.withColumn("matches", filter(col("matches"), x => x.notEqual(col("id"))))
If you need the "id" column to be in order, you can add an orderBy clause:
.orderBy("id")
Run
With a dataframe df containing the following values:
+---+---------------+
|id |hashes |
+---+---------------+
|0 |[1, 2, 3, 4, 5]|
|1 |[1, 5, 3, 7, 9]|
|2 |[9, 3, 6, 8, 0]|
+---+---------------+
You get the following output:
+---+---------------+-------+
|id |hashes |matches|
+---+---------------+-------+
|0 |[1, 2, 3, 4, 5]|[1] |
|1 |[1, 5, 3, 7, 9]|[0] |
|2 |[9, 3, 6, 8, 0]|[] |
+---+---------------+-------+
Limits
The join is a cartesian product, which is very expensive. Although the condition filters the results, it can lead to a huge amount of computation and shuffle on big datasets, and may have very poor performance.
If your Spark version is older than 3.0, you will have to replace some of these built-in Spark functions with user-defined functions.
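For the PySpark side of this page, a rough equivalent sketch; it assumes Spark 3.1+, where exists and filter are exposed directly in pyspark.sql.functions (on 3.0 the same conditions can be written with expr):

from pyspark.sql import functions as F

renamed = df.withColumnRenamed("id", "id2").withColumnRenamed("hashes", "hashes2")

result = (df
    .join(renamed,
          F.exists(F.arrays_zip("hashes", "hashes2"),
                   lambda x: x["hashes"] == x["hashes2"]))
    .groupBy("id")
    .agg(F.first("hashes").alias("hashes"), F.collect_list("id2").alias("matches"))
    # drop the row's own id from its list of matches
    .withColumn("matches", F.filter("matches", lambda x: x != F.col("id"))))

result.show(truncate=False)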

Scala Transformation and action

I have an RDD of (String, List[Int]) pairs, like List(("A", List(1,2,3,4)), ("B", List(5,6,7))).
How do I transform it to List(("A",1), ("A",2), ("A",3), ("A",4), ("B",5), ("B",6), ("B",7))?
The action would then reduce by key and generate a result like List(("A", 2.5), ("B", 6.0)), where 2.5 is the average for "A" and 6.0 is the average for "B".
I have tried map(e => List(e._1, e._2)), but it's not giving the desired result.
Help me with this set of transformations and actions.
Thanks in advance
There are several ways to get what you want. You could use a for comprehension as well, but the first one that came to my mind is this implementation:
val l = List(("A", List(1, 2, 3)), ("B", List(1, 2, 3)))

val flattenList = l.flatMap {
  case (elem, _elemList) =>
    _elemList.map((elem, _))
}
Output:
List((A,1), (A,2), (A,3), (B,1), (B,2), (B,3))
If what you want in the end is the average of each list, then it's not necessary to break them up into individual elements with a flatMap. Doing so would unnecessarily shuffle a lot of data on a large data set.
Since they are already aggregated by key, just transform them with something like this:
val l = spark.sparkContext.parallelize(Seq(
  ("A", List(1, 2, 3, 4)),
  ("B", List(5, 6, 7))
))

val avg = l.map(r => {
  (r._1, r._2.sum.toDouble / r._2.length.toDouble)
})

avg.collect.foreach(println)
Bear in mind that this will fail if any of your lists are 0 length. If you have some 0 length lists, you'll have to put a check condition in the map.
The above code gives you:
(A,2.5)
(B,6.0)
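For completeness, the same computation in PySpark, which is what this page is nominally about (a sketch):

rdd = spark.sparkContext.parallelize([
    ("A", [1, 2, 3, 4]),
    ("B", [5, 6, 7]),
])

# average each list directly; guard against empty lists as noted above
avg = rdd.map(lambda kv: (kv[0], sum(kv[1]) / len(kv[1]) if kv[1] else None))
print(avg.collect())  # [('A', 2.5), ('B', 6.0)]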
You can try explode():
scala> val df = List(("A",List(1,2,3,4)),("B",List(5,6,7))).toDF("x","y")
df: org.apache.spark.sql.DataFrame = [x: string, y: array<int>]
scala> df.withColumn("z",explode('y)).show(false)
+---+------------+---+
|x |y |z |
+---+------------+---+
|A |[1, 2, 3, 4]|1 |
|A |[1, 2, 3, 4]|2 |
|A |[1, 2, 3, 4]|3 |
|A |[1, 2, 3, 4]|4 |
|B |[5, 6, 7] |5 |
|B |[5, 6, 7] |6 |
|B |[5, 6, 7] |7 |
+---+------------+---+
scala> val df2 = df.withColumn("z",explode('y))
df2: org.apache.spark.sql.DataFrame = [x: string, y: array<int> ... 1 more field]
scala> df2.groupBy("x").agg(sum('z)/count('z) ).show(false)
+---+-------------------+
|x |(sum(z) / count(z))|
+---+-------------------+
|B |6.0 |
|A |2.5 |
+---+-------------------+

How to delete duplicated pairs of nodes in Spark?

I have the following DataFrame in Spark:
nodeFrom  nodeTo  value  date
1         2       11     2016-10-12T12:10:00.000Z
1         2       12     2016-10-12T12:11:00.000Z
1         2       11     2016-10-12T12:09:00.000Z
4         2       34     2016-10-12T14:00:00.000Z
4         2       34     2016-10-12T14:00:00.000Z
5         3       11     2016-10-12T14:00:00.000Z
I need to delete duplicated pairs of nodeFrom and nodeTo, while taking the earliest and latest date and the average of the corresponding value entries.
The expected output is the following:
nodeFrom  nodeTo  value  date
1         2       11.5   [2016-10-12T12:09:00.000Z, 2016-10-12T12:11:00.000Z]
4         2       34     [2016-10-12T14:00:00.000Z]
5         3       11     [2016-10-12T14:00:00.000Z]
Using the struct function with min and max, only a single groupBy and agg step is necessary.
Assuming that this is your data:
val data = Seq(
  (1, 2, 11, "2016-10-12T12:10:00.000Z"),
  (1, 2, 12, "2016-10-12T12:11:00.000Z"),
  (1, 2, 11, "2016-10-12T12:09:00.000Z"),
  (4, 2, 34, "2016-10-12T14:00:00.000Z"),
  (4, 2, 34, "2016-10-12T14:00:00.000Z"),
  (5, 3, 11, "2016-10-12T14:00:00.000Z")
).toDF("nodeFrom", "nodeTo", "value", "date")

data.show()
You can get the average and the array with earliest/latest date as follows:
import org.apache.spark.sql.functions._

data
  .groupBy('nodeFrom, 'nodeTo).agg(
    min(struct('date, 'value)) as 'date1,
    max(struct('date, 'value)) as 'date2
  )
  .select(
    'nodeFrom, 'nodeTo,
    ($"date1.value" + $"date2.value") / 2.0d as 'value,
    array($"date1.date", $"date2.date") as 'date
  )
  .show(60, false)
This will give you almost what you want, with the minor difference that every array of dates has size 2:
+--------+------+-----+----------------------------------------------------+
|nodeFrom|nodeTo|value|date |
+--------+------+-----+----------------------------------------------------+
|1 |2 |11.5 |[2016-10-12T12:09:00.000Z, 2016-10-12T12:11:00.000Z]|
|5 |3 |11.0 |[2016-10-12T14:00:00.000Z, 2016-10-12T14:00:00.000Z]|
|4 |2 |34.0 |[2016-10-12T14:00:00.000Z, 2016-10-12T14:00:00.000Z]|
+--------+------+-----+----------------------------------------------------+
If you really (really?) want to eliminate the duplicates from the array column, it seems that the easiest way is to use a custom udf for that:
val elimDuplicates = udf((_: collection.mutable.WrappedArray[String]).distinct)
data
  .groupBy('nodeFrom, 'nodeTo).agg(
    min(struct('date, 'value)) as 'date1,
    max(struct('date, 'value)) as 'date2
  )
  .select(
    'nodeFrom, 'nodeTo,
    ($"date1.value" + $"date2.value") / 2.0d as 'value,
    elimDuplicates(array($"date1.date", $"date2.date")) as 'date
  )
  .show(60, false)
This will produce:
+--------+------+-----+----------------------------------------------------+
|nodeFrom|nodeTo|value|date |
+--------+------+-----+----------------------------------------------------+
|1 |2 |11.5 |[2016-10-12T12:09:00.000Z, 2016-10-12T12:11:00.000Z]|
|5 |3 |11.0 |[2016-10-12T14:00:00.000Z] |
|4 |2 |34.0 |[2016-10-12T14:00:00.000Z] |
+--------+------+-----+----------------------------------------------------+
Brief explanation:
min(struct('date, 'value)) as 'date1 selects the earliest date together with the corresponding value
max does the same for the latest date
The average is computed directly from these two tuples by summing the values and dividing by 2
The corresponding dates are written to an array column
(optional) the array is de-duplicated
Hope that helps.
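For PySpark readers: the min/max-of-struct trick translates directly, and on Spark 2.4+ array_distinct can take the place of the de-duplication udf (a sketch; d1 and d2 are just illustrative aliases):

from pyspark.sql import functions as F

data = spark.createDataFrame([
    (1, 2, 11, "2016-10-12T12:10:00.000Z"),
    (1, 2, 12, "2016-10-12T12:11:00.000Z"),
    (1, 2, 11, "2016-10-12T12:09:00.000Z"),
    (4, 2, 34, "2016-10-12T14:00:00.000Z"),
    (4, 2, 34, "2016-10-12T14:00:00.000Z"),
    (5, 3, 11, "2016-10-12T14:00:00.000Z"),
], ["nodeFrom", "nodeTo", "value", "date"])

result = (data
    .groupBy("nodeFrom", "nodeTo")
    .agg(F.min(F.struct("date", "value")).alias("d1"),   # earliest date + its value
         F.max(F.struct("date", "value")).alias("d2"))   # latest date + its value
    .select("nodeFrom", "nodeTo",
            ((F.col("d1.value") + F.col("d2.value")) / 2.0).alias("value"),
            F.array_distinct(F.array("d1.date", "d2.date")).alias("date")))

result.show(truncate=False)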
You could do a normal groupBy and then use a udf to build the date column as desired, like below:
val df = Seq(
  (1, 2, 11, "2016-10-12T12:10:00.000Z"),
  (1, 2, 12, "2016-10-12T12:11:00.000Z"),
  (1, 2, 11, "2016-10-12T12:09:00.000Z"),
  (4, 2, 34, "2016-10-12T14:00:00.000Z"),
  (4, 2, 34, "2016-10-12T14:00:00.000Z"),
  (5, 3, 11, "2016-10-12T14:00:00.000Z")
).toDF("nodeFrom", "nodeTo", "value", "date")

def zipDates = udf((date1: String, date2: String) => {
  if (date1 == date2)
    Seq(date1)
  else
    Seq(date1, date2)
})

val dfT = df
  .groupBy('nodeFrom, 'nodeTo)
  .agg(avg('value) as "value", min('date) as "minDate", max('date) as "maxDate")
  .select('nodeFrom, 'nodeTo, 'value, zipDates('minDate, 'maxDate) as "date")

dfT.show(10, false)
// +--------+------+------------------+----------------------------------------------------+
// |nodeFrom|nodeTo|value |date |
// +--------+------+------------------+----------------------------------------------------+
// |1 |2 |11.333333333333334|[2016-10-12T12:09:00.000Z, 2016-10-12T12:11:00.000Z]|
// |5 |3 |11.0 |[2016-10-12T14:00:00.000Z] |
// |4 |2 |34.0 |[2016-10-12T14:00:00.000Z] |
// +--------+------+------------------+----------------------------------------------------+