How to get all combinations of an array column in Spark? - scala

Suppose I have an array column group_ids
+-------+----------+
|user_id|group_ids |
+-------+----------+
|1 |[5, 8] |
|3 |[1, 2, 3] |
|2 |[1, 4] |
+-------+----------+
Schema:
root
 |-- user_id: integer (nullable = false)
 |-- group_ids: array (nullable = false)
 |    |-- element: integer (containsNull = false)
I want to get all combinations of pairs:
+-------+------------------------+
|user_id|group_ids |
+-------+------------------------+
|1 |[[5, 8]] |
|3 |[[1, 2], [1, 3], [2, 3]]|
|2 |[[1, 4]] |
+-------+------------------------+
So far I have created the simplest solution, with a UDF:
spark.udf.register("permutate", udf((xs: Seq[Int]) => xs.combinations(2).toSeq))
dataset.withColumn("group_ids", expr("permutate(group_ids)"))
What I'm looking for is something implemented via Spark built-in functions. Is there a way to implement the same logic without a UDF?
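For reference, the example DataFrame can be recreated like this (a minimal sketch, assuming a spark-shell or notebook session where the spark implicits are available):
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1, Seq(5, 8)),
  (3, Seq(1, 2, 3)),
  (2, Seq(1, 4))
).toDF("user_id", "group_ids")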

Some higher-order functions can do the trick (requires Spark >= 2.4). The inner transform/array_repeat/arrays_zip builds the cross product of group_ids with itself, and the outer filter keeps each unordered pair exactly once (x[0] < x[1]):
val df2 = df.withColumn(
  "group_ids",
  expr("""
    filter(
      transform(
        flatten(
          transform(
            group_ids,
            x -> arrays_zip(
              array_repeat(x, size(group_ids)),
              group_ids
            )
          )
        ),
        x -> array(x['0'], x['group_ids'])
      ),
      x -> x[0] < x[1]
    )
  """)
)
df2.show(false)
+-------+------------------------+
|user_id|group_ids |
+-------+------------------------+
|1 |[[5, 8]] |
|3 |[[1, 2], [1, 3], [2, 3]]|
|2 |[[1, 4]] |
+-------+------------------------+
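For what it's worth, a slightly shorter variant of the same idea (an untested sketch, also Spark >= 2.4): build the full cross product with two nested transforms, then keep each unordered pair once:
val df3 = df.withColumn(
  "group_ids",
  expr("""
    filter(
      flatten(transform(group_ids, x -> transform(group_ids, y -> array(x, y)))),
      p -> p[0] < p[1]
    )
  """)
)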

You can get the max size of the column group_ids. Then, generate the index combinations over the range 1 to maxSize, use when expressions to build the sub-array pairs from the original array, and finally filter the null elements out of the resulting array:
val maxSize = df.select(max(size($"group_ids"))).first.getAs[Int](0)
val newCol = (1 to maxSize).combinations(2)
  .map(c =>
    when(
      size($"group_ids") >= c(1),
      array(element_at($"group_ids", c(0)), element_at($"group_ids", c(1)))
    )
  ).toSeq

df.withColumn("group_ids", array(newCol: _*))
  .withColumn("group_ids", expr("filter(group_ids, x -> x is not null)"))
  .show(false)
//+-------+------------------------+
//|user_id|group_ids |
//+-------+------------------------+
//|1 |[[5, 8]] |
//|3 |[[1, 2], [1, 3], [2, 3]]|
//|2 |[[1, 4]] |
//+-------+------------------------+

A solution based on explode and joins:
val exploded = df.select(col("user_id"), explode(col("group_ids")).as("e"))

// build all combinations via a self-join on user_id
val joined1 = exploded.as("t1")
  .join(exploded.as("t2"), Seq("user_id"), "outer")
  .select(col("user_id"), col("t1.e").as("e1"), col("t2.e").as("e2"))

// filter out redundant combinations
val joined2 = joined1.as("t1")
  .join(joined1.as("t2"), $"t1.user_id" === $"t2.user_id" && $"t1.e1" === $"t2.e2" && $"t1.e2" === $"t2.e1")
  .where("t1.e1 < t2.e1")
  .select("t1.*")

// group back into an array per user
val result = joined2.groupBy("user_id")
  .agg(collect_set(struct("e1", "e2")).as("group_ids"))
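As a quick sanity check on the example data (a sketch; collect_set gives no ordering guarantee, and the pairs come back as structs (e1, e2) rather than arrays):
result.orderBy("user_id").show(false)

// if plain arrays are needed, the structs can be converted afterwards (Spark >= 2.4)
val asArrays = result.withColumn("group_ids", expr("transform(group_ids, p -> array(p.e1, p.e2))"))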

Related

Spark: convert array of arrays to array of strings?

I have a column of type array of arrays and I need to get a column of type array of strings.
+--------------------------+
|field |
+--------------------------+
|[[1, 2, 3], [1, 2, 3], []]|
+--------------------------+
I need to get:
+--------------------------+
|field |
+--------------------------+
|["123", "123", ""] |
+--------------------------+
Can I do this in Spark without using a UDF?
You can use the transform higher-order function:
import spark.implicits._
val df = Seq(Seq(Seq(1,2,3), Seq(1,2,3), Seq())).toDF("field")
df.withColumn("field", expr("transform(field, v->concat_ws('',v))"))
.show
+------------+
| field|
+------------+
|[123, 123, ]|
+------------+
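On Spark 3.x the same thing can also be written with the typed DataFrame API instead of a SQL expr (a sketch; functions.transform takes a Column => Column lambda since 3.0):
import org.apache.spark.sql.functions.{transform, concat_ws, col}

df.withColumn("field", transform(col("field"), v => concat_ws("", v))).show()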

Pyspark group by collect list, to_json and pivot

Summary: Combining multiple rows to columns for a user
Input DF:
+---+--------+----+----+----+----+
|Id |group   |A1  |A2  |B1  |B2  |
+---+--------+----+----+----+----+
|1  |Alpha   |1   |2   |null|null|
|1  |AlphaNew|6   |8   |null|null|
|2  |Alpha   |7   |4   |null|null|
|2  |Beta    |null|null|3   |9   |
+---+--------+----+----+----+----+
Note: The group values are dynamic
Expected Output DF:
+---+--------+--------+-----------+-----------+-------+-------+
|Id |Alpha_A1|Alpha_A2|AlphaNew_A1|AlphaNew_A2|Beta_B1|Beta_B2|
+---+--------+--------+-----------+-----------+-------+-------+
|1  |1       |2       |6          |8          |null   |null   |
|2  |7       |4       |null       |null       |3      |9      |
+---+--------+--------+-----------+-----------+-------+-------+
Attempted Solution:
I thought of making a json of non-null columns for each row, then a group by and concat_list of maps. Then I can explode the json to get the expected output.
But I am stuck at the stage of a nested json. Here is my code:
vcols = df.columns[2:]

df\
    .withColumn('json', F.to_json(F.struct(*vcols)))\
    .groupby('id')\
    .agg(
        F.to_json(
            F.collect_list(
                F.create_map('group', 'json')
            )
        )
    ).alias('json')
+---+-------------------------------------------------+
|Id |json                                             |
+---+-------------------------------------------------+
|1  |[{Alpha: {A1:1, A2:2}}, {AlphaNew: {A1:6, A2:8}}]|
|2  |[{Alpha: {A1:7, A2:4}}, {Beta: {B1:3, B2:9}}]    |
+---+-------------------------------------------------+
What I am trying to get:
+---+---------------------------------------------------------+
|Id |json                                                     |
+---+---------------------------------------------------------+
|1  |[{Alpha_A1:1, Alpha_A2:2, AlphaNew_A1:6, AlphaNew_A2:8}] |
|2  |[{Alpha_A1:7, Alpha_A2:4, Beta_B1:3, Beta_B2:9}]         |
+---+---------------------------------------------------------+
I'd appreciate any help. I'm also trying to avoid UDFs as my real dataframe is quite big.
There's definitely a better way to do this, but I continued your to_json experiment.
Using UDFs:
After you get something like [{Alpha: {A1:1, A2:2}}, {AlphaNew: {A1:6, A2:8}}] you could create a UDF to flatten the dict. But since it's a JSON string, you'll have to parse it to a dict and then back to JSON again.
After that you would like to explode and pivot the table, but that's not possible with JSON strings, so you have to use F.from_json with a defined schema. That will give you a MapType which you can explode and pivot.
Here's an example:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from collections.abc import MutableMapping  # collections.MutableMapping was removed in Python 3.10
import json
from pyspark.sql.types import (
    ArrayType,
    IntegerType,
    MapType,
    StringType,
)


def flatten_dict(d, parent_key="", sep="_"):
    items = []
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        if isinstance(v, MutableMapping):
            items.extend(flatten_dict(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)


def flatten_groups(data):
    result = []
    for item in json.loads(data):
        result.append(flatten_dict(item))
    return json.dumps(result)


if __name__ == "__main__":
    spark = SparkSession.builder.master("local").appName("Test").getOrCreate()

    data = [
        (1, "Alpha", 1, 2, None, None),
        (1, "AlphaNew", 6, 8, None, None),
        (2, "Alpha", 7, 4, None, None),
        (2, "Beta", None, None, 3, 9),
    ]
    columns = ["Id", "group", "A1", "A2", "B1", "B2"]
    df = spark.createDataFrame(data, columns)

    vcols = df.columns[2:]

    df = (
        df.withColumn("json", F.struct(*vcols))
        .groupby("id")
        .agg(F.to_json(F.collect_list(F.create_map("group", "json"))).alias("json"))
    )

    # Flatten groups
    flatten_groups_udf = F.udf(lambda x: flatten_groups(x))
    schema = ArrayType(MapType(StringType(), IntegerType()))
    df = df.withColumn("json", F.from_json(flatten_groups_udf(F.col("json")), schema))

    # Explode and pivot
    df = df.select(F.col("id"), F.explode(F.col("json")).alias("json"))
    df = (
        df.select("id", F.explode("json"))
        .groupby("id")
        .pivot("key")
        .agg(F.first("value"))
    )
At the end the dataframe looks like:
+---+-----------+-----------+--------+--------+-------+-------+
|id |AlphaNew_A1|AlphaNew_A2|Alpha_A1|Alpha_A2|Beta_B1|Beta_B2|
+---+-----------+-----------+--------+--------+-------+-------+
|1 |6 |8 |1 |2 |null |null |
|2 |null |null |7 |4 |3 |9 |
+---+-----------+-----------+--------+--------+-------+-------+
Without UDFs:
vcols = df.columns[2:]

df = (
    df.withColumn("json", F.to_json(F.struct(*vcols)))
    .groupby("id")
    .agg(
        F.collect_list(
            F.create_map(
                "group", F.from_json("json", MapType(StringType(), IntegerType()))
            )
        ).alias("json")
    )
)

df = df.withColumn("json", F.explode(F.col("json")).alias("json"))
df = df.select("id", F.explode(F.col("json")).alias("root", "value"))
df = df.select("id", "root", F.explode(F.col("value")).alias("sub", "value"))
df = df.select(
    "id", F.concat(F.col("root"), F.lit("_"), F.col("sub")).alias("name"), "value"
)
df = df.groupBy(F.col("id")).pivot("name").agg(F.first("value"))
Result:
+---+-----------+-----------+--------+--------+-------+-------+
|id |AlphaNew_A1|AlphaNew_A2|Alpha_A1|Alpha_A2|Beta_B1|Beta_B2|
+---+-----------+-----------+--------+--------+-------+-------+
|1 |6 |8 |1 |2 |null |null |
|2 |null |null |7 |4 |3 |9 |
+---+-----------+-----------+--------+--------+-------+-------+
I found a slightly better way than the json approach:
Stack the input dataframe's value columns A1, A2, B1, B2, ... as rows.
The structure then looks like id, group, sub, value, where sub holds the column name (A1, A2, B1, B2) and value holds the associated value.
Filter out the rows whose value is null.
Now we are able to pivot by group. Since the null-value rows are removed, we won't have the initial issue of the pivot creating extra columns.
import pyspark.sql.functions as F

data = [
    (1, "Alpha", 1, 2, None, None),
    (1, "AlphaNew", 6, 8, None, None),
    (2, "Alpha", 7, 4, None, None),
    (2, "Beta", None, None, 3, 9),
]
columns = ["id", "group", "A1", "A2", "B1", "B2"]
df = spark.createDataFrame(data, columns)

# Value columns that need to be stacked
vcols = df.columns[2:]
expr_str = ', '.join([f"'{i}', {i}" for i in vcols])
expr_str = f"stack({len(vcols)}, {expr_str}) as (sub, value)"

df = df\
    .selectExpr("id", "group", expr_str)\
    .filter(F.col("value").isNotNull())\
    .select("id", F.concat("group", F.lit("_"), "sub").alias("group"), "value")\
    .groupBy("id")\
    .pivot("group")\
    .agg(F.first("value"))

df.show()
Result:
+---+-----------+-----------+--------+--------+-------+-------+
| id|AlphaNew_A1|AlphaNew_A2|Alpha_A1|Alpha_A2|Beta_B1|Beta_B2|
+---+-----------+-----------+--------+--------+-------+-------+
| 1| 6| 8| 1| 2| null| null|
| 2| null| null| 7| 4| 3| 9|
+---+-----------+-----------+--------+--------+-------+-------+

Scala - Calculate using last elements for the arrays

I have a dataframe with the below structure:
ID: string
Amt: long
Col: array
    element: struct
        Seq: int
        Pct: double
        Sh: double
Dataframe output
+----+-------+------------------------------------------+
|ID  |Amt    |col                                       |
+----+-------+------------------------------------------+
|ABC |23077  |[[1, 1.5, 1, 10000], [2, 1.2, 2.5, 40000]]|
+----+-------+------------------------------------------+
I need to do the following calculation:
The last element of the first array stays the same, 10000.
For the next array element, I need to subtract the previous value from it (40000 - 10000) and get 30000 as the output.
Expected output
+----+-------+------------------------------------------+
|ID  |Amt    |col1                                      |
+----+-------+------------------------------------------+
|ABC |23077  |[[1, 1.5, 1, 10000], [2, 1.2, 2.5, 30000]]|
+----+-------+------------------------------------------+
How would I achieve this?
You can use transform with the element index and subtract the previous entry's Amt:
val df2 = df.withColumn(
  "col",
  expr("""
    transform(
      col,
      (x, i) -> struct(
        x.Seq as Seq, x.Pct as Pct, x.Sh as Sh,
        case when i = 0 then x.Amt else x.Amt - col[i-1].Amt end as Amt
      )
    )
  """)
)
df2.show(false)
+-----+---+--------------------------------------------+
|Amt |ID |col |
+-----+---+--------------------------------------------+
|23077|ABC|[[1, 1.5, 1.0, 10000], [2, 1.2, 2.5, 30000]]|
+-----+---+--------------------------------------------+
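To reproduce the example, one way to build the input is sketched below (an assumption: the array struct actually has a fourth field Amt, which the schema listing in the question omits but both the sample data and the answer use):
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("ABC", 23077L, Seq((1, 1.5, 1.0, 10000L), (2, 1.2, 2.5, 40000L)))
).toDF("ID", "Amt", "col")
  // rename the tuple fields _1.._4 to the names used by the transform expression above
  .withColumn("col", expr("transform(col, x -> struct(x._1 as Seq, x._2 as Pct, x._3 as Sh, x._4 as Amt))"))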

Scala Transformation and action

I have an RDD List[(String, List[Int])] like List(("A",List(1,2,3,4)),("B",List(5,6,7)))
How to transform them to List(("A",1),("A",2),("A",3),("A",4),("B",5),("B",6),("B",7))
Then the action would be reducing by key and generating a result like List(("A",2.5), ("B",6)).
I have tried using map(e => List(e._1, e._2)) but it's not giving the desired result.
(2.5 is the average for "A" and 6 is the average for "B".)
Help me with these set of transformation and actions.
Thanks in advance
There are several ways to get what you want. You could use a for comprehension as well, but the very first one that came to my mind is this implementation:
val l = List(("A", List(1, 2, 3)), ("B", List(1, 2, 3)))

val flattenList = l.flatMap {
  case (elem, _elemList) =>
    _elemList.map((elem, _))
}
Output:
List((A,1), (A,2), (A,3), (B,1), (B,2), (B,3))
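If you then need the per-key averages from the flattened pairs on an RDD (the second half of the question), here is a sketch using aggregateByKey, assuming sc is the SparkContext:
val pairs = sc.parallelize(flattenList) // RDD[(String, Int)]

// keep a running (sum, count) per key, then derive the average in one pass
val avgByKey = pairs
  .aggregateByKey((0L, 0L))(
    (acc, v) => (acc._1 + v, acc._2 + 1),
    (a, b) => (a._1 + b._1, a._2 + b._2)
  )
  .mapValues { case (sum, count) => sum.toDouble / count }

avgByKey.collect().foreach(println)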
If what you want is the average of each list in the end, then it's not necessary to break the lists up into individual elements with a flatMap. Doing so on a large data set would unnecessarily shuffle a lot of data.
Since they are already aggregated by key, just transform them with something like this:
val l = spark.sparkContext.parallelize(Seq(
  ("A", List(1, 2, 3, 4)),
  ("B", List(5, 6, 7))
))

val avg = l.map(r => {
  (r._1, (r._2.sum.toDouble / r._2.length.toDouble))
})

avg.collect.foreach(println)
avg.collect.foreach(println)
Bear in mind that this will produce NaN (from 0.0 / 0.0) rather than a useful average if any of your lists are empty. If you have zero-length lists, you'll have to put a check condition in the map; see the sketch after the output below.
The above code gives you:
(A,2.5)
(B,6.0)
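A guarded version of the same map (a sketch): empty lists are filtered out instead of producing NaN:
val avgSafe = l
  .filter { case (_, xs) => xs.nonEmpty }
  .map { case (k, xs) => (k, xs.sum.toDouble / xs.length) }

avgSafe.collect.foreach(println)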
You can try explode()
scala> val df = List(("A",List(1,2,3,4)),("B",List(5,6,7))).toDF("x","y")
df: org.apache.spark.sql.DataFrame = [x: string, y: array<int>]
scala> df.withColumn("z",explode('y)).show(false)
+---+------------+---+
|x |y |z |
+---+------------+---+
|A |[1, 2, 3, 4]|1 |
|A |[1, 2, 3, 4]|2 |
|A |[1, 2, 3, 4]|3 |
|A |[1, 2, 3, 4]|4 |
|B |[5, 6, 7] |5 |
|B |[5, 6, 7] |6 |
|B |[5, 6, 7] |7 |
+---+------------+---+
scala> val df2 = df.withColumn("z",explode('y))
df2: org.apache.spark.sql.DataFrame = [x: string, y: array<int> ... 1 more field]
scala> df2.groupBy("x").agg(sum('z)/count('z) ).show(false)
+---+-------------------+
|x |(sum(z) / count(z))|
+---+-------------------+
|B |6.0 |
|A |2.5 |
+---+-------------------+
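For what it's worth, the same aggregation could also use the built-in avg instead of sum/count (the values should match the table above):
df2.groupBy("x").agg(avg('z).as("avg_z")).show(false)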

Spark: repeat a word according to its frequency into another string column of a dataframe

I have a dataframe with this schema:
ocId: integer (nullable = true)
freq: integer (nullable = true)
nameFile: string (nullable = true)
word: string (nullable = true)
I want to repeat the value of word as many times as freq (word * freq) in a new column.
Examples:
freq = 2, word = analyses ----> new column: analyses,analyses.
freq = 3, word = carried ----> new column: carried,carried,carried.
freq = 1, word = atlantic ----> new column: atlantic.
freq = 2, word = hello ----> new column: hello,hello.
You can define a UDF similar to the following:
val df = Seq(
  (1, 2, "a", "analyses"),
  (2, 3, "b", "carried"),
  (3, 1, "c", "atlantic"),
  (4, 2, "d", "hello"),
  (5, 2, "e", ""),
  (6, 1, "f", null),
  (7, 0, "g", "blah")
).toDF("ocId", "freq", "nameFile", "word")

def multiWords = udf(
  (word: String, freq: Int) => word match {
    case null => null
    case "" => ""
    case _ => if (freq > 0) ((word + ",") * freq).dropRight(1) + "."
              else ""
  }
)

df.withColumn("multiWords", multiWords($"word", $"freq")).
  show(false)
// +----+----+--------+--------+------------------------+
// |ocId|freq|nameFile|word |multiWords |
// +----+----+--------+--------+------------------------+
// |1 |2 |a |analyses|analyses,analyses. |
// |2 |3 |b |carried |carried,carried,carried.|
// |3 |1 |c |atlantic|atlantic. |
// |4 |2 |d |hello |hello,hello. |
// |5 |2 |e | | |
// |6 |1 |f |null |null |
// |7 |0 |g |blah | |
// +----+----+--------+--------+------------------------+
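Since the questions above try to avoid UDFs: on Spark 2.4+ the common case can also be expressed with built-ins, although the null / empty / zero-frequency edge cases then behave differently from the UDF above (a sketch, not a drop-in replacement):
import org.apache.spark.sql.functions._

df.withColumn(
  "multiWords",
  concat(array_join(array_repeat(col("word"), col("freq")), ","), lit("."))
).show(false)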