Many-to-many join with fact and dim table in Spark 3? - scala

I have two data frames, one with 10M records and the other with 100K records.
Let's say the schema of the first data frame is
create table fact_table
(
id string,
dim_id_list ARRAY<string>
);
where dim_id_list is an array with a length of 1-200.
and the schema of the dim table
create table dim_table
(
id string,
tag string
)
where 80K+ ids can have the same tag.
Example random data:
fact_table
id | dim_id_list
----------------------
1 | [a, b]
2 | [a, b, c]
3 | [a, b, c, d]
4 | [a, b, c, d, e]
5 | [a, b, c, d, e, f]
....
dim_table
id | tag
----------------------
a | john
b | foo
c | foo
d | foo
f | foo
g | bar
h | random
i | spark
.......
What I want to do is add unique tags to the fact table, as shown in the output
output:
id | dim_id_list | tag
---------------------------------
1 | [a, b] | john
1 | [a, b] | foo
2 | [a, b, c] | john
2 | [a, b, c] | foo
3 | [a, b, c, d] | john
3 | [a, b, c, d] | foo
4 | [a, b, c, d, e] | john
4 | [a, b, c, d, e] | foo
4 | [a, b, c, d, e] | random
5 | [a, b, c, d, e, f] | john
5 | [a, b, c, d, e, f] | foo
5 | [a, b, c, d, e, f] | random
5 | [a, b, c, d, e, f] | bar
.....
Basically a left join, but with unique tags.
The query I wrote for this is:
select fact_table.id, dim_id_list, tag
from fact_table
left join (select collect_list(id) as id_list, tag
           from dim_table
           group by tag) as dim_table
  on arrays_overlap(fact_table.dim_id_list, dim_table.id_list);
Spark Scala equivalent:
val dimDf_agg = dimDf
  .groupBy("tag")
  .agg(collect_set("id").as("id_list"))

val join_df = fact_df
  .join(broadcast(dimDf_agg),
        arrays_overlap(fact_df("dim_id_list"), dimDf_agg("id_list")),
        "left")
  .drop("id_list")
But this becomes very expensive to execute, since evaluating arrays_overlap against an aggregated id_list of up to 80K+ ids for every fact row is costly.
Is there a more optimized approach that I can follow?
Something like doing a broadcast (left) join on a hash of id_list and dim_id_list and just taking the first 'foo', or something else?

What you can do is explode the array (to generate one row per item in dim_id_list) and then join on that value. Then you can use distinct to remove the duplicates.
// assumes a SparkSession named `spark` is in scope (e.g. spark-shell);
// these imports provide toDF, symbol columns and explode
import org.apache.spark.sql.functions._
import spark.implicits._

// simply generating the data for reproducibility
val fact_df = Seq(
  1 -> Seq("a", "b"),
  2 -> Seq("a", "b", "c"),
  3 -> Seq("a", "b", "c", "d"),
  4 -> Seq("a", "b", "c", "d", "e"),
  5 -> Seq("a", "b", "c", "d", "e", "f")
).toDF("id", "dim_id_list")

val dim_df = Seq(
  "a" -> "john",
  "b" -> "foo",
  "c" -> "foo",
  "d" -> "foo",
  "f" -> "foo",
  "g" -> "bar",
  "h" -> "random",
  "i" -> "spark"
).toDF("id", "tag")
And the solution:
fact_df
  .withColumn("id_tag", explode('dim_id_list))
  .join(dim_df.select('id as "id_tag", 'tag), Seq("id_tag"))
  .drop("id_tag")
  .distinct
  .orderBy("id")
  .show
+---+------------------+----+
| id| dim_id_list| tag|
+---+------------------+----+
| 1| [a, b]|john|
| 1| [a, b]| foo|
| 2| [a, b, c]|john|
| 2| [a, b, c]| foo|
| 3| [a, b, c, d]|john|
| 3| [a, b, c, d]| foo|
| 4| [a, b, c, d, e]| foo|
| 4| [a, b, c, d, e]|john|
| 5|[a, b, c, d, e, f]|john|
| 5|[a, b, c, d, e, f]| foo|
+---+------------------+----+
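Since dim_table is only ~100K rows, a further refinement worth trying (just a sketch, not claimed by the answer above) is an explicit broadcast hint on the dim side of the explode-then-join, so the exploded 10M-row fact side is never shuffled:
import org.apache.spark.sql.functions.{broadcast, col, explode}

// Sketch: same explode-then-join, but broadcasting the small dim table;
// whether this helps depends on executor memory and on whether AQE already
// chooses a broadcast hash join on its own.
val result = fact_df
  .withColumn("id_tag", explode(col("dim_id_list")))
  .join(broadcast(dim_df.withColumnRenamed("id", "id_tag")), Seq("id_tag"))
  .drop("id_tag")
  .distinct()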

Related

Explode multiple columns into separate rows in Spark Scala

I have a DF in the following structure
Col1                Col2                Col3
Data1Col1,Data2Col1 Data1Col2,Data2Col2 Data1Col3,Data2Col3
I want the resultant dataset to be of the following type:
Col1      Col2      Col3
Data1Col1 Data1Col2 Data1Col3
Data2Col1 Data2Col2 Data2Col3
Please suggest how to approach this. I have tried explode, but that results in duplicate rows.
val df = Seq(("C,D,E,F","M,N,O,P","K,P,B,P")).toDF("Col1","Col2","Col3")
df.show
+-------+-------+-------+
| Col1| Col2| Col3|
+-------+-------+-------+
|C,D,E,F|M,N,O,P|K,P,B,P|
+-------+-------+-------+
val res1 = df
  .withColumn("Col1", split(col("Col1"), ","))
  .withColumn("Col2", split(col("Col2"), ","))
  .withColumn("Col3", split(col("Col3"), ","))
res1.show
+------------+------------+------------+
| Col1| Col2| Col3|
+------------+------------+------------+
|[C, D, E, F]|[M, N, O, P]|[K, P, B, P]|
+------------+------------+------------+
val zip = udf((x: Seq[String], y: Seq[String], z: Seq[String]) => z.zip(x.zip(y)))
val res14 = res1.withColumn("test", explode(zip(col("Col1"), col("Col2"), col("Col3"))))
res14.show
+------------+------------+------------+-----------+
| Col1| Col2| Col3| test|
+------------+------------+------------+-----------+
|[C, D, E, F]|[M, N, O, P]|[K, P, B, P]|[K, [C, M]]|
|[C, D, E, F]|[M, N, O, P]|[K, P, B, P]|[P, [D, N]]|
|[C, D, E, F]|[M, N, O, P]|[K, P, B, P]|[B, [E, O]]|
|[C, D, E, F]|[M, N, O, P]|[K, P, B, P]|[P, [F, P]]|
+------------+------------+------------+-----------+
res14
  .withColumn("t3", col("test._1"))
  .withColumn("tn", col("test._2"))
  .withColumn("t2", col("tn._2"))
  .withColumn("t1", col("tn._1"))
  .select("t1", "t2", "t3")
  .show
+---+---+---+
| t1| t2| t3|
+---+---+---+
| C| M| K|
| D| N| P|
| E| O| B|
| F| P| P|
+---+---+---+
res1 - initial DataFrame
res14 - intermediate DataFrame
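On Spark 2.4+ the custom zip UDF can usually be replaced with the built-in arrays_zip; a minimal sketch reusing res1 from above (it assumes the zipped struct fields keep the input column names, which holds for plain column references):
import org.apache.spark.sql.functions.{arrays_zip, col, explode}

// Zip the three arrays element-wise into structs, explode, then flatten the struct.
res1
  .select(explode(arrays_zip(col("Col1"), col("Col2"), col("Col3"))).as("z"))
  .select("z.*")
  .show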

How to filter spark dataframe entries based on a column value which is a map

I have a dataframe like this
+-------+------------------------+
|key | data|
+-------+------------------------+
| 61|[a -> b, c -> d, e -> f]|
| 71|[a -> 1, c -> d, e -> f]|
| 81|[c -> d, e -> f] |
| 91|[x -> b, y -> d, e -> f]|
| 11|[a -> a, c -> b, e -> f]|
| 21|[a -> a, c -> x, e -> f]|
+-------+------------------------+
I want to filter rows whose data column map contains the key 'a' and the value of key 'a' is 'a'. So the following dataframe is the desired output.
+-------+------------------------+
|key | data|
+-------+------------------------+
| 11|[a -> a, c -> b, e -> f]|
| 21|[a -> a, c -> x, e -> f]|
+-------+------------------------+
I tried casting the value to a map, but I am getting this error:
== SQL ==
Map
^^^
at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitPrimitiveDataType$1.apply(AstBuilder.scala:1673)
at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitPrimitiveDataType$1.apply(AstBuilder.scala:1651)
at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:108)
at org.apache.spark.sql.catalyst.parser.AstBuilder.visitPrimitiveDataType(AstBuilder.scala:1651)
at org.apache.spark.sql.catalyst.parser.AstBuilder.visitPrimitiveDataType(AstBuilder.scala:49)
at org.apache.spark.sql.catalyst.parser.SqlBaseParser$PrimitiveDataTypeContext.accept(SqlBaseParser.java:13779)
at org.apache.spark.sql.catalyst.parser.AstBuilder.typedVisit(AstBuilder.scala:55)
at org.apache.spark.sql.catalyst.parser.AstBuilder.org$apache$spark$sql$catalyst$parser$AstBuilder$$visitSparkDataType(AstBuilder.scala:1645)
at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleDataType$1.apply(AstBuilder.scala:90)
at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleDataType$1.apply(AstBuilder.scala:90)
at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:108)
at org.apache.spark.sql.catalyst.parser.AstBuilder.visitSingleDataType(AstBuilder.scala:89)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parseDataType$1.apply(ParseDriver.scala:40)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parseDataType$1.apply(ParseDriver.scala:39)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:98)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseDataType(ParseDriver.scala:39)
at org.apache.spark.sql.Column.cast(Column.scala:1017)
... 49 elided
If I just want to filter based on the column 'key', I can simply do df.filter(col("key") === 61). But the problem is that the value is a Map.
Is there anything like df.filter(col("data").toMap.contains("a") && col("data").toMap.get("a") === "a")?
You can filter like this: df.filter(col("data.x") === "a"), where x is the map key you want to look up inside data.
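A minimal sketch of applying that to the example above (assuming data is a MapType(StringType, StringType) column; the sample rows are reconstructed from the question):
import org.apache.spark.sql.functions.col
import spark.implicits._

// Reconstruction of a few sample rows; 'data' is a map column.
val df = Seq(
  (61, Map("a" -> "b", "c" -> "d", "e" -> "f")),
  (11, Map("a" -> "a", "c" -> "b", "e" -> "f")),
  (21, Map("a" -> "a", "c" -> "x", "e" -> "f"))
).toDF("key", "data")

// Dot notation (equivalently col("data")("a")) looks the key up in the map;
// rows where the key is missing yield null and drop out of the filter.
df.filter(col("data.a") === "a").show(false)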

Scala Spark: How to pad a sublist inside a dataframe with extra values?

Say I have a dataframe, originalDF, which looks like this
+--------+--------------+
|data_id |data_list |
+--------+--------------+
| 3| [a, b, d] |
| 2|[c, a, b, e] |
| 1| [g] |
+--------+--------------+
And I have another dataframe, extraInfoDF, which looks like this:
+--------+--------------+
|data_id |data_list |
+--------+--------------+
| 3| [q, w, x, a] |
| 2|[r, q, l, p] |
| 1| [z, k, j, f] |
+--------+--------------+
For the two data_lists in originalDF that are shorter than 4, I want to add in data from the corresponding data_lists in extraInfoDF so that each list has a length of 4.
The resulting dataframe would look like:
+--------+--------------+
|data_id |data_list |
+--------+--------------+
| 3| [a, b, d, q] |
| 2|[c, a, b, e] |
| 1|[g, z, k, j] |
+--------+--------------+
I was trying to find some way to iterate through each row in the dataframe and append to the list that way, but was having trouble. Now I'm wondering if there is a simpler way to accomplish this with a UDF?
You can append the second list to the first and take the left-most N elements in a UDF, as shown below:
import org.apache.spark.sql.functions._
import spark.implicits._
def padList(n: Int) = udf { (l1: Seq[String], l2: Seq[String]) =>
  (l1 ++ l2).take(n)
}

val df1 = Seq(
  (3, Seq("a", "b", "d")),
  (2, Seq("c", "a", "b", "e")),
  (1, Seq("g"))
).toDF("data_id", "data_list")

val df2 = Seq(
  (3, Seq("q", "w", "x", "a")),
  (2, Seq("r", "q", "l", "p")),
  (1, Seq("z", "k", "j", "f"))
).toDF("data_id", "data_list")

df1.
  join(df2, "data_id").
  select($"data_id", padList(4)(df1("data_list"), df2("data_list")).as("data_list")).
  show
// +-------+------------+
// |data_id| data_list|
// +-------+------------+
// | 3|[a, b, d, q]|
// | 2|[c, a, b, e]|
// | 1|[g, z, k, j]|
// +-------+------------+
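On Spark 2.4+ the same padding can also be written with the built-in concat and slice functions instead of a UDF; a minimal sketch, reusing df1 and df2 from above:
import org.apache.spark.sql.functions.{col, concat, slice}

// Concatenate both lists, then keep only the first 4 elements (slice is 1-based).
df1.as("l")
  .join(df2.as("r"), "data_id")
  .select(col("data_id"),
          slice(concat(col("l.data_list"), col("r.data_list")), 1, 4).as("data_list"))
  .show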

Update values in a column based on values of another data frame's column values in PySpark

I have two data frames in PySpark: df1
+---+-----------------+
|id1| items1|
+---+-----------------+
| 0| [B, C, D, E]|
| 1| [E, A, C]|
| 2| [F, A, E, B]|
| 3| [E, G, A]|
| 4| [A, C, E, B, D]|
+---+-----------------+
and df2:
+---+-----------------+
|id2| items2|
+---+-----------------+
|001| [B]|
|002| [A]|
|003| [C]|
|004| [E]|
+---+-----------------+
I would like to create a new column in df1 that would update values in
items1 column, so that it only keeps values that also appear (in any row of) items2 in df2. The result should look as follows:
+---+-----------------+----------------------+
|id1| items1| items1_updated|
+---+-----------------+----------------------+
| 0| [B, C, D, E]| [B, C, E]|
| 1| [E, A, C]| [E, A, C]|
| 2| [F, A, E, B]| [A, E, B]|
| 3| [E, G, A]| [E, A]|
| 4| [A, C, E, B, D]| [A, C, E, B]|
+---+-----------------+----------------------+
I would normally use collect() to get a list of all values in items2 column and then use a udf applied to each row in items1 to get an intersection. But the data is extremely large (over 10 million rows) and I cannot use collect() to get such list. Is there a way to do this while keeping data in a data frame format? Or some other way without using collect()?
The first thing you want to do is explode the values in df2.items2 so that contents of the arrays will be on separate rows:
from pyspark.sql.functions import explode
df2 = df2.select(explode("items2").alias("items2"))
df2.show()
#+------+
#|items2|
#+------+
#| B|
#| A|
#| C|
#| E|
#+------+
(This assumes that the values in df2.items2 are distinct- if not, you would need to add df2 = df2.distinct().)
Option 1: Use crossJoin:
Now you can crossJoin the new df2 back to df1 and keep only the rows where df1.items1 contains an element in df2.items2. We can achieve this using pyspark.sql.functions.array_contains and this trick that allows us to use a column value as a parameter.
After filtering, group by id1 and items1, and aggregate using pyspark.sql.functions.collect_list:
from pyspark.sql.functions import expr, collect_list
df1.alias("l").crossJoin(df2.alias("r"))\
.where(expr("array_contains(l.items1, r.items2)"))\
.groupBy("l.id1", "l.items1")\
.agg(collect_list("r.items2").alias("items1_updated"))\
.show()
#+---+---------------+--------------+
#|id1| items1|items1_updated|
#+---+---------------+--------------+
#| 1| [E, A, C]| [A, C, E]|
#| 0| [B, C, D, E]| [B, C, E]|
#| 4|[A, C, E, B, D]| [B, A, C, E]|
#| 3| [E, G, A]| [A, E]|
#| 2| [F, A, E, B]| [B, A, E]|
#+---+---------------+--------------+
Option 2: Explode df1.items1 and left join:
Another option is to explode the contents of items1 in df1 and do a left join. After the join, we have to do a similar group by and aggregation as above. This works because collect_list ignores the null values introduced by the non-matching rows:
df1.withColumn("items1", explode("items1")).alias("l")\
.join(df2.alias("r"), on=expr("l.items1=r.items2"), how="left")\
.groupBy("l.id1")\
.agg(
collect_list("l.items1").alias("items1"),
collect_list("r.items2").alias("items1_updated")
).show()
#+---+---------------+--------------+
#|id1| items1|items1_updated|
#+---+---------------+--------------+
#| 0| [E, B, D, C]| [E, B, C]|
#| 1| [E, C, A]| [E, C, A]|
#| 3| [E, A, G]| [E, A]|
#| 2| [F, E, B, A]| [E, B, A]|
#| 4|[E, B, D, C, A]| [E, B, C, A]|
#+---+---------------+--------------+

Convert Array[Row] which has List of Lists to Dataframe in Scala Spark

I am new to Scala in Spark.
I have an input dataframe with one column.
Each element in the dataframe is a list of lists, which I need to convert to a DataFrame.
def functionName(x: DataFrame){
//CODE TO DO ON DATAFRAME
}
> inputdf.show()
+------------------------+
| col |
+------------------------+
| [[a, b, c], [d, e, f]] |
| [[g, h, i], [j, k, l]] |
| [[m, n, o], [p, q, r]] |
| [[s, t, u], [v, w, x]] |
+------------------------+
To convert each row to a dataframe I have used:
> inputdf.rdd.map(row => functionName(row.toDF()))
> inputdf.rdd.map(row => functionName(sqlContext.createDataFrame(sc.parallelize(row))))
> inputdf.rdd.map(row => functionName(sqlContext.createDataFrame(sc.parallelize(Seq(row)))))
I have tried most of the methods suggested on Stack Overflow, but none of them worked. Can someone suggest how to convert each row of inputdf using a map function? Thanks in advance.
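One possible workaround, as a sketch only: sqlContext and sc exist only on the driver, so DataFrames cannot be created inside rdd.map running on the executors, which is why the attempts above fail. Assuming inputdf is small enough to collect, one option is to build a DataFrame per row on the driver; the column name col and functionName come from the question, everything else is an assumption:
import org.apache.spark.sql.DataFrame

// Assumes a SparkSession named `spark` is in scope (e.g. spark-shell).
import spark.implicits._

def functionName(x: DataFrame): Unit = {
  // CODE TO DO ON DATAFRAME
  x.show()
}

// Driver-side sketch: extract each row's array<array<string>> column and
// turn it into a small DataFrame with one array<string> value per inner list.
inputdf.collect().foreach { row =>
  val lists = row.getAs[Seq[Seq[String]]]("col")
  functionName(lists.toDF("inner_list"))
}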