Handling data skew without salting the join key in Spark (Scala)

I am trying to inner join a DataFrame with a million rows to a DataFrame with 30 rows. Both tables share the same join key, and Spark performs a sort-merge join, which pushes all my data onto one executor so the job never finishes. For example:
DF1 (million-row DataFrame, registered as TempView DF1)
+-------+-----------+
| id | price |
+-------+-----------+
| 1 | 30 |
| 1 | 10 |
| 1 | 12 |
| 1 | 15 |
+-------+-----------+
DF2 (30-row DataFrame, registered as TempView DF2)
+-------+-----------+
| id | Month |
+-------+-----------+
| 1 | Jan |
| 1 | Feb |
+-------+-----------+
I tried the following:
Broadcasting
spark.sql("Select /*+ BROADCAST(Df2) */ Df1.* from Df1 inner join Df2 on Df1.id=Df2.id").createTempView("temp")
Repartitioning
Df1.repartition(200)
Query Execution Plan
00 Project [.......................]
01 +- SortMergeJoin [.............................], Inner
02    :- Project [.............................]
03    :  +- Filter isnotnull[JoinKey]
04    :     +- FileScan orc [..........................]
05    +- Project [.............................]
06       +- BroadcastHashJoin [..........................], LeftOuter, BuildRight
07          :- BroadcastHashJoin [......................], LeftSemi, BuildRight
Output of the number of rows per partition:
spark.table("temp").withColumn("partition_id", spark_partition_id()).groupBy("partition_id").count
+------------+----------+
|partition_id|     count|
+------------+----------+
|          21|30,000,000|
+------------+----------+
Even though I repartition/broadcast the data, Spark still brings everything to one executor while joining, so the data gets skewed onto a single executor. I also tried setting spark.sql.join.preferSortMergeJoin to false, but I still see the skew on one executor. Can anyone help me?

Doing it like the following, it works fine for me. The data is left as is, with no explicit partitioning.
import org.apache.spark.sql.functions.broadcast
// Simulate some data
val df1 = spark.range(1000000).rdd.map(x => (1, "xxx")).toDF("one", "val")
val df2 = spark.range(30).rdd.map(x => (1, "yyy")).toDF("one", "val2")
// Data is as is, has no partitioning applied
val df3 = df1.join(broadcast(df2), "one")
df3.count // An action to kick it all along
// Look at final counts of partitions
val rddcounts = df3.rdd.mapPartitions(iter => Array(iter.size).iterator, true)
rddcounts.collect
returns:
res26: Array[Int] = Array(3750000, 3750000, 3750000, 3750000, 3750000, 3750000, 3750000, 3750000)
This relies on default parallelism, which is 8 on a Community Edition Databricks cluster.
Broadcast should work in any event, since the small table really is small.
Even with this:
val df = spark.range(1000000).rdd.map(x => (1, "xxx")).toDF("one", "val")
val df1 = df.repartition(50)
It still works in parallel, with 50 partitions. repartition(n) without a column does round-robin partitioning, so the partitions are spread over the N Workers / Executors of the cluster. Hash partitioning is only invoked when you repartition by a column, and if all values of that column are the same, everything hashes to the same partition, i.e. all the data ends up on one Worker. That is what causes the skew.
QED: not everything runs on only one Executor, unless you have only one Executor for the Spark app or you hash-partitioned on a constant key.
I ran this afterwards on my laptop with local[4] and the data was serviced by 4 cores, so effectively 4 Executors. No salting, parallelism of 4. So it is odd that you cannot get that, unless you hashed.
On a real cluster you would likewise see 4 parallel Tasks, and thus not everything on 1 Executor.
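If the SQL hint keeps being ignored, you can force and verify the broadcast through the DataFrame API as well. A minimal sketch, assuming df1 and df2 hold the large and the small table from the question:
import org.apache.spark.sql.functions.broadcast
// Force a broadcast hash join explicitly instead of relying on the SQL hint.
val joined = df1.join(broadcast(df2), Seq("id"), "inner")
// Check that the physical plan shows BroadcastHashJoin, not SortMergeJoin.
joined.explain()
// Optionally raise the auto-broadcast threshold (in bytes) so Spark picks
// a broadcast join on its own for small tables, here up to ~100 MB.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100 * 1024 * 1024)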

Why does all the data move to one executor? If you only have the same
id (id = 1) in DF1 and you join DF2 on id, then according to the
HashPartitioner all rows with id = 1 will always end up together.
Did the broadcast join actually happen? Check it in the Spark UI.
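A quick sketch that shows the effect described in the comment: with every row sharing the same key, hash partitioning by that key collapses everything into one partition, while plain round-robin repartitioning spreads the rows out (the data here is made up for the demo):
import org.apache.spark.sql.functions.{col, spark_partition_id}
// Every row has id = 1, mimicking the skewed key in the question.
val skewed = spark.range(1000000).selectExpr("1 as id", "id as price")
// Hash partitioning by the constant key: one partition gets all the rows.
skewed.repartition(200, col("id"))
  .groupBy(spark_partition_id().as("partition_id")).count().show()
// Round-robin repartitioning: rows spread roughly evenly over 200 partitions.
skewed.repartition(200)
  .groupBy(spark_partition_id().as("partition_id")).count().show()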

Related

spark merge datasets based on the same input of one column and concat the others

Currently I have several Dataset[UserRecord], and it looks like this
case class UserRecord(
Id: String,
ts: Timestamp,
detail: String
)
Let's call the several datasets datasets.
Previously I tried this
import org.apache.spark.sql.functions.{collect_list, struct}
datasets.reduce(_ union _)
  .groupBy("Id")
  .agg(collect_list(struct("ts", "detail")))
  .as[(String, Seq[DetailRecord])]
but this code gives me an OOM error. I think the root cause is collect_list.
Now I'm wondering whether I can do the groupBy and agg for each dataset first and then join them together to solve the OOM issue. Any other good advice is welcome too :)
I have an IndexedSeq of datasets that look like this:
|name| lists |
| x |[[1,2], [3,4]]|
|name| lists |
| y |[[5,6], [7,8]]|
|name| lists |
| x |[[9,10], [11,12]]|
How can I combine them to get a Dataset that looks like
|name| lists |
| x |[[1,2], [3,4],[9,10], [11,12]]|
| y |[[5,6], [7,8]] |
I tried ds.reduce(_ union _) but it didn't seem to work
You can aggregate after union:
val ds2 = ds.reduce(_ unionAll _).groupBy("name").agg(flatten(collect_list("lists")).as("lists"))
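For completeness, a fuller sketch of the same idea, assuming datasets is the IndexedSeq from the question, all with the same name/lists schema, and Spark 2.4+ for flatten:
import org.apache.spark.sql.functions.{collect_list, flatten}
val combined = datasets
  .reduce(_ union _)                                   // stack all datasets row-wise
  .groupBy("name")
  .agg(flatten(collect_list("lists")).as("lists"))     // concatenate the per-name lists
combined.show(false)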

How to get a new dataframe with only 10 lines per user_id (column name)? [duplicate]

This question already has an answer here:
get TopN of all groups after group by using Spark DataFrame
(1 answer)
Closed 4 years ago.
I have a dataframe that looks like:
scala> df.show()
+-------+-------+
|user_id|book_id|
+-------+-------+
| 235610|2757548|
| 235610|2352922|
| 235610| 620968|
| 235610|1037143|
| 235610|2319578|
| ... | .... |
| 235610|1037143|
| 235610|2319578|
and it has three different users in the "user_id" column, as follows:
scala> val df1 = df.select("user_id").distinct()
scala> df1.show()
+-------+
|user_id|
+-------+
| 235610|
| 211065|
| 211050|
+-------+
The number of lines per user ("235610", "211065", "211050") is as follows:
scala> df.filter($"user_id"==="235610").count()
res28: Long = 140
scala> df.filter($"user_id"==="211065").count()
res29: Long = 51
scala> df.filter($"user_id"==="211050").count()
res30: Long = 64
Now my problem is how to get a new dataframe with only 10 lines per user_id, given that every user_id ("235610", "211065", "211050") has more than 10 records.
Spark version is 2.3.0. Any help will be appreciated.
If you were on an older Spark version (1.x), rank would only work with the Hive context, so you would register your df on the hiveContext:
df.registerTempTable("tempDF")
val dfRanked = hiveContext.sql("""
  select dataWithRank.*,
         dense_rank() over (partition by dataWithRank.user_id
                            order by dataWithRank.book_id desc) as Rank
  from tempDF as dataWithRank""")
dfRanked.filter("Rank <= 10")
Here is documentation about Hive rank:
http://www.openkb.info/2016/02/difference-between-spark-hivecontext.html
You can try using the rank function over a window partitioned by user_id and ordered by book_id.
Based on that rank you can filter where rank <= 10 to keep 10 records per user_id.
I hope it helps.
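The same idea with the DataFrame API on Spark 2.x, as a sketch (row_number gives exactly 10 rows per user, whereas rank/dense_rank can return more when there are ties; df is the DataFrame from the question):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}
val w = Window.partitionBy("user_id").orderBy("book_id")
val top10PerUser = df
  .withColumn("rn", row_number().over(w))   // 1, 2, 3, ... within each user_id
  .filter(col("rn") <= 10)                  // keep the first 10 rows per user
  .drop("rn")
top10PerUser.groupBy("user_id").count().show()   // each count should be 10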

PySpark join dataframes and merge contents of specific columns

My goal is to merge two dataframes on the column id, and perform a somewhat complex merge on another column that contains JSON we can call data.
Suppose I have the DataFrame df1 that looks like this:
id | data
---------------------------------
42 | {'a_list':['foo'],'count':1}
43 | {'a_list':['scrog'],'count':0}
And I'm interested in merging with a similar, but different DataFrame df2:
id | data
---------------------------------
42 | {'a_list':['bar'],'count':2}
44 | {'a_list':['baz'],'count':4}
And I would like the following DataFrame, joining and merging properties from the JSON data where id matches, but retaining rows where id does not match and keeping the data column as-is:
id | data
---------------------------------------
42 | {'a_list':['foo','bar'],'count':3} <-- where 'bar' is added to 'foo', and count is summed
43 | {'a_list':['scrog'],'count':1}
44 | {'a_list':['baz'],'count':4}
As can be seen where id is 42, there is some logic I will have to apply to how the JSON is merged.
My knee-jerk thought is that I'd like to provide a lambda / udf to merge the data column, but I'm not sure how to think about that during a join.
Alternatively, I could break the properties from the JSON out into columns, something like this; maybe that would be a better approach?
df1:
id | a_list | count
----------------------
42 | ['foo'] | 1
43 | ['scrog'] | 0
df2:
id | a_list | count
---------------------
42 | ['bar'] | 2
44 | ['baz'] | 4
Resulting:
id | a_list | count
---------------------------
42 | ['foo', 'bar'] | 3
43 | ['scrog'] | 0
44 | ['baz'] | 4
If I went this route, I would then have to merge the columns a_list and count into JSON again under a single column data, but this I can wrap my head around as a relatively simple map function.
Update: Expanding on Question
More realistically, I will have n number of DataFrames in a list, e.g. df_list = [df1, df2, df3], all shaped the same. What is an efficient way to perform these same actions on n number of DataFrames?
Update to Update
Not sure how efficient this is, or if there is a more Spark-esque way to do this, but incorporating the accepted answer, this appears to work for the question update:
for i in range(0, (len(validations) - 1)):
    # set dfs
    df1 = validations[i]['df']
    df2 = validations[(i + 1)]['df']

    # joins here...

    # update new_df
    new_df = df2
Here's one way to accomplish your second approach:
Explode the list column and then unionAll the two DataFrames. Next groupBy the "id" column and use pyspark.sql.functions.collect_list() and pyspark.sql.functions.sum():
import pyspark.sql.functions as f
new_df = df1.select("id", f.explode("a_list").alias("a_values"), "count")\
    .unionAll(df2.select("id", f.explode("a_list").alias("a_values"), "count"))\
    .groupBy("id")\
    .agg(f.collect_list("a_values").alias("a_list"), f.sum("count").alias("count"))
new_df.show(truncate=False)
#+---+----------+-----+
#|id |a_list |count|
#+---+----------+-----+
#|43 |[scrog] |0 |
#|44 |[baz] |4 |
#|42 |[foo, bar]|3 |
#+---+----------+-----+
Finally you can use pyspark.sql.functions.struct() and pyspark.sql.functions.to_json() to convert this intermediate DataFrame into your desired structure:
new_df = new_df.select("id", f.to_json(f.struct("a_list", "count")).alias("data"))
new_df.show()
#+---+----------------------------------+
#|id |data |
#+---+----------------------------------+
#|43 |{"a_list":["scrog"],"count":0} |
#|44 |{"a_list":["baz"],"count":4} |
#|42 |{"a_list":["foo","bar"],"count":3}|
#+---+----------------------------------+
Update
If you had a list of dataframes in df_list, you could do the following:
from functools import reduce # for python3
df_list = [df1, df2]
new_df = reduce(lambda a, b: a.unionAll(b), df_list)\
    .select("id", f.explode("a_list").alias("a_values"), "count")\
    .groupBy("id")\
    .agg(f.collect_list("a_values").alias("a_list"), f.sum("count").alias("count"))\
    .select("id", f.to_json(f.struct("a_list", "count")).alias("data"))

spark structured streaming joining aggregate dataframe to dataframe

I have a streaming dataframe that could look at some point like:
+-----+--------+
|owner|  fruits|
+-----+--------+
|Brian|   apple|
|Brian|    pear|
|Brian|    date|
|Brian| avocado|
|  Bob| avocado|
|  Bob|   apple|
|  ...|     ...|
+-----+--------+
I performed a groupBy with agg collect_list to clean things up:
val myFarmDF = farmDF.withWatermark("timeStamp", "1 seconds").groupBy("owner").agg(collect_list(col("fruits")) as "fruitsA")
The output is a single row for each owner with an array of all their fruits.
I would now like to join this cleaned-up array back to the original streaming dataframe, dropping the fruits column and keeping just the fruitsA column:
val joinedDF = farmDF.join(myFarmDF, "owner").drop("fruits")
this seems to work in my head, but spark doesn't seem to agree.
I get a
Failure when resolving conflicting references in Join:
'Join Inner
...
+- AnalysisBarrier
+- Aggregate [name#17], [name#17, collect_list(fruits#61, 0, 0) AS fruitA#142]
When I turn everything into a static dataframe, it works just fine. Is this not possible in a streaming context?
Have you tried renaming the column? There is a similar problem: https://issues.apache.org/jira/browse/SPARK-19860
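A minimal sketch of that renaming suggestion, aliasing the grouping column so the self-join no longer sees two attributes with the same name (whether the streaming aggregate-then-join itself is allowed still depends on the Spark version, output mode and watermarks):
import org.apache.spark.sql.functions.{col, collect_list}
val cleanedDF = farmDF
  .withWatermark("timeStamp", "1 seconds")
  .groupBy(col("owner").as("fruitOwner"))           // aliased grouping column
  .agg(collect_list(col("fruits")).as("fruitsA"))
val joinedDF = farmDF
  .join(cleanedDF, farmDF("owner") === cleanedDF("fruitOwner"))
  .drop("fruits", "fruitOwner")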

How to merge two columns into a new DataFrame?

I have two DataFrames (Spark 2.2.0 and Scala 2.11.8). The first DataFrame df1 has one column called col1, and the second one df2 has also 1 column called col2. The number of rows is equal in both DataFrames.
How can I merge these two columns into a new DataFrame?
I tried join, but I think that there should be some other way to do it.
Also, I tried to apply withColumn, but it does not compile.
val result = df1.withColumn(col("col2"), df2.col1)
UPDATE:
For example:
df1 =
col1
1
2
3
df2 =
col2
4
5
6
result =
col1 col2
1 4
2 5
3 6
If there's no actual relationship between these two columns, it sounds like you need the union operator, which will return, well, just the union of these two dataframes:
var df1 = Seq("a", "b", "c").toDF("one")
var df2 = Seq("d", "e", "f").toDF("two")
df1.union(df2).show
+---+
|one|
+---+
| a |
| b |
| c |
| d |
| e |
| f |
+---+
[edit]
Now that you've made clear that you just want two columns, with DataFrames you can use the trick of adding a row index with the function monotonically_increasing_id() and joining on that index value:
import org.apache.spark.sql.functions.monotonically_increasing_id
var df1 = Seq("a", "b", "c").toDF("one")
var df2 = Seq("d", "e", "f").toDF("two")
df1.withColumn("id", monotonically_increasing_id())
.join(df2.withColumn("id", monotonically_increasing_id()), Seq("id"))
.drop("id")
.show
+---+---+
|one|two|
+---+---+
| a | d |
| b | e |
| c | f |
+---+---+
As far as I know, the only way to do what you want with DataFrames is by adding an index column using RDD.zipWithIndex to each and then doing a join on the index column. Code for doing zipWithIndex on a DataFrame can be found in this SO answer.
But, if the DataFrames are small, it would be much simpler to collect the two DFs in the driver, zip them together, and make the result into a new DataFrame.
[Update with example of in-driver collect/zip]
val df3 = spark.createDataFrame(df1.collect() zip df2.collect()).withColumnRenamed("_1", "col1").withColumnRenamed("_2", "col2")
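For reference, a sketch of the zipWithIndex variant; withRowIndex is a hypothetical helper, and both inputs are assumed to have exactly the same number of rows:
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}
// Append a stable row index to a DataFrame via RDD.zipWithIndex.
def withRowIndex(df: DataFrame): DataFrame = {
  val indexed = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
  val schema = StructType(df.schema.fields :+ StructField("row_idx", LongType, nullable = false))
  spark.createDataFrame(indexed, schema)
}
val result = withRowIndex(df1)
  .join(withRowIndex(df2), Seq("row_idx"))   // align rows by their original position
  .drop("row_idx")
result.show()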
It depends on what you want to do.
If you want to merge two DataFrames you should use a join. The same join types exist as in relational algebra (or any DBMS).
But you are saying that your DataFrames have just one column each.
In that case you might want to do a cross join (Cartesian product), which gives you a two-column table of all possible combinations of col1 and col2, or you might want the union (as referred to by @Chondrops), which gives you a one-column table with all elements.
I think all the other join types can be achieved with specialized operations in Spark (in this case, two DataFrames with one column each).
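As an example, a short sketch of the cross join option (Spark 2.1+), using the one-column df1 and df2 from the question:
val crossed = df1.crossJoin(df2)
crossed.show()
// 3 x 3 = 9 rows: every combination of col1 and col2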