PySpark intersection of two dataframe columns

I have two dataframes in the following format:
One is a single row:
+----+---------+
|col1|     col2|
+----+---------+
|   A|[B, C, D]|
+----+---------+
The other one has multiple rows:
+----+---------+
|col1|     col2|
+----+---------+
|   F|[A, B, C]|
|   G|[J, K, B]|
|   H|[C, H, D]|
+----+---------+
I am looking for the intersection of these two:
+----+------+
|col1|  col2|
+----+------+
|   F|[B, C]|
|   G|   [B]|
|   H|[C, D]|
+----+------+
I tried the solution proposed here but it didn't help. Is there an efficient way to find the intersection between a one-row dataframe and a multi-row dataframe?

You can crossJoin the col2 from the single-row dataframe and use the array_intersect function for the required intersection.
from pyspark.sql import functions as func

data2_sdf. \
    crossJoin(func.broadcast(data1_sdf.selectExpr('col2 as col_to_check'))). \
    withColumn('reqd_intersect', func.array_intersect('col2', 'col_to_check')). \
    show(truncate=False)
# +----+---------+------------+--------------+
# |col1|col2 |col_to_check|reqd_intersect|
# +----+---------+------------+--------------+
# |F |[A, B, C]|[B, C, D] |[B, C] |
# |G |[J, K, B]|[B, C, D] |[B] |
# |H |[C, H, D]|[B, C, D] |[C, D] |
# +----+---------+------------+--------------+
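If the single-row dataframe is guaranteed to hold exactly one row, a join-free alternative is to collect its array to the driver and pass it in as a literal array column. A minimal sketch, assuming the same data1_sdf / data2_sdf as above (the names lookup_vals and lookup_col are only for illustration):

from pyspark.sql import functions as func

# pull the single lookup array to the driver, e.g. ['B', 'C', 'D']
lookup_vals = data1_sdf.select('col2').first()['col2']
# turn it into a literal array column
lookup_col = func.array(*[func.lit(v) for v in lookup_vals])

data2_sdf. \
    withColumn('reqd_intersect', func.array_intersect('col2', lookup_col)). \
    show(truncate=False)

Both approaches rely on array_intersect, which is available from Spark 2.4 onwards.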

Related

How to merge multiple Spark dataframe columns into one list using Scala?

I have the following Spark dataframe in Scala:
+---+---------+---------+--------+
| id|col_str_1|col_str_2|col_list|
+---+---------+---------+--------+
|  1|        A|        C|  [E, F]|
|  2|        B|        D|  [G, H]|
+---+---------+---------+--------+
Where col_str_1 and col_str_2 are of type String, and col_list is of type List[String].
I want a way to transform this dataframe into the following:
+---+------------+
| id|    col_list|
+---+------------+
|  1|[E, F, A, C]|
|  2|[G, H, B, D]|
+---+------------+
Any idea?
Thank you.
You can use concat to append elements to the array column:
import org.apache.spark.sql.functions.{array, col, concat}

val df2 = df.select(
  col("id"),
  concat(
    col("col_list"),
    array(col("col_str_1"), col("col_str_2"))
  ).as("col_list")
)
df2.show
+---+------------+
| id| col_list|
+---+------------+
| 1|[E, F, A, C]|
| 2|[G, H, B, D]|
+---+------------+

How to update a cell of a Spark dataframe

I have the following dataframe on which I'm trying to update a cell depending on some conditions (like SQL's UPDATE ... WHERE).
For example, let's say I have the following dataframe:
+-----+-------+
|datas|isExist|
+-----+-------+
|   AA|      x|
|   BB|      x|
|   CC|      O|
|   CC|      O|
|   DD|      O|
|   AA|      x|
|   AA|      x|
|   AA|      O|
|   AA|      O|
+-----+-------+
How could I update the value to X when datas = AA and isExist is O? Here is the expected output:
+-------+-------+
|IPCOPE2|IPROPE2|
+-------+-------+
|     AA|      x|
|     BB|      x|
|     CC|      O|
|     CC|      O|
|     DD|      O|
|     AA|      x|
|     AA|      x|
|     AA|      X|
|     AA|      X|
+-------+-------+
I could do a filter, then a union, but I don't think that's the best solution. I could also use when, but in that case I would have to create a new line containing the same values except for the isExist column; in this example that is an acceptable solution, but what if I have 20 columns?!
You can create a new column using withColumn (holding either the original or the updated value) and then drop the isExist column.
I am not sure why you do not want to use when, for it seems to be exactly what you need. The withColumn method, when used with an existing column name, will simply replace that column with the new value:
df.withColumn("isExist",
when('datas === "AA" && 'isExist === "O", "X").otherwise('isExist))
.show()
+-----+-------+
|datas|isExist|
+-----+-------+
| AA| x|
| BB| x|
| CC| O|
| CC| O|
| DD| O|
| AA| x|
| AA| x|
| AA| X|
| AA| X|
+-----+-------+
Then you can use withColumnRenamed to change the names of your columns (e.g. df.withColumnRenamed("datas", "IPCOPE2")).

How to aggregate a sum over multiple columns of a DataFrame using a reduce function and not groupBy?

How can I aggregate a sum over multiple columns of a DataFrame using a reduce function and not groupBy? Since groupBy with sum is taking a lot of time, I am now thinking of using a reduce function. Any lead will be helpful.
Input:
| A | B | C | D |
| x | 1 | 2 | 3 |
| x | 2 | 3 | 4 |
CODE:
dataFrame.groupBy("A").sum()
Output:
| A | B | C | D |
| x | 3 | 5 | 7 |
You will have to convert the DataFrame to an RDD to perform the reduceByKey operation.
val rows: RDD[Row] = df.rdd
Once you create your RDD, you can use reduceByKey to add the values of multiple columns:
val input = sc.parallelize(List(("X", 1, 2, 3), ("X", 2, 3, 4)))
// key by the first field, then sum the remaining three fields pairwise
val final_rdd = input.map { case (a, b, c, d) => (a, (b, c, d)) }
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2, x._3 + y._3))
spark.createDataFrame(final_rdd).toDF("M", "N")
  .select($"M", $"N._1".as("X"), $"N._2".as("Y"), $"N._3".as("Z")).show(10)
+---+---+---+---+
| M| X| Y| Z|
+---+---+---+---+
| X| 3| 5| 7|
+---+---+---+---+

How would you generate a new array column over a window?

I'm trying to generate a new column that is an array over a window; however, it appears that the array function does not work over a window, and I'm struggling to find an alternative method.
Code snippet:
df = df.withColumn('array_output', F.array(df.things_to_agg_in_array).over(Window.partitionBy("aggregate_over_this")))
Ideally what I'd like is an output that looks like the following table:
+---------------------+------------------------+--------------+
| Aggregate Over This | Things to Agg in Array | Array Output |
+---------------------+------------------------+--------------+
| 1 | C | [C,F,K,L] |
+---------------------+------------------------+--------------+
| 1 | F | [C,F,K,L] |
+---------------------+------------------------+--------------+
| 1 | K | [C,F,K,L] |
+---------------------+------------------------+--------------+
| 1 | L | [C,F,K,L] |
+---------------------+------------------------+--------------+
| 2 | A | [A,B,C] |
+---------------------+------------------------+--------------+
| 2 | B | [A,B,C] |
+---------------------+------------------------+--------------+
| 2 | C | [A,B,C] |
+---------------------+------------------------+--------------+
For further context, this is part of an explode which will then be rejoined onto another table based on 'aggregate over this', and as a result only one instance of array_output will be returned.
Thanks
This solution uses collect_list(); not sure if it fulfills your requirement.
myValues = [(1,'C'),(1,'F'),(1,'K'),(1,'L'),(2,'A'),(2,'B'),(2,'C')]
df = sqlContext.createDataFrame(myValues,['Aggregate_Over_This','Things_to_Agg_in_Array'])
df.show()
+-------------------+----------------------+
|Aggregate_Over_This|Things_to_Agg_in_Array|
+-------------------+----------------------+
| 1| C|
| 1| F|
| 1| K|
| 1| L|
| 2| A|
| 2| B|
| 2| C|
+-------------------+----------------------+
df.registerTempTable('table_view')
df1 = sqlContext.sql(
    'select Aggregate_Over_This, Things_to_Agg_in_Array, collect_list(Things_to_Agg_in_Array) over (partition by Aggregate_Over_This) as array_output from table_view'
)
df1.show()
+-------------------+----------------------+------------+
|Aggregate_Over_This|Things_to_Agg_in_Array|array_output|
+-------------------+----------------------+------------+
| 1| C|[C, F, K, L]|
| 1| F|[C, F, K, L]|
| 1| K|[C, F, K, L]|
| 1| L|[C, F, K, L]|
| 2| A| [A, B, C]|
| 2| B| [A, B, C]|
| 2| C| [A, B, C]|
+-------------------+----------------------+------------+
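The same window aggregation can also be written with the DataFrame API rather than SQL. A short sketch, assuming the same df built above (and, since the result is going to be rejoined on Aggregate_Over_This, a plain groupBy that yields one row per key may be all that is needed):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# window version: keeps one output row per input row
w = Window.partitionBy('Aggregate_Over_This')
df2 = df.withColumn('array_output',
                    F.collect_list('Things_to_Agg_in_Array').over(w))

# grouped version: one row per key, ready to be joined back onto another table
df_grouped = df.groupBy('Aggregate_Over_This'). \
    agg(F.collect_list('Things_to_Agg_in_Array').alias('array_output'))

The window form is handy when every exploded row needs the array alongside it; the groupBy form avoids duplicating the array before the rejoin.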

Convert the map RDD into dataframe

I am using Spark 1.6.0. I have an input RDD of (key, value) pairs and want to convert it to a dataframe.
Input format RDD:
((1, A, ABC), List(pz,A1))
((2, B, PQR), List(az,B1))
((3, C, MNR), List(cs,c1))
Output format:
+----+----+-----+----+----+
| c1 | c2 | c3 | c4 | c5 |
+----+----+-----+----+----+
| 1 | A | ABC | pz | A1 |
+----+----+-----+----+----+
| 2 | B | PQR | az | B1 |
+----+----+-----+----+----+
| 3 | C | MNR | cs | C1 |
+----+----+-----+----+----+
Can someone help me with this?
I would suggest you go with Datasets, as Datasets are optimized, type-safe DataFrames.
First you need to create a case class:
case class table(c1: Int, c2: String, c3: String, c4:String, c5:String)
Then you just need a map function to parse your data into the case class and call .toDS:
rdd.map(x => table(x._1._1, x._1._2, x._1._3, x._2(0), x._2(1))).toDS().show()
You should have the following output:
+---+---+---+---+---+
| c1| c2| c3| c4| c5|
+---+---+---+---+---+
| 1| A|ABC| pz| A1|
| 2| B|PQR| az| B1|
| 3| C|MNR| cs| c1|
+---+---+---+---+---+
You can use a DataFrame as well; for that, use .toDF() instead of .toDS():
val a = Seq(
  ((1, "A", "ABC"), List("pz", "A1")),
  ((2, "B", "PQR"), List("az", "B1")),
  ((3, "C", "MNR"), List("cs", "c1")))
val a1 = sc.parallelize(a)
val a2 = a1.map(rec =>
  (rec._1._1, rec._1._2, rec._1._3, rec._2(0), rec._2(1))).toDF()
a2.show()
+---+---+---+---+---+
| _1| _2| _3| _4| _5|
+---+---+---+---+---+
|  1|  A|ABC| pz| A1|
|  2|  B|PQR| az| B1|
|  3|  C|MNR| cs| c1|
+---+---+---+---+---+