How to update the second dataframe's exists value if a row exists in the first dataframe - Scala

I have two data frames. I want to check whether df1 contains any row of df2, where the key is (a, b). If a row matches, set exists to True in df2; then add the new rows from df1 with exists set to False.
df1
a | b | c | d
1 | 1 | 3 | 4
2 | 2 | 4 | 1
3 | 3 | 5 | 3
df2
a | b | c | d
1 | 1 | 4 | 5
4 | 4 | 3 | 2
The result should look like this:
df3
a | b | c | d | exists
1 | 1 | 4 | 5 | True
4 | 4 | 3 | 2 | False
1 | 1 | 3 | 4 | False
2 | 2 | 4 | 1 | False
3 | 3 | 5 | 3 | False
So far I have this:
val newdf = df1.join(df2, df1("a")===df2("a") && df1("b") === df2("b"), "left")
.select(df2("a"), df2("b"),df2("c"),df2("d"),when(df2("a").isNull, false).otherwise(true).alias("exists"))
which returns
a | b | c | d | exists
1 | 1 | 4 | 5 | True
The rest of the rows come back as null (because the selected columns come from df2, which has no match for them).

Try left_semi and left_anti joins, then unionAll the datasets.
Example:
df2.join(df1, Seq("a", "b"), "left_semi").withColumn("exists", lit("True")).
  unionAll(df2.join(df1, Seq("a", "b"), "left_anti").withColumn("exists", lit("False"))).
  unionAll(df1.withColumn("exists", lit("False"))).show()
//+---+---+---+---+------+
//| a| b| c| d|exists|
//+---+---+---+---+------+
//| 1| 1| 4| 5| True|
//| 4| 4| 3| 2| False|
//| 1| 1| 3| 4| False|
//| 2| 2| 4| 1| False|
//| 3| 3| 5| 3| False|
//+---+---+---+---+------+
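For completeness, here is a self-contained sketch of the same approach, assuming a spark-shell style session where spark is in scope; it uses union (the non-deprecated name for unionAll) and a boolean exists column instead of the string literals above:
import org.apache.spark.sql.functions.lit
import spark.implicits._

// Hypothetical reconstruction of the example frames from the question
val df1 = Seq((1, 1, 3, 4), (2, 2, 4, 1), (3, 3, 5, 3)).toDF("a", "b", "c", "d")
val df2 = Seq((1, 1, 4, 5), (4, 4, 3, 2)).toDF("a", "b", "c", "d")

// df2 rows whose (a, b) key also appears in df1 -> exists = true
val matched = df2.join(df1, Seq("a", "b"), "left_semi").withColumn("exists", lit(true))
// df2 rows whose (a, b) key does not appear in df1 -> exists = false
val unmatched = df2.join(df1, Seq("a", "b"), "left_anti").withColumn("exists", lit(false))
// every df1 row is appended with exists = false
val df3 = matched.union(unmatched).union(df1.withColumn("exists", lit(false)))

df3.show()
This should print the df3 table from the question, though the row order may differ.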

Related

Add column elements to a Dataframe Scala Spark

I have two dataframes, and I want to add one to all rows of the other one.
My dataframes are like:
id | name | rate
1 | a | 3
1 | b | 4
1 | c | 1
2 | a | 2
2 | d | 4
name
a
b
c
d
e
And I want a result like this:
id | name | rate
1 | a | 3
1 | b | 4
1 | c | 1
1 | d | null
1 | e | null
2 | a | 2
2 | b | null
2 | c | null
2 | d | 4
2 | e | null
How can I do this?
It seems it's more than a simple join.
val df = df1.select("id").distinct().crossJoin(df2).join(
    df1,
    Seq("name", "id"),
    "left"
  ).orderBy("id", "name")
df.show
+----+---+----+
|name| id|rate|
+----+---+----+
| a| 1| 3|
| b| 1| 4|
| c| 1| 1|
| d| 1|null|
| e| 1|null|
| a| 2| 2|
| b| 2|null|
| c| 2|null|
| d| 2| 4|
| e| 2|null|
+----+---+----+
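For reproducibility, here is a hedged sketch of how the two input frames might be built (assuming a session where spark is in scope); the names df1 and df2 mirror the snippet above:
import spark.implicits._

// Hypothetical reconstruction of the example data from the question
val df1 = Seq((1, "a", 3), (1, "b", 4), (1, "c", 1), (2, "a", 2), (2, "d", 4))
  .toDF("id", "name", "rate")
val df2 = Seq("a", "b", "c", "d", "e").toDF("name")

// Pair every distinct id with every name, then attach the known rates;
// missing (id, name) combinations keep a null rate from the left join.
val df = df1.select("id").distinct().crossJoin(df2)
  .join(df1, Seq("name", "id"), "left")
  .orderBy("id", "name")
df.show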

Pyspark - advanced aggregation of monthly data

I have a table of the following format.
|---------------------|------------------|------------------|
| Customer            | Month            | Sales            |
|---------------------|------------------|------------------|
| A                   | 3                | 40               |
| A                   | 2                | 50               |
| B                   | 1                | 20               |
|---------------------|------------------|------------------|
I need it in the format below:
|---------------------|------------------|------------------|------------------|
| Customer            | Month 1          | Month 2          | Month 3          |
|---------------------|------------------|------------------|------------------|
| A                   | 0                | 50               | 40               |
| B                   | 20               | 0                | 0                |
|---------------------|------------------|------------------|------------------|
Can you please help me solve this problem in PySpark?
This should help. I am assuming you are using SUM to aggregate values from the original DF:
>>> df.show()
+--------+-----+-----+
|Customer|Month|Sales|
+--------+-----+-----+
| A| 3| 40|
| A| 2| 50|
| B| 1| 20|
+--------+-----+-----+
>>> import pyspark.sql.functions as F
>>> df2 = (df.withColumn('COLUMN_LABELS', F.concat(F.lit('Month '), F.col('Month')))
...          .groupby('Customer')
...          .pivot('COLUMN_LABELS')
...          .agg(F.sum('Sales'))
...          .fillna(0))
>>> df2.show()
+--------+-------+-------+-------+
|Customer|Month 1|Month 2|Month 3|
+--------+-------+-------+-------+
| A| 0| 50| 40|
| B| 20| 0| 0|
+--------+-------+-------+-------+
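For readers following along in Scala (the language of the other questions on this page), here is a roughly equivalent sketch, assuming a session where spark is in scope; the explicit cast to string before concat is an extra assumption for safety:
import org.apache.spark.sql.functions.{col, concat, lit, sum}
import spark.implicits._

// Hypothetical reconstruction of the example data
val df = Seq(("A", 3, 40), ("A", 2, 50), ("B", 1, 20)).toDF("Customer", "Month", "Sales")

// Build the pivot labels, pivot on them, sum the sales, and fill missing months with 0
val df2 = df
  .withColumn("COLUMN_LABELS", concat(lit("Month "), col("Month").cast("string")))
  .groupBy("Customer")
  .pivot("COLUMN_LABELS")
  .agg(sum("Sales"))
  .na.fill(0)

df2.show()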

Spark dataframe groupby and order group?

I have the following data:
+-------+----+----+
|user_id|time|item|
+-------+----+----+
| 1| 5| ggg|
| 1| 5| ddd|
| 1| 20| aaa|
| 1| 20| ppp|
| 2| 3| ccc|
| 2| 3| ttt|
| 2| 20| eee|
+-------+----+----+
This could be generated by the following code:
val df = sc.parallelize(Array(
  (1, 20, "aaa"),
  (1, 5, "ggg"),
  (2, 3, "ccc"),
  (1, 20, "ppp"),
  (1, 5, "ddd"),
  (2, 20, "eee"),
  (2, 3, "ttt"))).toDF("user_id", "time", "item")
How can I get the result:
+---------+------+------+----------+
| user_id | time | item | order_id |
+---------+------+------+----------+
| 1 | 5 | ggg | 1 |
| 1 | 5 | ddd | 1 |
| 1 | 20 | aaa | 2 |
| 1 | 20 | ppp | 2 |
| 2 | 3 | ccc | 1 |
| 2 | 3 | ttt | 1 |
| 2 | 20 | eee | 2 |
+---------+------+------+----------+
I want to group by user_id and time, order by time, and rank the groups. Thanks!
To rank the rows you can use the dense_rank window function, and the ordering can be achieved with a final orderBy transformation:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.dense_rank
val w = Window.partitionBy("user_id").orderBy("user_id", "time")
val result = df
  .withColumn("order_id", dense_rank().over(w))
  .orderBy("user_id", "time")
result.show()
+-------+----+----+--------+
|user_id|time|item|order_id|
+-------+----+----+--------+
| 1| 5| ddd| 1|
| 1| 5| ggg| 1|
| 1| 20| aaa| 2|
| 1| 20| ppp| 2|
| 2| 3| ttt| 1|
| 2| 3| ccc| 1|
| 2| 20| eee| 2|
+-------+----+----+--------+
Note that the order within the item column is not guaranteed.
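If a deterministic order inside each (user_id, time) group is also wanted, one option (an assumed tie-break, not part of the question) is to include item in the final sort:
// Tie-break on item so repeated runs print rows in the same order (assumed preference)
result.orderBy("user_id", "time", "item").show()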

Pyspark Join Tables

I'm new to PySpark. I have 'Table A' and 'Table B' and I need to join them to get 'Table C'. Can anyone help me, please?
I'm using DataFrames...
I don't know how to join these tables together in the right way...
Table A:
+--+----------+-----+
|id|year_month| qt |
+--+----------+-----+
| 1| 2015-05| 190 |
| 2| 2015-06| 390 |
+--+----------+-----+
Table B:
+---------+-----+
|year_month| sem |
+---------+-----+
| 2016-01| 1 |
| 2015-02| 1 |
| 2015-03| 1 |
| 2016-04| 1 |
| 2015-05| 1 |
| 2015-06| 1 |
| 2016-07| 2 |
| 2015-08| 2 |
| 2015-09| 2 |
| 2016-10| 2 |
| 2015-11| 2 |
| 2015-12| 2 |
+---------+-----+
Table C:
The join adds columns and also adds rows...
+--+----------+-----+-----+
|id|year_month| qt | sem |
+--+----------+-----+-----+
| 1| 2015-05 | 0 | 1 |
| 1| 2016-01 | 0 | 1 |
| 1| 2015-02 | 0 | 1 |
| 1| 2015-03 | 0 | 1 |
| 1| 2016-04 | 0 | 1 |
| 1| 2015-05 | 190 | 1 |
| 1| 2015-06 | 0 | 1 |
| 1| 2016-07 | 0 | 2 |
| 1| 2015-08 | 0 | 2 |
| 1| 2015-09 | 0 | 2 |
| 1| 2016-10 | 0 | 2 |
| 1| 2015-11 | 0 | 2 |
| 1| 2015-12 | 0 | 2 |
| 2| 2015-05 | 0 | 1 |
| 2| 2016-01 | 0 | 1 |
| 2| 2015-02 | 0 | 1 |
| 2| 2015-03 | 0 | 1 |
| 2| 2016-04 | 0 | 1 |
| 2| 2015-05 | 0 | 1 |
| 2| 2015-06 | 390 | 1 |
| 2| 2016-07 | 0 | 2 |
| 2| 2015-08 | 0 | 2 |
| 2| 2015-09 | 0 | 2 |
| 2| 2016-10 | 0 | 2 |
| 2| 2015-11 | 0 | 2 |
| 2| 2015-12 | 0 | 2 |
+--+----------+-----+-----+
Code:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
lA = [(1,"2015-05",190),(2,"2015-06",390)]
tableA = sqlContext.createDataFrame(lA, ["id","year_month","qt"])
tableA.show()
lB = [("2016-01",1),("2015-02",1),("2015-03",1),("2016-04",1),
("2015-05",1),("2015-06",1),("2016-07",2),("2015-08",2),
("2015-09",2),("2016-10",2),("2015-11",2),("2015-12",2)]
tableB = sqlContext.createDataFrame(lB,["year_month","sem"])
tableB.show()
It's not really a join, more a Cartesian product (cross join).
Spark 2
import pyspark.sql.functions as psf
tableA.crossJoin(tableB)\
    .withColumn(
        "qt",
        psf.when(tableB.year_month == tableA.year_month, psf.col("qt")).otherwise(0))\
    .drop(tableA.year_month)
Spark 1.6
tableA.join(tableB)\
    .withColumn(
        "qt",
        psf.when(tableB.year_month == tableA.year_month, psf.col("qt")).otherwise(0))\
    .drop(tableA.year_month)
+---+---+----------+---+
| id| qt|year_month|sem|
+---+---+----------+---+
| 1| 0| 2015-02| 1|
| 1| 0| 2015-03| 1|
| 1|190| 2015-05| 1|
| 1| 0| 2015-06| 1|
| 1| 0| 2016-01| 1|
| 1| 0| 2016-04| 1|
| 1| 0| 2015-08| 2|
| 1| 0| 2015-09| 2|
| 1| 0| 2015-11| 2|
| 1| 0| 2015-12| 2|
| 1| 0| 2016-07| 2|
| 1| 0| 2016-10| 2|
| 2| 0| 2015-02| 1|
| 2| 0| 2015-03| 1|
| 2| 0| 2015-05| 1|
| 2|390| 2015-06| 1|
| 2| 0| 2016-01| 1|
| 2| 0| 2016-04| 1|
| 2| 0| 2015-08| 2|
| 2| 0| 2015-09| 2|
| 2| 0| 2015-11| 2|
| 2| 0| 2015-12| 2|
| 2| 0| 2016-07| 2|
| 2| 0| 2016-10| 2|
+---+---+----------+---+
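The same cross-join-then-mask idea can be written in Scala; here is a hedged sketch for Spark 2.x, assuming a session where spark is in scope:
import org.apache.spark.sql.functions.when
import spark.implicits._

// Hypothetical reconstruction of the example tables
val tableA = Seq((1, "2015-05", 190), (2, "2015-06", 390)).toDF("id", "year_month", "qt")
val tableB = Seq(
  ("2016-01", 1), ("2015-02", 1), ("2015-03", 1), ("2016-04", 1),
  ("2015-05", 1), ("2015-06", 1), ("2016-07", 2), ("2015-08", 2),
  ("2015-09", 2), ("2016-10", 2), ("2015-11", 2), ("2015-12", 2)).toDF("year_month", "sem")

// Pair every id with every (year_month, sem) row; keep qt only where the months match
val tableC = tableA.crossJoin(tableB)
  .withColumn("qt", when(tableA("year_month") === tableB("year_month"), tableA("qt")).otherwise(0))
  .drop(tableA("year_month"))

tableC.show()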

subtract two columns with null in spark dataframe

I'm new to Spark. I have a dataframe df:
+----------+------------+-----------+
| Column1  | Column2    | Sub       |
+----------+------------+-----------+
| 1        | 2          | 1         |
| 4        | null       | null      |
| 5        | null       | null      |
| 6        | 8          | 2         |
+----------+------------+-----------+
When subtracting the two columns, one of them contains null, so the resulting column is also null.
df.withColumn("Sub", col("Column1") - col("Column2"))
Expected output should be:
+----------+------------+-----------+
| Column1  | Column2    | Sub       |
+----------+------------+-----------+
| 1        | 2          | 1         |
| 4        | null       | 4         |
| 5        | null       | 5         |
| 6        | 8          | 2         |
+----------+------------+-----------+
I don't want to replace Column2 with 0; it should remain null.
Can someone help me on this?
You can use the when function as follows:
import org.apache.spark.sql.functions._
df.withColumn("Sub",
  when(col("Column1").isNull, lit(0)).otherwise(col("Column1")) -
    when(col("Column2").isNull, lit(0)).otherwise(col("Column2")))
You should get the final result as:
+-------+-------+----+
|Column1|Column2| Sub|
+-------+-------+----+
| 1| 2|-1.0|
| 4| null| 4.0|
| 5| null| 5.0|
| 6| 8|-2.0|
+-------+-------+----+
You can coalesce nulls to zero on both columns and then do the subtraction:
import org.apache.spark.sql.functions.{abs, coalesce, lit}
import spark.implicits._

val df = Seq((Some(1), Some(2)),
             (Some(4), null),
             (Some(5), null),
             (Some(6), Some(8))
            ).toDF("A", "B")
df.withColumn("Sub", abs(coalesce($"A", lit(0)) - coalesce($"B", lit(0)))).show
+---+----+---+
| A| B|Sub|
+---+----+---+
| 1| 2| 1|
| 4|null| 4|
| 5|null| 5|
| 6| 8| 2|
+---+----+---+
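As a usage note, the same coalesce-plus-abs idea with the question's original column names, as a minimal self-contained sketch (assuming a session where spark is in scope); Column2 itself stays null, only the arithmetic treats it as 0:
import org.apache.spark.sql.functions.{abs, coalesce, col, lit}
import spark.implicits._

// Hypothetical reconstruction of the question's data; None becomes null in the frame
val df = Seq((1, Some(2)), (4, None), (5, None), (6, Some(8))).toDF("Column1", "Column2")

// Treat null as 0 only inside the subtraction; Column2 keeps its nulls
df.withColumn("Sub", abs(coalesce(col("Column1"), lit(0)) - coalesce(col("Column2"), lit(0)))).show()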