Compare two dataframes and update the values - scala

I have two dataframes like the following.
val file1 = spark.read.format("csv").option("sep", ",").option("inferSchema", "true").option("header", "true").load("file1.csv")
file1.show()
+---+-------+-----+-----+-------+
| id| name|mark1|mark2|version|
+---+-------+-----+-----+-------+
| 1| Priya | 80| 99| 0|
| 2| Teju | 10| 5| 0|
+---+-------+-----+-----+-------+
val file2 = spark.read.format("csv").option("sep", ",").option("inferSchema", "true").option("header", "true").load("file2.csv")
file2.show()
+---+-------+-----+-----+-------+
| id| name|mark1|mark2|version|
+---+-------+-----+-----+-------+
| 1| Priya | 80| 99| 0|
| 2| Teju | 70| 5| 0|
+---+-------+-----+-----+-------+
Now I am comparing the two dataframes and filtering out the mismatched values like this.
val columns = file1.schema.fields.map(_.name)
val selectiveDifferences = columns.map(col => file1.select(col).except(file2.select(col)))
selectiveDifferences.map(diff => {if(diff.count > 0) diff.show})
+-----+
|mark1|
+-----+
| 10|
+-----+
I need to add an extra row to the dataframe for the mismatched value from dataframe 2, and update the version number, like this.
file1.show()
+---+-------+-----+-----+-------+
| id| name|mark1|mark2|version|
+---+-------+-----+-----+-------+
| 1| Priya | 80| 99| 0|
| 2| Teju | 10| 5| 0|
| 3| Teju | 70| 5| 1|
+---+-------+-----+-----+-------+
I am struggling to achieve the step above, which shows my expected output. Any help would be appreciated.

You can get your final dataframe by using except and union as follows:
val count = file1.count()
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
file1.union(
  file2.except(file1)
    .withColumn("version", lit(1))                                        // changing the version
    .withColumn("id", row_number.over(Window.orderBy("id")) + lit(count)) // changing the id number
)
lit, row_number and a window function are used to generate the new id and version values.
Note: using a window function without a partition to generate the new id makes the process inefficient, as all the data is collected on a single executor to assign the row numbers.
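If the single-executor window is a concern, one possible workaround (a sketch under assumptions, not part of the answer above) is to number the new rows with zipWithIndex on the underlying RDD, assuming the new ids only need to continue from the current maximum id:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
import org.apache.spark.sql.functions._

// highest existing id in file1 (assumes id was inferred as an integer column)
val maxId = file1.agg(max("id")).first().getInt(0)

// rows present in file2 but not in file1, with the bumped version and without the old id
val changed = file2.except(file1).withColumn("version", lit(1)).drop("id")

// zipWithIndex assigns consecutive indices without shuffling everything into one partition,
// unlike an unpartitioned window
val newRows = spark.createDataFrame(
  changed.rdd.zipWithIndex.map { case (row, idx) =>
    Row.fromSeq((maxId + idx.toInt + 1) +: row.toSeq)
  },
  StructType(StructField("id", IntegerType, nullable = true) +: changed.schema.fields)
)

val result = file1.union(newRows)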

Related

Calculate running total in Pyspark dataframes and break the loop when a condition occurs

I have a Spark dataframe where I need to calculate a running total of the amount (Col_y) based on the current and previous rows, per Col_x. Whenever a negative amount occurs in Col_y, I should break the running total of the previous records and start the running total again from the current row.
Sample dataset:
The expected output should be like:
How can I achieve this with dataframes using PySpark?
Another way
Imports used in the steps below
import sys
import pyspark.sql.functions as f
from pyspark.sql.window import Window
Create Index
df = df.rdd.map(lambda r: r).zipWithIndex().toDF(['value', 'index'])
Regenerate Columns
df = df.select('index', 'value.*')
Create groups bounded by negative values
w=Window.partitionBy().orderBy('index').rowsBetween(-sys.maxsize,0)
df=df.withColumn('cat', f.min('Col_y').over(w))
Cumsum within groups
y=Window.partitionBy('cat').orderBy(f.asc('index')).rowsBetween(Window.unboundedPreceding,0)
df.withColumn('cumsum', f.round(f.sum('Col_y').over(y),2)).sort('index').drop('cat','index').show()
Outcome
+-----+-------------------+------+
|Col_x| Col_y|cumsum|
+-----+-------------------+------+
| ID1|-17.899999618530273| -17.9|
| ID1| 21.899999618530273| 4.0|
| ID1| 236.89999389648438| 240.9|
| ID1| 4.989999771118164|245.89|
| ID1| 610.2000122070312|856.09|
| ID1| -35.79999923706055| -35.8|
| ID1| 21.899999618530273| -13.9|
| ID1| 17.899999618530273| 4.0|
+-----+-------------------+------+
I am hoping that in a real scenario you will have a timestamp column to order the data; here I am ordering by a line number generated with zipWithIndex for the purposes of the explanation.
from pyspark.sql.window import Window
import pyspark.sql.functions as f
from pyspark.sql.functions import *
from pyspark.sql.types import *
data = [
("ID1", -17.9),
("ID1", 21.9),
("ID1", 236.9),
("ID1", 4.99),
("ID1", 610.2),
("ID1", -35.8),
("ID1",21.9),
("ID1",17.9)
]
schema = StructType([
StructField('Col_x', StringType(),True), \
StructField('Col_y', FloatType(),True)
])
df = spark.createDataFrame(data=data, schema=schema)
df_1 = df.rdd.map(lambda r: r).zipWithIndex().toDF(['value', 'index'])
df_1.createOrReplaceTempView("valuewithorder")
w = Window.partitionBy('Col_x').orderBy('index')
w1 = Window.partitionBy('Col_x','group').orderBy('index')
df_final=spark.sql("select value.Col_x,round(value.Col_y,1) as Col_y, index from valuewithorder")
"""Group The data into different groups based on the negative value existance"""
df_final = df_final.withColumn("valueChange",(f.col('Col_y')<0).cast("int")) \
.fillna(0,subset=["valueChange"])\
.withColumn("indicator",(~((f.col("valueChange") == 0))).cast("int"))\
.withColumn("group",f.sum(f.col("indicator")).over(w.rangeBetween(Window.unboundedPreceding, 0)))
"""Cumlative sum with idfferent parititon of group and col_x"""
df_cum_sum = df_final.withColumn("Col_z", sum('Col_y').over(w1))
df_cum_sum.createOrReplaceTempView("FinalCumSum")
df_cum_sum = spark.sql("select Col_x , Col_y ,round(Col_z,1) as Col_z from FinalCumSum")
df_cum_sum.show()
Results of the intermediate data set and the final result:
>>> df_cum_sum.show()
+-----+-----+-----+
|Col_x|Col_y|Col_z|
+-----+-----+-----+
| ID1|-17.9|-17.9|
| ID1| 21.9| 4.0|
| ID1|236.9|240.9|
| ID1| 5.0|245.9|
| ID1|610.2|856.1|
| ID1|-35.8|-35.8|
| ID1| 21.9|-13.9|
| ID1| 17.9| 4.0|
+-----+-----+-----+
>>> df_final.show()
+-----+-----+-----+-----------+---------+-----+
|Col_x|Col_y|index|valueChange|indicator|group|
+-----+-----+-----+-----------+---------+-----+
| ID1|-17.9| 0| 1| 1| 1|
| ID1| 21.9| 1| 0| 0| 1|
| ID1|236.9| 2| 0| 0| 1|
| ID1| 5.0| 3| 0| 0| 1|
| ID1|610.2| 4| 0| 0| 1|
| ID1|-35.8| 5| 1| 1| 2|
| ID1| 21.9| 6| 0| 0| 2|
| ID1| 17.9| 7| 0| 0| 2|
+-----+-----+-----+-----------+---------+-----+

Pyspark filter where value is in another dataframe

I have two data frames. I need to filter one to only show values that are contained in the other.
table_a:
+---+----+
|AID| foo|
+---+----+
| 1 | bar|
| 2 | bar|
| 3 | bar|
| 4 | bar|
+---+----+
table_b:
+---+
|BID|
+---+
| 1 |
| 2 |
+---+
In the end, I want to filter table_a down to only the IDs that are in table_b, like this:
+--+----+
|ID| foo|
+--+----+
| 1| bar|
| 2| bar|
+--+----+
Here is what I'm trying to do
result_table = table_a.filter(table_b.BID.contains(table_a.AID))
But this doesn't seem to be working. It looks like I'm getting ALL values.
NOTE: I can't add any imports other than from pyspark.sql.functions import col
You can join the two tables and specify how = 'left_semi'.
A left semi-join returns rows from the left side of the relation that have a match on the right.
result_table = table_a.join(table_b, (table_a.AID == table_b.BID), \
how = "left_semi").drop("BID")
result_table.show()
+---+---+
|AID|foo|
+---+---+
| 1|bar|
| 2|bar|
+---+---+
In case you have duplicate or multiple values in the second dataframe and you want to use only the distinct values, the approach below can be useful for such use cases.
Create the dataframes
df = spark.createDataFrame([(1,"bar"),(2,"bar"),(3,"bar"),(4,"bar")],[ "col1","col2"])
df_lookup = spark.createDataFrame([(1,1),(1,2)],[ "id","val"])
df.show(truncate=True)
df_lookup.show()
+----+----+
|col1|col2|
+----+----+
| 1| bar|
| 2| bar|
| 3| bar|
| 4| bar|
+----+----+
+---+---+
| id|val|
+---+---+
| 1| 1|
| 1| 2|
+---+---+
Get all the unique values of the val column in the second dataframe and collect them into a list variable.
import pyspark.sql.functions as F
df_lookup_var = df_lookup.groupBy("id").agg(F.collect_set("val").alias("val")).collect()[0][1]
print(df_lookup_var)
df = df.withColumn("case_col", F.when(F.col("col1").isin(df_lookup_var), F.lit("1")).otherwise(F.lit("0")))
df = df.filter(F.col("case_col") == F.lit("1"))
df.show()
+----+----+--------+
|col1|col2|case_col|
+----+----+--------+
| 1| bar| 1|
| 2| bar| 1|
+----+----+--------+
This should work too, as long as table_b is small enough to collect to the driver:
table_a.where(col("AID").isin([row.BID for row in table_b.collect()]))

Scala Spark creating a new column in the dataframe based on the aggregate count of values in another column

I have a Spark dataframe like the one below.
+-----+----------+----------+
| ID| date| count |
+-----+----------+----------+
|54500|2016-05-02| 0|
|54500|2016-05-09| 0|
|54500|2016-05-16| 0|
|54500|2016-05-23| 0|
|54500|2016-06-06| 0|
|54500|2016-06-13| 0|
|54441|2016-06-20| 0|
|54441|2016-06-27| 0|
|54441|2016-07-04| 0|
|54441|2016-07-11| 0|
+-----+----------+----------+
I want to add an additional column that contains the count of records for each ID in the dataframe, while avoiding a for loop. The target dataframe looks like the one below.
+-----+----------+----------+
| ID| date| count |
+-----+----------+----------+
|54500|2016-05-02| 6|
|54500|2016-05-09| 6|
|54500|2016-05-16| 6|
|54500|2016-05-23| 6|
|54500|2016-06-06| 6|
|54500|2016-06-13| 6|
|54441|2016-06-20| 4|
|54441|2016-06-27| 4|
|54441|2016-07-04| 4|
|54441|2016-07-11| 4|
+-----+----------+----------+
I tried this:
import org.apache.spark.sql.expressions.Window
var s = Window.partitionBy("ID")
var df2 = df.withColumn("count", count.over(s))
This gives the following error:
error: ambiguous reference to overloaded definition,
both method count in object functions of type (columnName: String)org.apache.spark.sql.TypedColumn[Any,Long]
and method count in object functions of type (e: org.apache.spark.sql.Column)org.apache.spark.sql.Column
match expected type ?
Follow the approach below:
import spark.implicits._
val df1 = List(54500, 54500, 54500, 54500, 54500, 54500, 54441, 54441, 54441, 54441).toDF("ID")
val df2 = df1.groupBy("ID").count()
df1.join(df2, Seq("ID"), "left").show(false)
+-----+-----+
|ID |count|
+-----+-----+
|54500|6 |
|54500|6 |
|54500|6 |
|54500|6 |
|54500|6 |
|54500|6 |
|54441|4 |
|54441|4 |
|54441|4 |
|54441|4 |
+-----+-----+
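As a side note, the window-based attempt in the question appears to fail only because count in org.apache.spark.sql.functions is overloaded (String vs Column argument); passing an explicit Column should disambiguate it. A minimal sketch, assuming df is the dataframe from the question:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count}

val s = Window.partitionBy("ID")
// count(col("ID")) selects the Column overload, avoiding the ambiguous-reference error,
// and replaces the existing count column with the per-ID record count
val df2 = df.withColumn("count", count(col("ID")).over(s))
This keeps everything in a single pass, without the extra groupBy and join.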

Pyspark Autonumber over a partitioning column

I have a column in my data frame that is sensitive. I need to replace the sensitive values with numbers, but have to do it so that the distinct count of the column in question stays accurate. I was thinking of a SQL function over a window partition, but couldn't find a way.
A sample dataframe is below.
df = (sc.parallelize([
{"sensitive_id":"1234"},
{"sensitive_id":"1234"},
{"sensitive_id":"1234"},
{"sensitive_id":"2345"},
{"sensitive_id":"2345"},
{"sensitive_id":"6789"},
{"sensitive_id":"6789"},
{"sensitive_id":"6789"},
{"sensitive_id":"6789"}
]).toDF()
.cache()
)
I would like to create a dataframe like the one below.
What is a way to get this done?
You are looking for the dense_rank function:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df.withColumn(
    "non_sensitive_id",
    F.dense_rank().over(Window.partitionBy().orderBy("sensitive_id"))
).show()
+------------+----------------+
|sensitive_id|non_sensitive_id|
+------------+----------------+
| 1234| 1|
| 1234| 1|
| 1234| 1|
| 2345| 2|
| 2345| 2|
| 6789| 3|
| 6789| 3|
| 6789| 3|
| 6789| 3|
+------------+----------------+
This is another way of doing it; it may not be very efficient because join() will involve a shuffle.
Creating the DataFrame -
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number
df = sqlContext.createDataFrame([(1234,),(1234,),(1234,),(2345,),(2345,),(6789,),(6789,),(6789,),(6789,)],['sensitive_id'])
Creating a DataFrame of distinct elements and labeling them 1,2,3... and finally joining the two dataframes.
df_distinct = df.select('sensitive_id').distinct().withColumn('non_sensitive_id', row_number().over(Window.orderBy('sensitive_id')))
df = df.join(df_distinct, ['sensitive_id'],how='left').orderBy('sensitive_id')
df.show()
+------------+----------------+
|sensitive_id|non_sensitive_id|
+------------+----------------+
| 1234| 1|
| 1234| 1|
| 1234| 1|
| 2345| 2|
| 2345| 2|
| 6789| 3|
| 6789| 3|
| 6789| 3|
| 6789| 3|
+------------+----------------+

How to create a sequence of events (column values) per some other column?

I have a Spark data frame as shown below -
val myDF = Seq(
(1,"A",100,0,0),
(1,"E",200,0,0),
(1,"",300,1,49),
(2,"A",200,0,0),
(2,"C",300,0,0),
(2,"D",100,0,0)
).toDF("visitor","channel","timestamp","purchase_flag","amount")
scala> myDF.show
+-------+-------+---------+-------------+------+
|visitor|channel|timestamp|purchase_flag|amount|
+-------+-------+---------+-------------+------+
| 1| A| 100| 0| 0|
| 1| E| 200| 0| 0|
| 1| | 300| 1| 49|
| 2| A| 200| 0| 0|
| 2| C| 300| 0| 0|
| 2| D| 100| 0| 0|
+-------+-------+---------+-------------+------+
I would like to create a sequence dataframe from myDF that, for every visitor, traces the visitor's path to purchase, ordered by the timestamp dimension.
The output dataframe should look like the one below (-> can be any delimiter):
+-------+---------------------+
|visitor|channel sequence |
+-------+---------------------+
| 1| A->E->purchase |
| 2| D->A->C->no_purchase|
+-------+---------------------+
To make things clear, visitor 2 has been exposed to channel D, then A, and then C, and does not make a purchase.
Hence the sequence is to be formed as D->A->C->no_purchase.
NOTE: Whenever a purchase happens, channel value goes blank and purchase_flag is set to 1.
I want to do this using a Scala UDF in Spark so that I can re-apply the method to other datasets.
Here's how it is done using the udf function:
val myDF = Seq(
(1,"A",100,0,0),
(1,"E",200,0,0),
(1,"",300,1,49),
(2,"A",200,0,0),
(2,"C",300,0,0),
(2,"D",100,0,0)
).toDF("visitor","channel","timestamp","purchase_flag","amount")
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
def sequenceUdf = udf((struct: Seq[Row], purchased: Seq[Int]) =>
  struct.map(row => (row.getAs[String]("channel"), row.getAs[Int]("timestamp")))
    .sortBy(_._2).map(_._1).filterNot(_ == "")
    .mkString("->") + { if (purchased.contains(1)) "->purchase" else "->no_purchase" })
myDF.groupBy("visitor").agg(collect_list(struct("channel", "timestamp")).as("struct"), collect_list("purchase_flag").as("purchased"))
.select(col("visitor"), sequenceUdf(col("struct"), col("purchased")).as("channel sequence"))
.show(false)
which should give you
+-------+--------------------+
|visitor|channel sequence |
+-------+--------------------+
|1 |A->E->purchase |
|2 |D->A->C->no_purchase|
+-------+--------------------+
You can make it as generic as you need; this is just a demo of how you could proceed.
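For example, here is a hedged sketch of one way to generalize the same UDF (the helper name and parameters are illustrative, not from the answer above): the delimiter and the purchase/no-purchase labels become arguments of a small factory that builds the UDF.
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

// illustrative factory: same logic as sequenceUdf, with configurable delimiter and labels
def sequenceUdfWith(delimiter: String, purchaseLabel: String, noPurchaseLabel: String) =
  udf((struct: Seq[Row], purchased: Seq[Int]) => {
    val path = struct
      .map(row => (row.getAs[String]("channel"), row.getAs[Int]("timestamp")))
      .sortBy(_._2)        // order the visits by timestamp
      .map(_._1)
      .filterNot(_ == "")  // drop the blank channel emitted on the purchase row
    (path :+ (if (purchased.contains(1)) purchaseLabel else noPurchaseLabel)).mkString(delimiter)
  })

// usage, with the same aggregation as above:
// .select(col("visitor"), sequenceUdfWith("->", "purchase", "no_purchase")(col("struct"), col("purchased")).as("channel sequence"))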