How to subtract DataFrames using subset of columns in Apache Spark

How to subtract DataFrames using subset of columns in Apache Spark - scala

How can I perform filter operation on Dataframe1 using Dataframe2.
I want to remove rows from DataFrame1 for below matching condition
Dataframe1.col1 = Dataframe2.col1
Dataframe1.col2 = Dataframe2.col2
My question is different than substract two dataframes because while substract we use all columns but in my question I want to use limited number of columns

join with "left_anti"
scala> df1.show
+----+-----+-----+
|col1| col2| col3|
+----+-----+-----+
| 1| one| ek|
| 2| two| dho|
| 3|three|theen|
| 4| four|chaar|
+----+-----+-----+
scala> df2.show
+----+----+-----+
|col1|col2| col3|
+----+----+-----+
| 2| two| dho|
| 4|four|chaar|
+----+----+-----+
scala> df1.join(df2, Seq("col1", "col2"), "left_anti").show
+----+-----+-----+
|col1| col2| col3|
+----+-----+-----+
| 1| one| ek|
| 3|three|theen|
+----+-----+-----+

Possible duplicate of :Spark: subtract two DataFrames if both datasets have exact same coulmns
If you want custom join condition then you can use "anti" join. Here is the pysaprk version
Creating two data frames:
Dataframe1 :
l1 = [('col1_row1', 10), ('col1_row2', 20), ('col1_row3', 30)
df1 = spark.createDataFrame(l1).toDF('col1','col2')
df1.show()
+---------+----+
| col1|col2|
+---------+----+
|col1_row1| 10|
|col1_row2| 20|
|col1_row3| 30|
+---------+----+
Dataframe2 :
l2 = [('col1_row1', 10), ('col1_row2', 20), ('col1_row4', 40)]
df2 = spark.createDataFrame(l2).toDF('col1','col2')
df2.show()
+---------+----+
| col1|col2|
+---------+----+
|col1_row1| 10|
|col1_row2| 20|
|col1_row4| 40|
+---------+----+
Using subtract api :
df_final = df1.subtract(df2)
df_final.show()
+---------+----+
| col1|col2|
+---------+----+
|col1_row3| 30|
+---------+----+
Using left_anti :
Join condition:
join_condition = [df1["col1"] == df2["col1"], df1["col2"] == df2["col2"]]
Join finally
df_final = df1.join(df2, join_condition, 'left_anti')
df_final.show()
+---------+----+
| col1|col2|
+---------+----+
|col1_row3| 30|
+---------+----+

Related

Join two dataframes and replace the original column values using Spark Scala

I have two DFs
df1:
+---+-----+--------+
|key|price| date|
+---+-----+--------+
| 1| 1.0|20210101|
| 2| 2.0|20210101|
| 3| 3.0|20210101|
+---+-----+--------+
df2:
+---+-----+
|key|price|
+---+-----+
| 1| 1.1|
| 2| 2.2|
| 3| 3.3|
+---+-----+
I'd like to replace price column values from df1 with price values from df2 where df1.key == df2.key
Expected output:
+---+-----+--------+
|key|price| date|
+---+-----+--------+
| 1| 1.1|20210101|
| 2| 2.1|20210101|
| 3| 3.3|20210101|
+---+-----+--------+
I've found some solutions in python but I couldn't come up with a working solution in Scala.

Simply join + drop df1 column price:
val df = df1.join(df2, Seq("key")).drop(df1("price"))
df.show
//+---+-----+--------+
//|key|price| date|
//+---+-----+--------+
//| 1| 1.1|20210101|
//| 2| 2.2|20210101|
//| 3| 3.3|20210101|
//+---+-----+--------+
Or if you have more entries in df1 and you want to keep their price when there is no match in df2 then use left join + coalesce expression:
val df = df1.join(df2, Seq("key"), "left").select(
col("key"),
col("date"),
coalesce(df2("price"), df1("price")).as("price")
)

Calculate running total in Pyspark dataframes and break the loop when a condition occurs

I have a spark dataframe, where I need to calculate a running total based on the current and previous row sum of amount valued based on the col_x. when ever there is occurance of negative amount in col_y, I should break the running total of previous records, and start doing the running total from current row.
Sample dataset:
The expected output should be like:
How to acheive this with dataframes using pyspark?

Another way
Create Index
df = df.rdd.map(lambda r: r).zipWithIndex().toDF(['value', 'index'])
Regenerate Columns
df = df.select('index', 'value.*')#.show()
Create groups bounded by negative values
w=Window.partitionBy().orderBy('index').rowsBetween(-sys.maxsize,0)
df=df.withColumn('cat', f.min('Col_y').over(w))
Cumsum within groups
y=Window.partitionBy('cat').orderBy(f.asc('index')).rowsBetween(Window.unboundedPreceding,0)
df.withColumn('cumsum', f.round(f.sum('Col_y').over(y),2)).sort('index').drop('cat','index').show()
Outcome
+-----+-------------------+------+
|Col_x| Col_y|cumsum|
+-----+-------------------+------+
| ID1|-17.899999618530273| -17.9|
| ID1| 21.899999618530273| 4.0|
| ID1| 236.89999389648438| 240.9|
| ID1| 4.989999771118164|245.89|
| ID1| 610.2000122070312|856.09|
| ID1| -35.79999923706055| -35.8|
| ID1| 21.899999618530273| -13.9|
| ID1| 17.899999618530273| 4.0|
+-----+-------------------+------+

I am hoping in real scenario you will be having a timestamp column to do ordering of the data, I am ordering the data using line number with zipindex for the explanation basis here.
from pyspark.sql.window import Window
import pyspark.sql.functions as f
from pyspark.sql.functions import *
from pyspark.sql.types import *
data = [
("ID1", -17.9),
("ID1", 21.9),
("ID1", 236.9),
("ID1", 4.99),
("ID1", 610.2),
("ID1", -35.8),
("ID1",21.9),
("ID1",17.9)
]
schema = StructType([
StructField('Col_x', StringType(),True), \
StructField('Col_y', FloatType(),True)
])
df = spark.createDataFrame(data=data, schema=schema)
df_1 = df.rdd.map(lambda r: r).zipWithIndex().toDF(['value', 'index'])
df_1.createOrReplaceTempView("valuewithorder")
w = Window.partitionBy('Col_x').orderBy('index')
w1 = Window.partitionBy('Col_x','group').orderBy('index')
df_final=spark.sql("select value.Col_x,round(value.Col_y,1) as Col_y, index from valuewithorder")
"""Group The data into different groups based on the negative value existance"""
df_final = df_final.withColumn("valueChange",(f.col('Col_y')<0).cast("int")) \
.fillna(0,subset=["valueChange"])\
.withColumn("indicator",(~((f.col("valueChange") == 0))).cast("int"))\
.withColumn("group",f.sum(f.col("indicator")).over(w.rangeBetween(Window.unboundedPreceding, 0)))
"""Cumlative sum with idfferent parititon of group and col_x"""
df_cum_sum = df_final.withColumn("Col_z", sum('Col_y').over(w1))
df_cum_sum.createOrReplaceTempView("FinalCumSum")
df_cum_sum = spark.sql("select Col_x , Col_y ,round(Col_z,1) as Col_z from FinalCumSum")
df_cum_sum.show()
Results of intermedite data set and results
>>> df_cum_sum.show()
+-----+-----+-----+
|Col_x|Col_y|Col_z|
+-----+-----+-----+
| ID1|-17.9|-17.9|
| ID1| 21.9| 4.0|
| ID1|236.9|240.9|
| ID1| 5.0|245.9|
| ID1|610.2|856.1|
| ID1|-35.8|-35.8|
| ID1| 21.9|-13.9|
| ID1| 17.9| 4.0|
+-----+-----+-----+
>>> df_final.show()
+-----+-----+-----+-----------+---------+-----+
|Col_x|Col_y|index|valueChange|indicator|group|
+-----+-----+-----+-----------+---------+-----+
| ID1|-17.9| 0| 1| 1| 1|
| ID1| 21.9| 1| 0| 0| 1|
| ID1|236.9| 2| 0| 0| 1|
| ID1| 5.0| 3| 0| 0| 1|
| ID1|610.2| 4| 0| 0| 1|
| ID1|-35.8| 5| 1| 1| 2|
| ID1| 21.9| 6| 0| 0| 2|
| ID1| 17.9| 7| 0| 0| 2|
+-----+-----+-----+-----------+---------+-----+

Unsure how to apply row-wise normalization on pyspark dataframe

Disclaimer: I'm a beginner when it comes to Pyspark.
For each cell in a row, I'd like to apply the following function
new_col_i = col_i / max(col_1,col_2,col_3,...,col_n)
At the very end, I'd like the range of values to go from 0.0 to 1.0.
Here are the details of my dataframe:
Dimensions: (6.5M, 2905)
Dtypes: Double
Initial DF:
+-----+-------+-------+-------+
|. id| col_1| col_2| col_n |
+-----+-------+-------+-------+
| 1| 7.5| 0.1| 2.0|
| 2| 0.3| 3.5| 10.5|
+-----+-------+-------+-------+
Updated DF:
+-----+-------+-------+-------+
|. id| col_1| col_2| col_n |
+-----+-------+-------+-------+
| 1| 1.0| 0.013| 0.26|
| 2| 0.028| 0.33| 1.0|
+-----+-------+-------+-------+
Any help would be appreciated.

You can find the maximum value from an array of columns and loop your dataframe to replace the normalized column value.
cols = df.columns[1:]
import builtins as p
df2 = df.withColumn('max', array_max(array(*[col(c) for c in cols]))) \
for c in cols:
df2 = df2.withColumn(c, col(c) / col('max'))
df2.show()
+---+-------------------+--------------------+-------------------+----+
| id| col_1| col_2| col_n| max|
+---+-------------------+--------------------+-------------------+----+
| 1| 1.0|0.013333333333333334|0.26666666666666666| 7.5|
| 2|0.02857142857142857| 0.3333333333333333| 1.0|10.5|
+---+-------------------+--------------------+-------------------+----+

Spark dataframe filter

val df = sc.parallelize(Seq((1,"Emailab"), (2,"Phoneab"), (3, "Faxab"),(4,"Mail"),(5,"Other"),(6,"MSL12"),(7,"MSL"),(8,"HCP"),(9,"HCP12"))).toDF("c1","c2")
+---+-------+
| c1| c2|
+---+-------+
| 1|Emailab|
| 2|Phoneab|
| 3| Faxab|
| 4| Mail|
| 5| Other|
| 6| MSL12|
| 7| MSL|
| 8| HCP|
| 9| HCP12|
+---+-------+
I want to filter out records which have first 3 characters of column 'c2' either 'MSL' or 'HCP'.
So the output should be like below.
+---+-------+
| c1| c2|
+---+-------+
| 1|Emailab|
| 2|Phoneab|
| 3| Faxab|
| 4| Mail|
| 5| Other|
+---+-------+
Can any one please help on this?
I knew that df.filter($"c2".rlike("MSL")) -- This is for selecting the records but how to exclude the records. ?
Version: Spark 1.6.2
Scala : 2.10

This works too. Concise and very similar to SQL.
df.filter("c2 not like 'MSL%' and c2 not like 'HCP%'").show
+---+-------+
| c1| c2|
+---+-------+
| 1|Emailab|
| 2|Phoneab|
| 3| Faxab|
| 4| Mail|
| 5| Other|
+---+-------+

df.filter(not(
substring(col("c2"), 0, 3).isin("MSL", "HCP"))
)

I used below to filter rows from dataframe and this worked form me.Spark 2.2
val spark = new org.apache.spark.sql.SQLContext(sc)
val data = spark.read.format("csv").
option("header", "true").
option("delimiter", "|").
option("inferSchema", "true").
load("D:\\test.csv")
import spark.implicits._
val filter=data.filter($"dept" === "IT" )
OR
val filter=data.filter($"dept" =!= "IT" )

val df1 = df.filter(not(df("c2").rlike("MSL"))&&not(df("c2").rlike("HCP")))
This worked.

In Spark Dataframe how to get duplicate records and distinct records in two dataframes?

I am working on a problem in which I am loading data from a hive table into spark dataframe and now I want all the unique accts in 1 dataframe and all duplicates in another. for example if I have acct id 1,1,2,3,4. I want to get 2,3,4 in one dataframe and 1,1 in another. How can I do this?

Depending on the version of spark you have, you could use window functions in datasets/sql like below:
Dataset<Row> New = df.withColumn("Duplicate", count("*").over( Window.partitionBy("id") ) );
Dataset<Row> Dups = New.filter(col("Duplicate").gt(1));
Dataset<Row> Uniques = New.filter(col("Duplicate").equalTo(1));
the above is written in java. should be similar in scala and read this on how to do in python.
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

df.groupBy($"field1",$"field2"...).count.filter($"count" > 1).show()

val acctDF = List(("1", "Acc1"), ("1", "Acc1"), ("1", "Acc1"), ("2", "Acc2"), ("2", "Acc2"), ("3", "Acc3")).toDF("AcctId", "Details")
scala> acctDF.show()
+------+-------+
|AcctId|Details|
+------+-------+
| 1| Acc1|
| 1| Acc1|
| 1| Acc1|
| 2| Acc2|
| 2| Acc2|
| 3| Acc3|
+------+-------+
// Need to convert the DF to rdd to apply map and reduceByKey and again to DF to use it further more
val countsDF = acctDF.rdd.map(rec => (rec(0), 1)).reduceByKey(_+_).map(rec=> (rec._1.toString, rec._2)).toDF("AcctId", "AcctCount")
val accJoinedDF = acctDF.join(countsDF, acctDF("AcctId")===countsDF("AcctId"), "left_outer").select(acctDF("AcctId"), acctDF("Details"), countsDF("AcctCount"))
scala> accJoinedDF.show()
+------+-------+---------+
|AcctId|Details|AcctCount|
+------+-------+---------+
| 1| Acc1| 3|
| 1| Acc1| 3|
| 1| Acc1| 3|
| 2| Acc2| 2|
| 2| Acc2| 2|
| 3| Acc3| 1|
+------+-------+---------+
val distAcctDF = accJoinedDF.filter($"AcctCount"===1)
scala> distAcctDF.show()
+------+-------+---------+
|AcctId|Details|AcctCount|
+------+-------+---------+
| 3| Acc3| 1|
+------+-------+---------+
val duplAcctDF = accJoinedDF.filter($"AcctCount">1)
scala> duplAcctDF.show()
+------+-------+---------+
|AcctId|Details|AcctCount|
+------+-------+---------+
| 1| Acc1| 3|
| 1| Acc1| 3|
| 1| Acc1| 3|
| 2| Acc2| 2|
| 2| Acc2| 2|
+------+-------+---------+
(OR scala> duplAcctDF.distinct.show() )

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

How to subtract DataFrames using subset of columns in Apache Spark - scala

Related

Join two dataframes and replace the original column values using Spark Scala

Calculate running total in Pyspark dataframes and break the loop when a condition occurs

Unsure how to apply row-wise normalization on pyspark dataframe

Spark dataframe filter

In Spark Dataframe how to get duplicate records and distinct records in two dataframes?

Categories

Resources