How to combine dataframes with no common columns? - scala

I have 2 data frames
val df1 = Seq(("1","2","3"),("4","5","6")).toDF("A","B","C")
df1.show
+---+---+---+
| A| B| C|
+---+---+---+
| 1| 2| 3|
| 4| 5| 6|
+---+---+---+
and
val df2 = Seq(("11","22","33"),("44","55","66")).toDF("D","E","F")
df2.show
+---+---+---+
| D| E| F|
+---+---+---+
| 11| 22| 33|
| 44| 55| 66|
+---+---+---+
I need to combine the ones above to get
val df3 = Seq(("1","2","3","","",""),("4","5","6","","",""),("","","","11","22","33"),("","","","44","55","66"))
.toDF("A","B","C","D","E","F")
df3.show
+---+---+---+---+---+---+
| A| B| C| D| E| F|
+---+---+---+---+---+---+
| 1| 2| 3| | | |
| 4| 5| 6| | | |
| | | | 11| 22| 33|
| | | | 44| 55| 66|
+---+---+---+---+---+---+
Right now I'm manually creating the missing columns in each dataframe to get to a common structure and then using a union. That code is specific to these dataframes and does not scale.
I'm looking for a solution that works with x dataframes with y columns each.

You can manually create missing columns in the two data frames and then union them:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
val allCols = df1.columns.toSet.union(df2.columns.toSet).toArray
val createMissingCols = (df: DataFrame, allCols: Array[String]) => allCols.foldLeft(df)(
(_df, _col) => if (_df.columns.contains(_col)) _df else _df.withColumn(_col, lit(""))
).select(allCols.head, allCols.tail: _*)
// select is needed to make sure the two data frames have the same order of columns
createMissingCols(df1, allCols).union(createMissingCols(df2, allCols)).show
+---+---+---+---+---+---+
| E| F| A| B| C| D|
+---+---+---+---+---+---+
| | | 1| 2| 3| |
| | | 4| 5| 6| |
| 22| 33| | | | 11|
| 55| 66| | | | 44|
+---+---+---+---+---+---+
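The same padding idea extends to any number of dataframes. The following is a minimal plain-Python sketch of the logic only, not Spark code: each "dataframe" is modelled as a list of dicts, and `pad_and_union` and the third table `df3` are hypothetical names introduced for illustration.

```python
# Plain-Python model: each "dataframe" is a list of dicts (column -> value).
def pad_and_union(tables, fill=""):
    # Collect every column name once, in order of first appearance.
    all_cols = []
    for table in tables:
        for row in table:
            for col in row:
                if col not in all_cols:
                    all_cols.append(col)
    # Rebuild each row against the full column list, filling the gaps.
    combined = [
        {col: row.get(col, fill) for col in all_cols}
        for table in tables
        for row in table
    ]
    return all_cols, combined


df1 = [{"A": "1", "B": "2", "C": "3"}, {"A": "4", "B": "5", "C": "6"}]
df2 = [{"D": "11", "E": "22", "F": "33"}, {"D": "44", "E": "55", "F": "66"}]
df3 = [{"G": "7"}]  # hypothetical third table, not part of the original question

cols, rows = pad_and_union([df1, df2, df3])
for row in rows:
    print([row[c] for c in cols])
```

In Spark itself the equivalent would be to compute the union of all column names once and fold the list of dataframes with the createMissingCols function shown above. On Spark 3.1 or later, unionByName with allowMissingColumns set to true may also be worth checking, since it pads missing columns with nulls.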

A much simpler way of doing this is creating a full outer join and setting the join expression/condition to false:
import org.apache.spark.sql.functions.lit
val df1 = Seq(("1","2","3"),("4","5","6")).toDF("A","B","C")
val df2 = Seq(("11","22","33"),("44","55","66")).toDF("D","E","F")
val joined = df1.join(df2, lit(false), "full")
joined.show()
+----+----+----+----+----+----+
| A| B| C| D| E| F|
+----+----+----+----+----+----+
| 1| 2| 3|null|null|null|
| 4| 5| 6|null|null|null|
|null|null|null| 11| 22| 33|
|null|null|null| 44| 55| 66|
+----+----+----+----+----+----+
If you then want to set the null values to empty strings, you can add:
val withEmptyString = joined.na.fill("")
withEmptyString.show()
+---+---+---+---+---+---+
| A| B| C| D| E| F|
+---+---+---+---+---+---+
| 1| 2| 3| | | |
| 4| 5| 6| | | |
| | | | 11| 22| 33|
| | | | 44| 55| 66|
+---+---+---+---+---+---+
So, in summary, df1.join(df2, lit(false), "full").na.fill("") should do the trick.
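To see why a join condition of false produces this shape, here is a small plain-Python sketch of full outer join semantics. It is an illustration under the assumption that rows are dicts, not how Spark implements the join, and `full_outer_join` is a hypothetical helper. Because the condition never matches, every left row is padded with nulls on the right and every right row is padded with nulls on the left.

```python
def full_outer_join(left, right, left_cols, right_cols, condition):
    result = []
    matched_right = set()
    for left_row in left:
        hit = False
        for i, right_row in enumerate(right):
            if condition(left_row, right_row):
                result.append({**left_row, **right_row})
                matched_right.add(i)
                hit = True
        if not hit:
            # Unmatched left row: right-hand columns become None (null).
            result.append({**left_row, **{c: None for c in right_cols}})
    for i, right_row in enumerate(right):
        if i not in matched_right:
            # Unmatched right row: left-hand columns become None (null).
            result.append({**{c: None for c in left_cols}, **right_row})
    return result


df1 = [{"A": "1", "B": "2", "C": "3"}, {"A": "4", "B": "5", "C": "6"}]
df2 = [{"D": "11", "E": "22", "F": "33"}, {"D": "44", "E": "55", "F": "66"}]

# A condition that is always False, like lit(false) in the answer.
joined = full_outer_join(df1, df2, ["A", "B", "C"], ["D", "E", "F"],
                         lambda left_row, right_row: False)

# Equivalent of na.fill(""): replace every None with an empty string.
filled = [{k: "" if v is None else v for k, v in row.items()} for row in joined]
```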

Related

create a new column to increment value when value resets to 1 in another column in pyspark

In a PySpark DataFrame, consider a column like [1,2,3,4,1,2,1,1,2,3,1,2,1,1,2].
I want to create a new column that increments each time the value resets to 1.
Expected output is [1,1,1,1,2,2,3,4,4,4,5,5,6,7,7].
I am a bit new to PySpark, so any help would be great.
I have written the logic as below:
def sequence(row_num):
    results = [1, ]
    flag = 1
    for col in range(0, len(row_num)-1):
        if row_num[col][0]>=row_num[col+1][0]:
            flag+=1
        results.append(flag)
    return results
However, I am not able to pass a column through a UDF. Please help me with this.
Your Dataframe:
df = spark.createDataFrame(
[
('1','a'),
('2','b'),
('3','c'),
('4','d'),
('1','e'),
('2','f'),
('1','g'),
('1','h'),
('2','i'),
('3','j'),
('1','k'),
('2','l'),
('1','m'),
('1','n'),
('2','o')
], ['group','label']
)
+-----+-----+
|group|label|
+-----+-----+
| 1| a|
| 2| b|
| 3| c|
| 4| d|
| 1| e|
| 2| f|
| 1| g|
| 1| h|
| 2| i|
| 3| j|
| 1| k|
| 2| l|
| 1| m|
| 1| n|
| 2| o|
+-----+-----+
You can create a flag and use a window function to calculate the cumulative sum. There is no need for a UDF:
from pyspark.sql import Window as W
from pyspark.sql import functions as F
w = W.partitionBy().orderBy('label').rowsBetween(W.unboundedPreceding, 0)
df\
.withColumn('Flag', F.when(F.col('group') == 1, 1).otherwise(0))\
.withColumn('Output', F.sum('Flag').over(w))\
.show()
+-----+-----+----+------+
|group|label|Flag|Output|
+-----+-----+----+------+
| 1| a| 1| 1|
| 2| b| 0| 1|
| 3| c| 0| 1|
| 4| d| 0| 1|
| 1| e| 1| 2|
| 2| f| 0| 2|
| 1| g| 1| 3|
| 1| h| 1| 4|
| 2| i| 0| 4|
| 3| j| 0| 4|
| 1| k| 1| 5|
| 2| l| 0| 5|
| 1| m| 1| 6|
| 1| n| 1| 7|
| 2| o| 0| 7|
+-----+-----+----+------+
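The flag-plus-running-total idea can be checked without Spark. This is a minimal plain-Python sketch that assumes the values are already in the intended row order (in the Spark version that order comes from the 'label' column):

```python
from itertools import accumulate

group = [1, 2, 3, 4, 1, 2, 1, 1, 2, 3, 1, 2, 1, 1, 2]

# Flag is 1 wherever the value resets to 1, otherwise 0.
flags = [1 if g == 1 else 0 for g in group]

# The running total of the flags is the incrementing counter.
output = list(accumulate(flags))
print(output)  # [1, 1, 1, 1, 2, 2, 3, 4, 4, 4, 5, 5, 6, 7, 7]
```

The window in the Spark answer has no partition column, so all rows go through a single partition. That is fine for small data but may become a bottleneck on large dataframes.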

How to count change in row values in pyspark

Logic to count the change in the row values of a given column
Input
df22 = spark.createDataFrame(
[(1, 1.0), (1,22.0), (1,22.0), (1,21.0), (1,20.0), (2, 3.0), (2,3.0),
(2, 5.0), (2, 10.0), (2,3.0), (3,11.0), (4, 11.0), (4,15.0), (1,22.0)],
("id", "v"))
+---+----+
| id| v|
+---+----+
| 1| 1.0|
| 1|22.0|
| 1|22.0|
| 1|21.0|
| 1|20.0|
| 2| 3.0|
| 2| 3.0|
| 2| 5.0|
| 2|10.0|
| 2| 3.0|
| 3|11.0|
| 4|11.0|
| 4|15.0|
+---+----+
Expected output
+---+----+---+
| id| v| c|
+---+----+---+
| 1| 1.0| 0|
| 1|22.0| 1|
| 1|22.0| 1|
| 1|21.0| 2|
| 1|20.0| 3|
| 2| 3.0| 0|
| 2| 3.0| 0|
| 2| 5.0| 1|
| 2|10.0| 2|
| 2| 3.0| 3|
| 3|11.0| 0|
| 4|11.0| 0|
| 4|15.0| 1|
+---+----+---+
Any help on this will be greatly appreciated
Thanks in advance
Ramabadran
Before answering, I would like to ask what you have tried. Please attempt something yourself before asking for support on this platform. Your question is also unclear: you have not said whether you want the delta capture count per 'id' or across the whole dataframe. An expected output alone does not make the question clear.
As for the question itself: if I understood the sample input and output correctly, you need the delta capture count per 'id'. One way to achieve it is shown below. Note that the window is ordered by 'v', so changes are counted in sorted order within each 'id' rather than in the original row order.
#Capture the incremented count using lag() and sum() over below mentioned window
import pyspark.sql.functions as F
from pyspark.sql.window import Window
winSpec=Window.partitionBy('id').orderBy('v') # Your Window for capturing the incremented count
df22.\
withColumn('prev',F.coalesce(F.lag('v').over(winSpec),F.col('v'))).\
withColumn('c',F.sum(F.expr("case when v-prev<>0 then 1 else 0 end")).over(winSpec)).\
drop('prev').\
orderBy('id','v').\
show()
+---+----+---+
| id| v| c|
+---+----+---+
| 1| 1.0| 0|
| 1|20.0| 1|
| 1|21.0| 2|
| 1|22.0| 3|
| 1|22.0| 3|
| 1|22.0| 3|
| 2| 3.0| 0|
| 2| 3.0| 0|
| 2| 3.0| 0|
| 2| 5.0| 1|
| 2|10.0| 2|
| 3|11.0| 0|
| 4|11.0| 0|
| 4|15.0| 1|
+---+----+---+
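The expected output in the question counts changes in the original row order, while the window above sorts by 'v', which is why the two tables differ. As a plain-Python sketch of the row-order logic (not Spark code; `count_changes` is a hypothetical helper, and it uses the 13 rows displayed in the question's table):

```python
def count_changes(rows):
    last_value = {}  # most recent v seen for each id
    counter = {}     # number of changes seen so far for each id
    out = []
    for key, value in rows:
        if key not in last_value:
            counter[key] = 0          # first row of an id starts at 0
        elif value != last_value[key]:
            counter[key] += 1         # value differs from the previous row
        last_value[key] = value
        out.append((key, value, counter[key]))
    return out


rows = [(1, 1.0), (1, 22.0), (1, 22.0), (1, 21.0), (1, 20.0),
        (2, 3.0), (2, 3.0), (2, 5.0), (2, 10.0), (2, 3.0),
        (3, 11.0), (4, 11.0), (4, 15.0)]

result = count_changes(rows)
for row in result:
    print(row)
```

To get the same behaviour in Spark, the dataframe would need an explicit column that records the original order (the sample data has none), and the window would order by that column instead of 'v'.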

Perform merge/insert on two spark dataframes with different schemas?

I have two Spark dataframes, DF and DF1, with different schemas.
DF:-
val DF = Seq(("1","acv","34","a","1"),("2","fbg","56","b","3"),("3","rty","78","c","5")).toDF("id","name","age","DBName","test")
+---+----+---+------+----+
| id|name|age|DBName|test|
+---+----+---+------+----+
| 1| acv| 34| a| 1|
| 2| fbg| 56| b| 3|
| 3| rty| 78| c| 5|
+---+----+---+------+----+
DF1:-
val DF1= Seq(("1","gbj","67","a","5"),("2","gbj","67","a","7"),("2","jku","88","b","8"),("4","jku","88","b","7"),("5","uuu","12","c","9")).toDF("id","name","age","DBName","col1")
+---+----+---+------+----+
| id|name|age|DBName|col1|
+---+----+---+------+----+
| 1| gbj| 67| a| 5|
| 2| gbj| 67| a| 7|
| 2| jku| 88| b| 8|
| 4| jku| 88| b| 7|
| 5| uuu| 12| c| 9|
+---+----+---+------+----+
I want to merge DF1 into DF based on the values of id and DBName. If a record with the same id and DBName already exists in DF, it should be updated; if it does not exist, the new record should be added. The resulting data frame should look like this:
+---+----+---+------+----+----+
| id|name|age|DBName|Test|col |
+---+----+---+------+----+----+
| 5| uuu| 12| c|NULL|9 |
| 2| jku| 88| b|NULL|8 |
| 4| jku| 88| b|NULL|7 |
| 1| gbj| 67| a|NULL|5 |
| 3| rty| 78| c|5 |NULL|
| 2| gbj| 67| a|NULL|7 |
+---+----+---+------+----+----+
What I have tried so far:
val updatedDF = DF.as("a").join(DF1.as("b"), $"a.id" === $"b.id" && $"a.DBName" === $"b.DBName", "outer").select(DF.columns.map(c => coalesce($"b.$c", $"b.$c") as c): _*)
Error:-
org.apache.spark.sql.AnalysisException: cannot resolve '`b.test`' given input columns: [b.DBName, a.DBName, a.name, b.age, a.id, a.age, b.id, a.test, b.name];;
You're selecting non-existent columns, and also there is a typo in the coalesce. You can follow the example below to fix your issue:
val updatedDF = DF.as("a").join(
DF1.as("b"),
$"a.id" === $"b.id" && $"a.DBName" === $"b.DBName",
"outer"
).select(
DF.columns.dropRight(1).map(c => coalesce($"b.$c", $"a.$c") as c)
:+ col(DF.columns.last)
:+ col(DF1.columns.last)
:_*
)
updatedDF.show
+---+----+---+------+----+----+
| id|name|age|DBName|test|col1|
+---+----+---+------+----+----+
| 5| uuu| 12| c|null| 9|
| 2| jku| 88| b| 3| 8|
| 4| jku| 88| b|null| 7|
| 1| gbj| 67| a| 1| 5|
| 3| rty| 78| c| 5|null|
| 2| gbj| 67| a|null| 7|
+---+----+---+------+----+----+
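The outer join with coalesce behaves like an upsert keyed on (id, DBName). The plain-Python sketch below models that logic with dicts; it is an illustration only, `merge` is a hypothetical helper, and it assumes each (id, DBName) pair appears at most once per table, as in the sample data.

```python
KEYS = ("id", "DBName")
SHARED = ("id", "name", "age", "DBName")


def merge(base, updates):
    by_key = {}
    for row in base:
        key = tuple(row[c] for c in KEYS)
        by_key[key] = {**{c: row[c] for c in SHARED},
                       "test": row["test"], "col1": None}
    for row in updates:
        key = tuple(row[c] for c in KEYS)
        # New key: start with test = None; existing key: keep its test value.
        merged = by_key.setdefault(key, {"test": None})
        # Like coalesce(b.c, a.c): values from the update win when present.
        merged.update({c: row[c] for c in SHARED})
        merged["col1"] = row["col1"]
    return list(by_key.values())


DF = [dict(zip(("id", "name", "age", "DBName", "test"), r)) for r in [
    ("1", "acv", "34", "a", "1"), ("2", "fbg", "56", "b", "3"),
    ("3", "rty", "78", "c", "5")]]
DF1 = [dict(zip(("id", "name", "age", "DBName", "col1"), r)) for r in [
    ("1", "gbj", "67", "a", "5"), ("2", "gbj", "67", "a", "7"),
    ("2", "jku", "88", "b", "8"), ("4", "jku", "88", "b", "7"),
    ("5", "uuu", "12", "c", "9")]]

merged = merge(DF, DF1)
index = {(r["id"], r["DBName"]): r for r in merged}
```

Rows that exist on both sides keep the 'test' value from DF and take everything else from DF1, which matches the rows for ids 1 and 2 (DBName a and b) in the output above.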

spark aggregation count on condition

I'm trying to group a data frame and, when aggregating the rows with a count, apply a condition to the rows before counting.
Here is an example:
val test=Seq(("A","X"),("A","X"),("B","O"),("B","O"),("c","O"),("c","X"),("d","X"),("d","O")).toDF
test.show
+---+---+
| _1| _2|
+---+---+
| A| X|
| A| X|
| B| O|
| B| O|
| c| O|
| c| X|
| d| X|
| d| O|
+---+---+
In this example I want to group by column _1 and count column _2 where the value = 'X'.
Here is the expected result:
+---+-----------+
| _1| count(_2) |
+---+-----------+
| A| 2 |
| B| 0 |
| c| 1 |
| d| 1 |
+---+-----------+
Use when inside count to get this aggregation: when returns null for rows that do not match, and count ignores nulls. PySpark solution shown here.
from pyspark.sql.functions import col, count, when
test.groupBy(col("_1")).agg(count(when(col("_2") == 'X', 1))).show()
The same in Scala:
import org.apache.spark.sql.functions.{count, when}
import spark.implicits._
val test=Seq(("A","X"),("A","X"),("B","O"),("B","O"),("c","O"),("c","X"),("d","X"),("d","O")).toDF
test.groupBy("_1").agg(count(when($"_2"==="X", 1)).as("count")).orderBy("_1").show
+---+-----+
| _1|count|
+---+-----+
| A| 2|
| B| 0|
| c| 1|
| d| 1|
+---+-----+
As an alternative in Scala, you can sum a 0/1 indicator column:
import org.apache.spark.sql.functions.{col, lit, sum, when}
val counter1 = test.select( col("_1"),
when(col("_2") === lit("X"), lit(1)).otherwise(lit(0)).as("_2"))
val agg1 = counter1.groupBy("_1").agg(sum("_2")).orderBy("_1")
agg1.show
This gives the result:
+---+-------+
| _1|sum(_2)|
+---+-------+
| A| 2|
| B| 0|
| c| 1|
| d| 1|
+---+-------+
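Both variants rest on the same idea: non-matching rows contribute nothing to the total. A plain-Python sketch of the count(when(...)) semantics follows; the `when` and `count` functions here are hypothetical stand-ins that only mimic the null handling, not the Spark functions themselves.

```python
def when(condition, value):
    # Without an otherwise branch, a non-matching row yields None (null).
    return value if condition else None


def count(values):
    # Like SQL count(expr): only non-null values are counted.
    return sum(1 for v in values if v is not None)


rows = [("A", "X"), ("A", "X"), ("B", "O"), ("B", "O"),
        ("c", "O"), ("c", "X"), ("d", "X"), ("d", "O")]

# Group the second column by the first, keeping first-seen key order.
groups = {}
for key, value in rows:
    groups.setdefault(key, []).append(value)

counts = {key: count(when(v == "X", 1) for v in values)
          for key, values in groups.items()}
print(counts)  # {'A': 2, 'B': 0, 'c': 1, 'd': 1}
```

Group B still appears with a count of 0, which filtering the rows before grouping would not give you.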

Scala Spark - Map function referencing another dataframe

I have two dataframes:
df1:
+---+------+----+
| id|weight|time|
+---+------+----+
| A| 0.1| 1|
| A| 0.2| 2|
| A| 0.3| 4|
| A| 0.4| 5|
| B| 0.5| 1|
| B| 0.7| 3|
| B| 0.8| 6|
| B| 0.9| 7|
| B| 1.0| 8|
+---+------+----+
df2:
+---+---+-------+-----+
| id| t|t_start|t_end|
+---+---+-------+-----+
| A| t1| 0| 3|
| A| t2| 4| 6|
| A| t3| 7| 9|
| B| t1| 0| 2|
| B| t2| 3| 6|
| B| t3| 7| 9|
+---+---+-------+-----+
My desired output is to identify the 't' for each time stamp in df1, where the ranges of 't' are in df2.
df_output:
+---+------+----+---+
| id|weight|time| t |
+---+------+----+---+
| A| 0.1| 1| t1|
| A| 0.2| 2| t1|
| A| 0.3| 4| t2|
| A| 0.4| 5| t2|
| B| 0.5| 1| t1|
| B| 0.7| 3| t2|
| B| 0.8| 6| t2|
| B| 0.9| 7| t3|
| B| 1.0| 8| t3|
+---+------+----+---+
My understanding so far is that I must create a UDF that takes the columns 'id' and 'time' as inputs and, for each row, refers to df2.filter(df2.id == df1.id, df1.time >= df2.t_start, df1.time <= df2.t_end) to get the corresponding df2.t.
I'm very new to Scala and Spark, so I am wondering whether this solution is even possible.
You cannot use a UDF for that, but all you have to do is reuse the filter condition you already defined to join both frames:
df1.join(
df2,
df2("id") === df1("id") && df1("time").between(df2("t_start"), df2("t_end"))
)
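The join condition pairs each df1 row with the df2 row that has the same id and whose [t_start, t_end] range contains the time. As a plain-Python sketch of that matching (tuples stand in for rows; this is an illustration, not Spark code):

```python
# (id, weight, time)
df1 = [("A", 0.1, 1), ("A", 0.2, 2), ("A", 0.3, 4), ("A", 0.4, 5),
       ("B", 0.5, 1), ("B", 0.7, 3), ("B", 0.8, 6), ("B", 0.9, 7),
       ("B", 1.0, 8)]

# (id, t, t_start, t_end)
df2 = [("A", "t1", 0, 3), ("A", "t2", 4, 6), ("A", "t3", 7, 9),
       ("B", "t1", 0, 2), ("B", "t2", 3, 6), ("B", "t3", 7, 9)]

# Inner join on: same id AND t_start <= time <= t_end (between is inclusive).
output = [
    (row_id, weight, time, t)
    for (row_id, weight, time) in df1
    for (range_id, t, t_start, t_end) in df2
    if row_id == range_id and t_start <= time <= t_end
]

for row in output:
    print(row)
```

Because this is an inner join, a df1 row whose time falls outside every range for its id would be dropped. A left join would keep such rows with a null 't'. The joined result also carries the columns of both frames, so a select is still needed to reduce it to id, weight, time and t.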