How to fill missing values with values from other dataframes - scala

I have one data frame with an ID:String column, a Type:Int column and a Name:String column.
This data frame has a lot of missing values in the Name column.
But I also have three other dataframes that contain an ID column and a Name column.
What I'd like to do is to fill the missing values in the first Dataframe with values from the others. The other dataframes do not contain all the IDs belonging to the first dataframe, plus they can also contain IDs that are not present in the first dataframe.
What is the right approach in this case? I know I can combine two DFs like:
df1.join(df2, df1("ID")===df2("ID"), "left_outer")
But since I know that all entries in the first dataframe where type=2 already have a name, I'd like to restrict this join to rows where type=1.
Any idea how I can retrieve Name values from the three DFs in order to fill the Name column in my original dataframe?

You can split, join the subset of interest and gather everything back:
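import org.apache.spark.sql.functions.coalesce
// The $"..." syntax also needs the session implicits in scope, e.g. import spark.implicits._

df1
  // Select the rows that may require filling
  .where($"type" === 1)
  // Join
  .join(df2, Seq("ID"), "left_outer")
  // Replace NULL if needed
  .select($"ID", $"Type", coalesce(df1("Name"), df2("Name")).alias("Name"))
  // Union with the subset which doesn't require filling
  .union(df1.where($"type" === 2)) // Or =!= 1 as suggested by @AlbertoBonsanto
If the type column is nullable, you should cover that case separately with an additional .union(df1.where($"type".isNull)), because neither === 2 nor =!= 1 matches rows where type is NULL.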

Related

Compare two dataframes row count. Assign dataframe with high row count to a new dataframe object

I have two physical nodes that are not synchronised.
Both nodes produce captured data (the two-node setup was put in place for resilience).
I am facing the following challenge:
The nodes produce two near-identical files (timestamps may not be the same, and there is no unique identifier that could be used to remove duplicates). Both frames share the same schema.
Is there a way to write something like the following with PySpark data frames:
df3= case
when df1.count()<df2.count() then df2,
when df1.count()>df2.count() then df1,
ELSE df1
I resolved this by defining a "comparison" function:
def compare(df1, df2):
    if df1.count() > df2.count():
        return df1
    if df1.count() < df2.count():
        return df2
    else:
        return df1
It seems that dataframes can be handled as ordinary objects by passing them to and returning them from functions.
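The comparison function above calls count() up to four times, and each call can trigger a full Spark job. A minimal sketch that evaluates each count once and accepts any number of frames is shown below. The name pick_largest is hypothetical, and the only assumption about its arguments is that each has a count() method, as PySpark DataFrames do.

```python
def pick_largest(*frames):
    """Return the frame with the highest row count; the earliest frame wins ties."""
    if not frames:
        raise ValueError("at least one dataframe is required")
    # count() may be expensive, so evaluate it exactly once per frame.
    counts = [frame.count() for frame in frames]
    # list.index returns the first position of the maximum,
    # so on a tie the earliest frame is kept (the ELSE df1 case).
    return frames[counts.index(max(counts))]
```

With the frames from the question this would be used as df3 = pick_largest(df1, df2).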

Spark scala join dataframe within a dataframe

I have a requirement where I need to join dataframes A and B and calculate a column and use that calculated value in another join between the same 2 dataframes with different Join conditions.
e.g.:
DF_Combined = A_DF.join(B_DF,'Join-Condition',"left_outer").withColumn(col1,'value')
After doing the above, I need to do the same join but use the value calculated in the previous join.
DF_Final = A_DF.join(B_DF,'New join condition',"left_outer").withColumn(col2, DF_Combined.col1*vol1*10)
When I try to do this I get a Cartesian product issue.
You can't use a column that is not present in the dataframe. When you do A_DF.join(B_DF,... the resulting dataframe only has columns from A_DF and B_DF. If you want to use the new column, you need to start from DF_Combined.
From your question I believe you don't need another join, and you have 2 possible options:
1. Calculate vol1*10 at the point where you do the first join.
2. After the join, do DF_Combined.withColumn....
But please remember that withColumn(name, expr) creates a column called name whose value is the result of expr, so .withColumn(DF_Combined.col1, vol1*10) does not make sense.

Spark: group only part of the rows in a DataFrame

From a given DataFrame, I'd like to group only a few rows together, and keep the other rows in the same dataframe.
My current solution is:
val aggregated = mydf.filter(col("check").equalTo("do_aggregate")).groupBy(...).agg()
val finalDF = aggregated.unionByName(mydf.filter(col("check").notEqual("do_aggregate")))
However, I'd like to find a more elegant and performant way.
Use a derived column to group by, depending on the check.
mydf.groupBy(when(col("check").equalTo("do_aggregate"), ...).otherwise(monotonically_increasing_id())).agg(...)
If you have a unique key in the dataframe, use that instead of monotonically_increasing_id.

Multiple data inserts in same dataframe

We have customer data in a Hive table and sales data in another Hive table, which holds terabytes of data. We are trying to pull the sales data for multiple customers and save it to a file.
What we tried so far:
We tried a left outer join between the customer and sales tables, but because of the huge amount of sales data it is not working.
val data = customer.join(sales,"customer.id" = "sales.customerID",leftouter)
So the alternative is to pull the data from the sales table for a specific list of customer regions and check whether each region's data contains the customer data. If it does, we save it in another dataframe, and keep loading data into that same dataframe for all the regions.
My question is whether inserting data into the same dataframe multiple times is supported in Spark.
If the sales dataframe is larger than the customer dataframe, then you could simply switch the order of the dataframes in the join operation.
val data = sales.join(customer, customer("id") === sales("customerID"), "left_outer")
You could also add a hint for Spark to broadcast the smaller dataframe, though I believe it needs to be smaller than 2GB:
import org.apache.spark.sql.functions.broadcast
val data = sales.join(broadcast(customer), customer("id") === sales("customerID"), "left_outer")
The other approach, iteratively merging dataframes, is also possible. For this purpose you can use the union method (Spark 2.0+) or unionAll (older versions). This method appends one dataframe to another. If you have a list of dataframes that you want to merge with each other, you can use union together with reduce:
val dataframes = Seq(df1, df2, df3)
dataframes.reduce(_ union _)

Add list as column to Dataframe in pyspark

I have a list of integers and a sqlcontext dataframe with the number of rows equal to the length of the list. I want to add the list as a column to this dataframe maintaining the order. I feel like this should be really simple but I can't find an elegant solution.
You cannot simply add a list as a dataframe column, since a list is a local object and a dataframe is distributed. You can try one of the following approaches:
convert the dataframe to a local collection with collect() or toLocalIterator() and, for each row, add the corresponding value from the list, OR
convert the list to a dataframe, adding an extra column (with keys from the dataframe), and then join the two
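The first approach can be sketched in plain Python as below. This is only an illustration under some assumptions: the rows fit in driver memory, the dataframe has a deterministic row order (for example, it was sorted before collecting), and the helper name append_column is made up for this example. Rows returned by collect() behave like tuples, which is what the helper relies on.

```python
def append_column(rows, values):
    """Pair each local row with the list value at the same position."""
    # Materialise the rows so this also works with an iterator
    # such as the one returned by toLocalIterator().
    rows = [tuple(row) for row in rows]
    if len(rows) != len(values):
        raise ValueError("the number of rows and list values must match")
    # Append the positional value as one extra trailing field per row.
    return [row + (value,) for row, value in zip(rows, values)]
```

The result can then be turned back into a dataframe, for instance with spark.createDataFrame(append_column(df.collect(), my_list), df.columns + ["new_col"]), where spark, my_list and new_col are placeholder names.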