Drop list of Column from a single dataframe in spark - scala

I have a Dataframe resulting from a join of two Dataframes: df1 and df2 into df3. All the columns found in df2 are also in df1, but their contents differ. I'd like to remove all the df1 columns which names are in df2.columns from the join. Would there be a way to do this without using a var?
Currently I've done this
var ret = df3
df2.columns.foreach(coln => ret = ret.drop(df2(coln)))
but what I really want is just a shortcut for
df3.drop(df1(df2.columns(1))).drop(df1(df2.columns(2)))....
without using a var.
Passing a list of columns is not an option, don't know if it's because I'm using spark 2.2
EDIT:
Important note: I don't know in advance the columns of df1 and df2

This is possible to achieve while you are performing the join itself. Please try the below code
val resultDf=df1.alias("frstdf").join(broadcast(df2).alias("scndf"), $"frstdf.col1" === $"scndf.col1", "left_outer").selectExpr("scndf.col1","scndf.col2"...)//.selectExpr("scndf.*")
This would only contain the columns from the second data frame. Hope this helps

A shortcut would be:
val ret = df2.columns.foldLeft(df3)((acc,coln) => acc.drop(df2(coln)))
I would suggest to remove the columns before the join. Alternatively, select only the columns from df3 which come from df2:
val ret = df3.select(df2.columns.map(col):_*)

Related

How to drop specific column and then select all columns from spark dataframe

I have a scenario here - Have 30 columns in one dataframe, need to drop specific column and select remaining columns and put it to another dataframe. How can I achieve this? I tried below.
val df1: DataFrame = df2.as(a).join( df3.as(b),col(a.key) === col(b.key), inner).drop(a.col1)
.select(" a.star ")
when I do show of df1, its still show col1. Any advise on resolving this.
drop requires a string without table alias, so you can try:
val df1 = df2.as("a")
.join(df3.as("b"), col("a.key") === col("b.key"), "inner")
.drop("col1")
.select("a.*")
Or instead of dropping the column, you can filter the columns to be selected:
val df1 = df2.as("a")
.join(df3.as("b"), col("a.key") === col("b.key"), "inner")
.select(df2.columns.filterNot(_ == "col1").map("a." + _): _*)
This really just seems like you need to use a "left_semi" join.
val df1 = df2.drop('col1).join(df3, df2("key") === df3("key"), "left_semi")
If key is an actual column you can simplify the syntax even further
val df1 = df2.drop('col1).join(df3, Seq("key"), "left_semi")
The best syntax depends on the details of what your real data looks like. If you need to refer to col1 in df2 specifically because there's ambiguity, then use df2("col1")
left_semi joins take all the columns from the left table for rows finding a match in the right table.

Spark scala copying dataframe column to new dataframe

I have an empty dataframe with schema already created.
I'm trying to add the columns to this dataframe from a new dataframe to the existing columns in a for loop.
k schema - |ID|DATE|REPORTID|SUBMITTEDDATE|
for(data <- 0 to range-1){
val c = df2.select(substring(col("value"), str(data)._2, str(data)._3).alias(str(data)._1)).toDF()
//c.show()
k = c.withColumn(str(data)._1, c(str(data)._1))
}
k.show()
But the k dataframe has just one column, but it should have all the 4 columns populated with values.
I think the last line in for loop is replacing exisitng columns in the dataframe.
Can somebody help me with this?
Thanks!!
Add your logic and conditions and create new dataframe
val dataframe2 = dataframe1.select("A","B",C)
Copying few columns of a dataframe to another one is not possible in spark.
Although there are few alternatives to attain the same
1. You need to join both the dataframe based on some join condition.
2. Convert bot the data frame to json and do RDD Union
val rdd = df1.toJSON.union(df2.toJSON)
val dfFinal = spark.read.json(rdd)

Rename column names when select from dataframe

I have 2 dataframes : df1 and df2 and I am left joining both of them on id column and saving it to another dataframe named df3. Below is the code that I am using, which works fine as expected.
val df3 = df1.alias("tab1").join(df2.alias("tab2"),Seq("id"),"left_outer").select("tab1.*","tab2.name","tab2.dept","tab2.descr");
I would like to rename the tab2.descr column to dept_full_description within the above statement.
I am aware that I could create a seq val like below and use toDF method
val columnsRenamed = Seq("id", "empl_name", "name","dept","dept_full_description") ;
df4 = df3.toDF(columnsRenamed: _*);
Is there any other way to to aliasing in the first statement itself. My end goal is not to list about 30-40 columns explicitly .
I'd rename before join:
df1.alias("tab1").join(
df2.withColumnRenamed("descr", "dept_full_description").alias("tab2"),
Seq("id"), "left_outer")

How to join two dataframes in Scala and select on few columns from the dataframes by their index?

I have to join two dataframes, which is very similar to the task given here Joining two DataFrames in Spark SQL and selecting columns of only one
However, I want to select only the second column from df2. In my task, I am going to use the join function for two dataframes within a reduce function for a list of dataframes. In this list of dataframes, the column names will be different. However, in each case I would want to keep the second column of df2.
I did not find anywhere how to select a dataframe's column by their numbered index. Any help is appreciated!
EDIT:
ANSWER
I figured out the solution. Here is one way to do this:
def joinDFs(df1: DataFrame, df2: DataFrame): DataFrame = {
val df2cols = df2.columns
val desiredDf2Col = df2cols(1) // the second column
val df3 = df1.as("df1").join(df2.as("df2"), $"df1.time" === $"df2.time")
.select($"df1.*",$"df2.$desiredDf2Col")
df3
}
And then I can apply this function in a reduce operation on a list of dataframes.
var listOfDFs: List[DataFrame] = List()
// Populate listOfDFs as you want here
val joinedDF = listOfDFs.reduceLeft((x, y) => {joinDFs(x, y)})
To select the second column in your dataframe you can simply do:
val df3 = df2.select(df2.columns(1))
This will first find the second column name and then select it.
If the join and select methods that you want to define in reduce function is similar to Joining two DataFrames in Spark SQL and selecting columns of only one Then you should do the following :
import org.apache.spark.sql.functions._
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id").select(Seq(1) map d2.columns map col: _*)
You will have to remember that the name of the second column i.e. Seq(1) should not be same as any of the dataframes column names.
You can select multiple columns as well but remember the bold note above
import org.apache.spark.sql.functions._
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id").select(Seq(1, 2) map d2.columns map col: _*)

check condition for two column in two different dataframes in spark

Suppose there is one column in dataframe and there is similar schema column in another dataframe. how to check check the values consisting in the columns are same or not without joining them as there is not common attribute.
DF1
serial_nm
abc
mnc
pqr
DF2
ser_nm
hgf
mnc
uio
pqr
lok
And i want third DF3 as output
DF3
mnc
pqr
I tried this
val DF3 = DF1.filter(DF1("serial_nm") === DF2("ser_nm"))
But its not working
Please Help
Thanks..!!
I believe you can use a join. Consider using it like this:
val DF3 = DF1.join(DF2, DF1("serial_nm") === DF2("ser_nm"))
or
val DF3 = DF1.join(DF2).where(DF1("serial_nm") === DF2("ser_nm"))
Both approaches are quivalent.
Note: To avoid problems with ambiguous columns, one option is to rename them before the join:
val df2_renamed = DF2
.withColumnRenamed("mnc", "df2_mnc")
.withColumnRenamed("pqr", "df2_pqr")