How to drop a specific column and then select all columns from a Spark dataframe - Scala

I have a scenario here: I have 30 columns in one dataframe, need to drop a specific column, select the remaining columns, and put them into another dataframe. How can I achieve this? I tried the following:
val df1: DataFrame = df2.as("a").join(df3.as("b"), col("a.key") === col("b.key"), "inner")
  .drop("a.col1")
  .select("a.*")
When I do a show on df1, it still shows col1. Any advice on resolving this?

drop requires a plain column name without the table alias, so you can try:
val df1 = df2.as("a")
  .join(df3.as("b"), col("a.key") === col("b.key"), "inner")
  .drop("col1")
  .select("a.*")
Or instead of dropping the column, you can filter the columns to be selected:
val df1 = df2.as("a")
  .join(df3.as("b"), col("a.key") === col("b.key"), "inner")
  // build Column objects: select cannot take a bare Array[String] expanded as varargs
  .select(df2.columns.filterNot(_ == "col1").map(c => col("a." + c)): _*)

This really just seems like you need to use a "left_semi" join.
val df1 = df2.drop("col1").join(df3, df2("key") === df3("key"), "left_semi")
If key is an actual shared column name, you can simplify the syntax even further:
val df1 = df2.drop("col1").join(df3, Seq("key"), "left_semi")
The best syntax depends on the details of what your real data looks like. If you need to refer to col1 in df2 specifically because there's ambiguity, then use df2("col1").
left_semi joins take all the columns from the left table for rows finding a match in the right table.
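For a concrete picture, here is a minimal runnable sketch with hypothetical toy data (assuming a SparkSession in scope named spark):
import spark.implicits._
val df2 = Seq((1, "a", 10), (2, "b", 20)).toDF("key", "col1", "col2")
val df3 = Seq((1, "x")).toDF("key", "other")
// left_semi keeps only df2's columns, and only the rows with a match in df3
df2.drop("col1").join(df3, Seq("key"), "left_semi").show()
// +---+----+
// |key|col2|
// +---+----+
// |  1|  10|
// +---+----+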

Related

List of columns for orderBy in spark dataframe

I have a list of variables that contains column names. I am trying to use that to call orderBy on a dataframe.
val l = List("COL1", "COL2")
df.orderBy(l.mkString(","))
But mkString combines the column names into one string, leading to this error:
org.apache.spark.sql.AnalysisException: cannot resolve '`COL1,COL2`' given input columns: [COL1, COL2, COL3, COL4];
How can I pass this list of strings as separate arguments so that it looks for "COL1", "COL2" instead of "COL1,COL2"?
Thanks,
You can call orderBy for a specific column:
import org.apache.spark.sql.functions._
df.orderBy(asc("COL1")) // df.orderBy(asc(l.headOption.getOrElse("COL1")))
// OR
df.orderBy(desc("COL1"))
If you want to sort by multiple columns, you can write something like this:
val l = List($"COL1", $"COL2".desc)
df.sort(l: _*)
Passing a single String argument tells Spark to sort the data frame using one column with the given name. There is a method that accepts multiple column names, and you can use it like this:
val l = List("COL1", "COL2")
df.orderBy(l.head, l.tail: _*)
If you care about the sort direction, use the Column version of orderBy instead:
val l = List($"COL1", $"COL2".desc)
df.orderBy(l: _*)
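If the column names only arrive as plain strings, you can also build the Column list dynamically. A small sketch, assuming (purely as an example) you want the last column sorted descending:
import org.apache.spark.sql.functions.col
val names = List("COL1", "COL2")
// turn each name into a Column, choosing the direction per column
val sortCols = names.init.map(col) :+ col(names.last).desc
df.orderBy(sortCols: _*)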

Drop list of Column from a single dataframe in spark

I have a Dataframe, df3, resulting from a join of two Dataframes, df1 and df2. All the columns found in df2 are also in df1, but their contents differ. I'd like to remove from the join all the df1 columns whose names are in df2.columns. Would there be a way to do this without using a var?
Currently I've done this
var ret = df3
df2.columns.foreach(coln => ret = ret.drop(df2(coln)))
but what I really want is just a shortcut for
df3.drop(df1(df2.columns(1))).drop(df1(df2.columns(2)))....
without using a var.
Passing a list of columns is not an option; I don't know if that's because I'm using Spark 2.2.
EDIT:
Important note: I don't know in advance the columns of df1 and df2
This is possible to achieve while performing the join itself. Please try the code below:
val resultDf = df1.alias("frstdf")
  .join(broadcast(df2).alias("scndf"), $"frstdf.col1" === $"scndf.col1", "left_outer")
  .selectExpr("scndf.col1", "scndf.col2"...) // or .selectExpr("scndf.*")
This would only contain the columns from the second data frame. Hope this helps.
A shortcut would be:
val ret = df2.columns.foldLeft(df3)((acc, coln) => acc.drop(df2(coln)))
I would suggest removing the columns before the join (see the sketch below). Alternatively, select only the columns of df3 which come from df2:
val ret = df3.select(df2.columns.map(col): _*)
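For the removing-before-the-join route, here is a minimal sketch, assuming a hypothetical shared join key named "id" that should be kept:
// drop df1's copies of the overlapping columns (except the join key) up front,
// so the joined frame never contains duplicate column names
val overlapping = df2.columns.toSet - "id"
val df3 = df1.drop(overlapping.toSeq: _*).join(df2, Seq("id"))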

Rename column names when select from dataframe

I have two dataframes, df1 and df2, and I am left joining them on the id column, saving the result to another dataframe named df3. Below is the code that I am using, which works as expected.
val df3 = df1.alias("tab1").join(df2.alias("tab2"), Seq("id"), "left_outer")
  .select("tab1.*", "tab2.name", "tab2.dept", "tab2.descr")
I would like to rename the tab2.descr column to dept_full_description within the above statement.
I am aware that I could create a Seq val like below and use the toDF method:
val columnsRenamed = Seq("id", "empl_name", "name", "dept", "dept_full_description")
val df4 = df3.toDF(columnsRenamed: _*)
Is there any other way to do the aliasing in the first statement itself? My end goal is to avoid explicitly listing about 30-40 columns.
I'd rename before the join:
df1.alias("tab1").join(
  df2.withColumnRenamed("descr", "dept_full_description").alias("tab2"),
  Seq("id"), "left_outer")

Spark Scala Dynamic column selection from DataFrame

I have a DataFrame which has different types of columns. Among those columns, I need to retrieve specific ones from that DataFrame.
A hard-coded DataFrame select statement would look like this:
val logRegrDF = myDF.select(myDF("LEBEL_COLUMN").as("label"),
  col("FEATURE_COL1"), col("FEATURE_COL2"), col("FEATURE_COL3"), col("FEATURE_COL4"))
Where LEBEL_COLUMN and FEATURE_COLs will be dynamic.
I have an Array or Seq of those FEATURE columns, like this:
val FEATURE_COL_ARR = Array("FEATURE_COL1","FEATURE_COL2","FEATURE_COL3","FEATURE_COL4")
I need to use this Array of columns in the 2nd part of that SELECT statement.
In the select, the 1st column will be the one LEBEL_COLUMN and the rest will be the dynamic list.
Can you please help me make the select statement work in Scala?
Note:
The sample code given below works, but I need to add the column array to the 2nd part of the SELECT:
val colNames = FEATURE_COL_ARR.map(name => col(name))
val logRegrDF = myDF.select(colNames:_*) // it is not the requirement
I am thinking the code for the 2nd part will be like this, but it is not working:
val logRegrDF = myDF.select(myDF("LEBEL_COLUMN").as("label"), colNames:_*)
If I understand your question, I hope this is what you are looking for:
val allColumnsArr = "LEBEL_COLUMN" +: FEATURE_COL_ARR
result.select("LEBEL_COLUMN", allColumnsArr: _*)
.withColumnRenamed("LEBEL_COLUMN", "label")
Hope this helps!
Thanks a lot @Shankar.
Though your suggestion as given does not work, I got an idea from it and solved the issue this way:
val allColumnsArr = "LEBEL_COLUMN" +: FEATURE_COL_ARR
val colNames = allColumnsArr.map(name => col(name))
myDF.select(colNames:_*).withColumnRenamed("LEBEL_COLUMN", "label")
It also works this way, without creating Column objects first:
result.select(LEBEL_COLUMN, FEATURE_COL_ARR: _*).withColumnRenamed(LEBEL_COLUMN, "label")
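For reference, the attempt in the question fails because a varargs select cannot mix a single Column argument with an expanded array. Prepending the aliased label Column with +: sidesteps that in a single call; a sketch:
import org.apache.spark.sql.functions.col
// prepend the aliased label Column to the mapped feature Columns, then expand
val logRegrDF = myDF.select(myDF("LEBEL_COLUMN").as("label") +: FEATURE_COL_ARR.map(col): _*)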

How to join two dataframes in Scala and select a few columns from the dataframes by their index?

I have to join two dataframes, which is very similar to the task given here Joining two DataFrames in Spark SQL and selecting columns of only one
However, I want to select only the second column from df2. In my task, I am going to use the join function for two dataframes within a reduce function for a list of dataframes. In this list of dataframes, the column names will be different. However, in each case I would want to keep the second column of df2.
I did not find anywhere how to select a dataframe's column by its numbered index. Any help is appreciated!
EDIT:
ANSWER
I figured out the solution. Here is one way to do this:
def joinDFs(df1: DataFrame, df2: DataFrame): DataFrame = {
  val df2cols = df2.columns
  val desiredDf2Col = df2cols(1) // the second column
  val df3 = df1.as("df1").join(df2.as("df2"), $"df1.time" === $"df2.time")
    .select($"df1.*", $"df2.$desiredDf2Col")
  df3
}
And then I can apply this function in a reduce operation on a list of dataframes.
var listOfDFs: List[DataFrame] = List()
// Populate listOfDFs as you want here
val joinedDF = listOfDFs.reduceLeft((x, y) => {joinDFs(x, y)})
To select the second column in your dataframe you can simply do:
val df3 = df2.select(df2.columns(1))
This will first find the second column name and then select it.
If the join and select that you want to define in the reduce function are similar to Joining two DataFrames in Spark SQL and selecting columns of only one, then you should do the following:
import org.apache.spark.sql.functions._
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id").select(Seq(1) map d2.columns map col: _*)
You will have to remember that the column selected here (index 1 in Seq(1)) should not have the same name as any of d1's columns, otherwise the unqualified reference will be ambiguous.
You can select multiple columns as well, but remember the note above:
import org.apache.spark.sql.functions._
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id").select(Seq(1, 2) map d2.columns map col: _*)