Spark Dataframe Join - Duplicate column (non-joined column) - scala

I have two DataFrames, df1 (Employee table) and df2 (Department table), with the following schemas:
df1.columns
// Array(id, name, dept_id)
and
df2.columns
// Array(id, name)
After I join these two tables on df1.dept_id and df2.id:
val joinedData = df1.join(df2, df1("dept_id") === df2("id"))
joinedData.columns
// Array(id, name, dept_id, id, name)
While saving it to a file,
joinedData.write.csv("<path>")
it gives this error:
org.apache.spark.sql.AnalysisException: Duplicate column(s) : "name", "id" found, cannot save to file.;
I read about passing a Seq of Strings to join to avoid column duplication, but that only applies to the columns the join is performed on. I need similar functionality for the non-joined columns.
Is there a direct way to embed the table name in the repeated columns so that the result can be saved?
I came up with a solution that matches the columns of both DataFrames and renames the duplicates by appending the table name to the column name, but is there a direct way?
Note: This will be generic code where only the join columns are known in advance. The rest of the columns will be known only at runtime, so we can't rename columns by hard-coding them.

I would just keep all columns by making sure they have different names, e.g. by prepending an identifier to the column names:
import org.apache.spark.sql.functions.col
import spark.implicits._

val df1Cols = df1.columns
val df2Cols = df2.columns

// prepend prefixes to the column names
val df1pf = df1.select(df1Cols.map(n => col(n).as("df1_" + n)): _*)
val df2pf = df2.select(df2Cols.map(n => col(n).as("df2_" + n)): _*)

val joined = df1pf.join(df2pf, $"df1_dept_id" === $"df2_id")
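With every column now carrying a unique prefix, the joined result can be written without hitting the duplicate-column error (a short sketch; the output path is only a placeholder):
joined.columns
// Array(df1_id, df1_name, df1_dept_id, df2_id, df2_name)
joined.write.csv("/tmp/joined_output") // placeholder path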

After further research and getting the views of other developers, it's clear that there is no direct way. One way is to change all the column names as @Raphael suggested, but I solved my problem by renaming only the duplicate columns:
val commonCols = df1.columns.intersect(df2.columns)
val newDf2 = changeColumnsName(df2, commonCols, "df2")
where changeColumnsName is defined as:
import scala.annotation.tailrec

@tailrec
def changeColumnsName(dataFrame: DataFrame, columns: Array[String], tableName: String): DataFrame = {
  if (columns.isEmpty)
    dataFrame
  else
    changeColumnsName(dataFrame.withColumnRenamed(columns.head, tableName + "_" + columns.head), columns.tail, tableName)
}
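The same renaming can also be written as a foldLeft over the common columns, which avoids the hand-written recursion (an equivalent sketch):
def changeColumnsNameFold(dataFrame: DataFrame, columns: Array[String], tableName: String): DataFrame =
  columns.foldLeft(dataFrame)((df, c) => df.withColumnRenamed(c, tableName + "_" + c))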
Now, performing the join:
val joinedData = df1.join(newDf2, df1("dept_id") === newDf2("df2_id"))
joinedData.columns
// Array(id, name, dept_id, df2_id, df2_name)
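Since df2_id now only duplicates the values of dept_id, it can optionally be dropped before saving (the output path below is only a placeholder):
joinedData.drop("df2_id").write.csv("/tmp/employees_with_departments") // placeholder path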

You could try using an alias for the DataFrames:
import spark.implicits._

df1.as("df1")
  .join(df2.alias("df2"), df1("dept_id") === df2("id"))
  .select($"df1.*", $"df2.*")
  .show()

val llist = Seq(("bob", "2015-01-13", 4), ("alice", "2015-04-23", 10))
val left = llist.toDF("name", "date", "duration")
val right = Seq(("alice", 100), ("bob", 23)).toDF("name", "upload")
val df = left.join(right, left.col("name") === right.col("name"))
display(df) // display is a Databricks notebook helper; df.show() works anywhere
The joined result has two name columns. In SparkR, the duplicate can be dropped via the originating DataFrame's column reference:
head(drop(join(left, right, left$name == right$name), left$name))
https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicated-column.html
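A common way to avoid the duplicated name column in the first place is to join on a Seq of column names, which keeps a single copy of the join key (a minimal sketch using the left and right DataFrames above):
val joinedOnce = left.join(right, Seq("name"))
joinedOnce.columns
// Array(name, date, duration, upload)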

Related

How to drop specific column and then select all columns from spark dataframe

I have a scenario here: I have 30 columns in one dataframe, need to drop a specific column, select the remaining columns, and put them into another dataframe. How can I achieve this? I tried the below.
val df1: DataFrame = df2.as("a")
  .join(df3.as("b"), col("a.key") === col("b.key"), "inner")
  .drop("a.col1")
  .select("a.*")
When I do a show on df1, it still shows col1. Any advice on resolving this?
drop requires a string without a table alias, so you can try:
val df1 = df2.as("a")
.join(df3.as("b"), col("a.key") === col("b.key"), "inner")
.drop("col1")
.select("a.*")
Or instead of dropping the column, you can filter the columns to be selected:
val df1 = df2.as("a")
.join(df3.as("b"), col("a.key") === col("b.key"), "inner")
.select(df2.columns.filterNot(_ == "col1").map("a." + _): _*)
This really just seems like you need to use a "left_semi" join.
val df1 = df2.drop("col1").join(df3, df2("key") === df3("key"), "left_semi")
If the key column has the same name in both DataFrames, you can simplify the syntax even further:
val df1 = df2.drop("col1").join(df3, Seq("key"), "left_semi")
The best syntax depends on the details of what your real data looks like. If you need to refer to col1 in df2 specifically because there's ambiguity, then use df2("col1").
A left_semi join keeps all the columns from the left table, for the rows that find a match in the right table.
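A tiny worked example of that behaviour (illustrative data; assumes spark.implicits._ is in scope for toDF):
val employees = Seq((1, "alice", 10), (2, "bob", 20)).toDF("id", "name", "key")
val departments = Seq((10, "eng")).toDF("key", "dept")
// Only employee rows whose key appears in departments survive,
// and no columns from departments are carried over
employees.join(departments, Seq("key"), "left_semi").show()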

How to join Dataframe with one-column dataset without using column names in dataset

Let's consider:
val columnNames: Seq[String] = Seq[String]("col_1") // column present in DataFrame df
df.join(usingColumns = columnNames, right = ds) // ds is some dataset that has exactly one column
// The problem is that I don't know the name of this column; I only know that
// df.col("col_1") and ds.col(???) have the same type.
Is it possible to do this join?
You can use something like:
package utils

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object Extensions {
  implicit class DataFrameExtensions(df: DataFrame) {
    def selecti(indices: Int*): DataFrame = {
      val cols = df.columns
      df.select(indices.map(i => col(cols(i))): _*)
    }
  }
}
Then use this to select columns by number:
import utils.Extensions._
df.selecti(1,2,3)
Assuming that "col_1" from the first dataframe will always join to the single column in the ds dataframe, you can just rename that single column in ds, as below. Then your join using names only needs to reference "col_1".
// set the name of the column in ds to col_1
val ds2 = ds.toDF("col_1")
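A minimal sketch of the join that follows under that assumption:
val joined = df.join(ds2, Seq("col_1")) // "col_1" appears only once in the joined result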
You can change the column name of the dataset to col_1:
val result = df.join(ds.withColumnRenamed(ds.columns(0), "col_1"), Seq("col_1"), "right")

Check every column in a spark dataframe has a certain value

Can we check whether every column in a Spark dataframe contains a certain string (for example "Y") using Spark SQL or Scala?
I have tried the following but don't think it is working properly.
df.select(df.col("*")).filter("'*' =='Y'")
Thanks,
Sai
You can do something like this to keep the rows where all columns contain 'Y':
// Get all columns
val columns: Array[String] = df.columns
// For each column, keep the rows where that column is 'Y'
val seqDfs: Seq[DataFrame] = columns.map(name => df.filter(s"$name == 'Y'")).toSeq
// Intersect all the dataframes so that only rows passing every filter remain
val output: DataFrame = seqDfs.reduce(_ intersect _)
You can use the DataFrame method columns to get all the column names
val columnNames: Array[String] = df.columns
and then add all the filters in a loop:
var filteredDf = df
for (name <- columnNames) {
  filteredDf = filteredDf.filter(s"$name == 'Y'")
}
Or you can build a SQL query using the same approach.
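For completeness, the same chain of conditions can be collapsed into a single filter expression (a minimal sketch, assuming the target value is "Y"):
import org.apache.spark.sql.functions.col
// Build one predicate requiring every column to equal "Y", then filter once
val allY = df.columns.map(name => col(name) === "Y").reduce(_ && _)
val rowsWhereAllY = df.filter(allY)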
If you instead want to keep every row in which any of the columns is equal to 1 (or anything else), you can dynamically create a query like this (PySpark):
from pyspark.sql.functions import col, lit

conditions = [col(c) == lit(1) for c in df.columns]
query = conditions[0]
for c in conditions[1:]:
    query |= c
df.filter(query).show()
It's a bit verbose, but it is very clear what is happening. A more elegant version would be:
from functools import reduce

res = df.filter(reduce(lambda x, y: x | y, (col(c) == lit(1) for c in df.columns)))
res.show()
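Since this thread is in Scala, a sketch of the equivalent any-column check in Scala:
import org.apache.spark.sql.functions.{col, lit}
// True if at least one column equals 1
val anyEqualsOne = df.columns.map(c => col(c) === lit(1)).reduce(_ || _)
df.filter(anyEqualsOne).show()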

How to join two dataframes in Scala and select on few columns from the dataframes by their index?

I have to join two dataframes, which is very similar to the task given here Joining two DataFrames in Spark SQL and selecting columns of only one
However, I want to select only the second column from df2. In my task, I am going to use the join function for two dataframes within a reduce function over a list of dataframes. In this list of dataframes, the column names will be different, but in each case I want to keep the second column of df2.
I did not find anywhere how to select a dataframe's columns by their numeric index. Any help is appreciated!
EDIT:
ANSWER
I figured out the solution. Here is one way to do this:
def joinDFs(df1: DataFrame, df2: DataFrame): DataFrame = {
  val df2cols = df2.columns
  val desiredDf2Col = df2cols(1) // the second column
  val df3 = df1.as("df1").join(df2.as("df2"), $"df1.time" === $"df2.time")
    .select($"df1.*", $"df2.$desiredDf2Col")
  df3
}
And then I can apply this function in a reduce operation on a list of dataframes.
var listOfDFs: List[DataFrame] = List()
// Populate listOfDFs as you want here
val joinedDF = listOfDFs.reduceLeft((x, y) => {joinDFs(x, y)})
To select the second column in your dataframe you can simply do:
val df3 = df2.select(df2.columns(1))
This will first find the second column name and then select it.
If the join and select that you want to apply in the reduce function are similar to Joining two DataFrames in Spark SQL and selecting columns of only one, then you should do the following:
import org.apache.spark.sql.functions._
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id").select(Seq(1) map d2.columns map col: _*)
You will have to remember that the name of the selected d2 column (index 1 in Seq(1)) should not be the same as any of d1's column names, otherwise the reference after the join is ambiguous.
You can select multiple columns as well, but remember the note above:
import org.apache.spark.sql.functions._
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id").select(Seq(1, 2) map d2.columns map col: _*)

Spark Dataframe select based on column index

How do I select all the columns of a dataframe that have certain indexes in Scala?
For example, if a dataframe has 100 columns and I want to extract only certain columns (say 10, 12, 13, 14, 15), how do I do that?
The line below selects all columns from dataframe df whose names are mentioned in the Array colNames:
df = df.select(colNames.head, colNames.tail: _*)
If there is a similar colNos array, which has
colNos = Array(10, 20, 25, 45)
how do I transform the above df.select to fetch only the columns at those specific indexes?
You can map over columns:
import org.apache.spark.sql.functions.col
df.select(colNos map df.columns map col: _*)
or:
df.select(colNos map (df.columns andThen col): _*)
or:
df.select(colNos map (col _ compose df.columns): _*)
All the methods shown above are equivalent and don't impose a performance penalty. The following mapping:
colNos map df.columns
is just a local Array access (constant-time access for each index), and choosing between the String-based and Column-based variants of select doesn't affect the execution plan:
val df = Seq((1, 2, 3, 4, 5, 6)).toDF
val colNos = Seq(0, 3, 5)
df.select(colNos map df.columns map col: _*).explain
== Physical Plan ==
LocalTableScan [_1#46, _4#49, _6#51]
df.select("_1", "_4", "_6").explain
== Physical Plan ==
LocalTableScan [_1#46, _4#49, _6#51]
@user6910411's answer above works like a charm, and the number of tasks / the logical plan is similar to my approach below, BUT my approach is a bit faster.
So,
I would suggest going with column names rather than column numbers. Column names are much safer and much lighter than using numbers. You can use the following solution:
val colNames = Seq("col1", "col2", ...... "col99", "col100")
val selectColNames = Seq("col1", "col3", .... selected column names ...)
val selectCols = selectColNames.map(name => df.col(name))
df = df.select(selectCols: _*)
If you are hesitant to write out all 100 column names, there is a shortcut too:
val colNames = df.schema.fieldNames
Example: grab the first 14 columns of a Spark Dataframe by index using Scala.
import org.apache.spark.sql.functions.col
// Gives an array of names by index (first 14 cols, for example)
val sliceCols = df.columns.slice(0, 14)
// Maps the names to columns & selects them from the dataframe
val subset_df = df.select(sliceCols.map(name => col(name)): _*)
You cannot simply do this (as I tried and failed):
// Gives an array of names by index (first 14 cols, for example)
val sliceCols = df.columns.slice(0, 14)
// Tries to pass the Array[String] straight to select (this does not compile)
val subset_df = df.select(sliceCols)
The reason is that select does not accept an Array[String] directly; you have to convert the names into org.apache.spark.sql.Column values for the call to work.
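Alternatively, the names can stay as strings by using the (String, String*) overload of select (a minimal sketch):
val subset_df = df.select(sliceCols.head, sliceCols.tail: _*)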
Or wrap it in a function using currying (high five to my colleague for this):
// Subsets a Dataframe to the columns between the beg_val and end_val indexes
def subset_frame(beg_val: Int = 0, end_val: Int)(df: DataFrame): DataFrame = {
  val sliceCols = df.columns.slice(beg_val, end_val)
  df.select(sliceCols.map(name => col(name)): _*)
}
// Get first 25 columns as subsetted dataframe
val subset_df:DataFrame = df_.transform(subset_frame(0, 25))