Spark Scala: select column names from another DataFrame

There are two JSON files. The first has more columns and is always a superset of the second.
val df1 = spark.read.json(sqoopJson)
val df2 = spark.read.json(kafkaJson)
Except operation:
I would like to apply an except operation to df1 and df2, but df1 has 10 columns and df2 has only 8.
If I manually drop the 2 extra columns from df1, except works. But I have 50+ tables/JSON files and need to run EXCEPT on all 50 pairs.
Question:
How can I select from df1 only the 8 columns that exist in df2 and create a new df3? df3 would then hold df1's data restricted to the columns of df2.

To select from df1 only the 8 columns available in df2 and create a new df3:
import org.apache.spark.sql.functions.col
// Get the 8 column names from df2 as Column objects
val columns = df2.schema.fieldNames.map(col(_))
// Select only those columns from df1
val df3 = df1.select(columns: _*)
Hope this helps!
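Since the question mentions 50+ table pairs, the select-then-except step could be wrapped in a small helper. The following is a minimal sketch rather than a definitive implementation: the helper name exceptOnSharedColumns, the local SparkSession, and the sample rows are assumptions made for illustration, and it assumes top-level column names that contain no dots.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Assumption: a local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("except-sketch").getOrCreate()
import spark.implicits._

// Hypothetical helper: project `wide` onto the columns of `narrow`, then diff.
def exceptOnSharedColumns(wide: DataFrame, narrow: DataFrame): DataFrame = {
  val columns = narrow.columns.map(col(_))
  wide.select(columns: _*).except(narrow)
}

// Made-up sample data standing in for the two JSON sources.
val df1 = Seq((1, "a", "extra1"), (2, "b", "extra2")).toDF("id", "name", "note")
val df2 = Seq((1, "a")).toDF("id", "name")

// Rows of df1 (restricted to df2's columns) that are not present in df2.
val diff = exceptOnSharedColumns(df1, df2)
diff.show()
```

except compares columns by position, so selecting in df2's column order keeps the two sides aligned.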

Related

How to concat two DataFrames in PySpark when one has records and the other is empty?

I need help concatenating two DataFrames where one is empty and the other has data. Could you please show how to do this in PySpark?
In pandas I am using the following:
Suppose df2 is empty and df1 has some records.
df2 = pd.concat([df2, df1])
But how do I perform this operation in PySpark?
df1:
+--------------------+----------+---------+
|         Programname|Projectnum|     Drug|
+--------------------+----------+---------+
|Non-Oncology Phar...|SR0480-000|Invokamet|
+--------------------+----------+---------+
df2:
++
||
++
++
I tried many options, and one worked for me.
To concatenate df1 onto df2, I first create df2 with the same schema as df1, then use union for the concatenation.
df2 = sqlContext.createDataFrame(sc.emptyRDD(), df1.schema)
df2 = df2.union(df1)
result:
df2:
+--------------------+----------+---------+
|         Programname|Projectnum|     Drug|
+--------------------+----------+---------+
|Non-Oncology Phar...|SR0480-000|Invokamet|
+--------------------+----------+---------+
You can use the union method, provided both DataFrames have the same number of columns:
df = df1.union(df2)

Joining two dataframes without a common column

I have two DataFrames with different columns, and I need to join them. Please refer to the example below.
val df1 has
Customer_name
Customer_phone
Customer_age
val df2 has
Order_name
Order_ID
These two DataFrames don't have any common column. The number of rows and the number of columns also differ. I tried to insert a new dummy column to hold a row_index value, as below:
val dfr = df1.withColumn("row_index", monotonically_increasing_id())
But I am using Spark 2, and the monotonically_increasing_id method is not supported. Is there any way to join the two DataFrames so that I can write the values of both into a single sheet of an Excel file?
For example
val df1:
Customer_name Customer_phone Customer_age
karti 9685684551 24
raja 8595456552 22
val df2:
Order_name Order_ID
watch 1
cattoy 2
My final excel sheet should be like this:
Customer_name Customer_phone Customer_age Order_name Order_ID
karti 9685684551 24 watch 1
raja 8595456552 22 cattoy 2
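Add an index column to both DataFrames using the code below:
// monotonicallyIncreasingId is deprecated since Spark 2.0; use monotonically_increasing_id()
import org.apache.spark.sql.functions.{col, monotonically_increasing_id}
val df1Indexed = df1.withColumn("id1", monotonically_increasing_id())
val df2Indexed = df2.withColumn("id2", monotonically_increasing_id())
Then join the two DataFrames on the index columns and drop them:
df1Indexed.join(df2Indexed, col("id1") === col("id2"), "inner")
  .drop("id1", "id2")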
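monotonically_increasing_id() is increasing and unique but not consecutive, so the IDs generated for two different DataFrames are not guaranteed to line up.
Instead, you can convert each DataFrame to an RDD, use zipWithIndex, and rebuild the DataFrame with the original schema plus an index column.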
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}
import spark.implicits._
val df1 = Seq(
  ("karti", "9685684551", 24),
  ("raja", "8595456552", 22)
).toDF("Customer_name", "Customer_phone", "Customer_age")
val df2 = Seq(
  ("watch", 1),
  ("cattoy", 2)
).toDF("Order_name", "Order_ID")
val df11 = spark.sqlContext.createDataFrame(
  df1.rdd.zipWithIndex.map {
    case (row, index) => Row.fromSeq(row.toSeq :+ index)
  },
  // Schema: the original fields plus a non-nullable Long index column
  StructType(df1.schema.fields :+ StructField("index", LongType, false))
)
val df22 = spark.sqlContext.createDataFrame(
  df2.rdd.zipWithIndex.map {
    case (row, index) => Row.fromSeq(row.toSeq :+ index)
  },
  // Schema: the original fields plus a non-nullable Long index column
  StructType(df2.schema.fields :+ StructField("index", LongType, false))
)
Now join the indexed DataFrames on the index column and drop it:
df11.join(df22, Seq("index")).drop("index")
Output:
+-------------+--------------+------------+----------+--------+
|Customer_name|Customer_phone|Customer_age|Order_name|Order_ID|
+-------------+--------------+------------+----------+--------+
|karti        |9685684551    |24          |watch     |1       |
|raja         |8595456552    |22          |cattoy    |2       |
+-------------+--------------+------------+----------+--------+
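A consecutive index can also be produced without dropping to the RDD API, by applying row_number over a window ordered by monotonically_increasing_id(). This is a swapped-in alternative to the zipWithIndex approach above, sketched under assumptions: the local SparkSession and the names byArrival, customersIdx, ordersIdx, and combined are made up for illustration. A window without partitioning moves all rows to a single partition, so this is probably only reasonable for small DataFrames.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

// Assumption: a local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("index-join-sketch").getOrCreate()
import spark.implicits._

val customers = Seq(
  ("karti", "9685684551", 24),
  ("raja", "8595456552", 22)
).toDF("Customer_name", "Customer_phone", "Customer_age")
val orders = Seq(("watch", 1), ("cattoy", 2)).toDF("Order_name", "Order_ID")

// Order rows by the increasing (but non-consecutive) id, then number them 1, 2, 3, ...
val byArrival = Window.orderBy(monotonically_increasing_id())
val customersIdx = customers.withColumn("index", row_number().over(byArrival))
val ordersIdx = orders.withColumn("index", row_number().over(byArrival))

// Pair rows by position, restore the original order, and drop the helper column.
val combined = customersIdx.join(ordersIdx, Seq("index")).orderBy("index").drop("index")
combined.show(false)
```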

Dynamically select multiple columns while joining different Dataframe in Scala Spark

I have two Spark DataFrames, df1 and df2. Is there a way to select the output columns dynamically while joining them? The definition below outputs all columns from df1 and df2 in the case of an inner join.
import org.apache.spark.sql.{Column, DataFrame}
def joinDF(df1: DataFrame, df2: DataFrame, joinExprs: Column, joinType: String): DataFrame = {
  val dfJoinResult = df1.join(df2, joinExprs, joinType)
  dfJoinResult
  //.select()
}
Input data:
val df1 = List(("1","new","current"), ("2","closed","saving"), ("3","blocked","credit")).toDF("id","type","account")
val df2 = List(("1","7"), ("2","5"), ("5","8")).toDF("id","value")
Expected result:
val dfJoinResult = df1
.join(df2, df1("id") === df2("id"), "inner")
.select(df1("type"), df1("account"), df2("value"))
dfJoinResult.schema:
StructType(StructField(type,StringType,true),
StructField(account,StringType,true),
StructField(value,StringType,true))
I have looked at options like df.select(cols.head, cols.tail: _*), but that does not allow selecting columns from both DataFrames.
Is there a way to pass the select columns dynamically to my def, along with the DataFrame each one should be selected from? I'm using Spark 2.2.0.
It is possible to pass the select expression as a Seq[Column] to the method:
def joinDF(df1: DataFrame, df2: DataFrame, joinExpr: Column, joinType: String, selectExpr: Seq[Column]): DataFrame = {
  val dfJoinResult = df1.join(df2, joinExpr, joinType)
  dfJoinResult.select(selectExpr: _*)
}
To call the method use:
val joinExpr = df1.col("id") === df2.col("id")
val selectExpr = Seq(df1.col("type"), df1.col("account"), df2.col("value"))
val testDf = joinDF(df1, df2, joinExpr, "inner", selectExpr)
This will give the desired result:
+------+-------+-----+
|  type|account|value|
+------+-------+-----+
|   new|current|    7|
|closed| saving|    5|
+------+-------+-----+
In the selectExpr above, it is necessary to specify which dataframe the columns are coming from. However, this can be further simplified if the following assumptions are true:
The columns to join on have the same name in both dataframes
The columns to be selected have unique names (the other DataFrame does not have a column with the same name)
In this case, the joinExpr: Column can be changed to joinExpr: Seq[String] and selectExpr: Seq[Column] to selectExpr: Seq[String]:
def joinDF(df1: DataFrame, df2: DataFrame, joinExpr: Seq[String], joinType: String, selectExpr: Seq[String]): DataFrame = {
  val dfJoinResult = df1.join(df2, joinExpr, joinType)
  dfJoinResult.select(selectExpr.head, selectExpr.tail: _*)
}
Calling the method now looks cleaner:
val joinExpr = Seq("id")
val selectExpr = Seq("type", "account", "value")
val testDf = joinDF(df1, df2, joinExpr, "inner", selectExpr)
Note: When the join is performed using a Seq[String], the columns of the resulting DataFrame differ from those produced by a join expression: each join column appears only once. If other columns share a name across the two DataFrames, there is no way to select them separately by name afterwards.
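The difference in the resulting columns can be seen by comparing the two join styles side by side. This is a small sketch using the question's sample data; the local SparkSession and the names byExpr and byName are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("join-columns-sketch").getOrCreate()
import spark.implicits._

val df1 = List(("1", "new", "current"), ("2", "closed", "saving"), ("3", "blocked", "credit"))
  .toDF("id", "type", "account")
val df2 = List(("1", "7"), ("2", "5"), ("5", "8")).toDF("id", "value")

// Join on an expression: both "id" columns are kept in the result.
val byExpr = df1.join(df2, df1("id") === df2("id"), "inner")
// Join on column names: the shared "id" column appears only once.
val byName = df1.join(df2, Seq("id"), "inner")

println(byExpr.columns.mkString(", "))
println(byName.columns.mkString(", "))
```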
A slight variation on the solution above is to select the required columns from each DataFrame before performing the join. This may reduce overhead slightly, since fewer columns take part in the join.
val dfJoinResult = df1.select("column1", "column2").join(df2.select("col1"), joinExpr, joinType)
But remember to include the columns you join on in each select: the select runs first, and the join can only use the columns that remain.
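As a concrete sketch of selecting before joining, using the sample data from the question: the names slim1, slim2, and joined and the local SparkSession are assumptions for illustration. Note that "id" is kept in both selects because it is the join key.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("select-before-join-sketch").getOrCreate()
import spark.implicits._

val df1 = List(("1", "new", "current"), ("2", "closed", "saving"), ("3", "blocked", "credit"))
  .toDF("id", "type", "account")
val df2 = List(("1", "7"), ("2", "5"), ("5", "8")).toDF("id", "value")

// Keep the join key plus only the columns needed downstream.
val slim1 = df1.select("id", "type")
val slim2 = df2.select("id", "value")

val joined = slim1.join(slim2, Seq("id"), "inner")
joined.show()
```

Spark's optimizer usually prunes unused columns on its own, so the gain from doing this by hand may be small; comparing the plans with explain() is one way to check.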

Rename columns when selecting from a DataFrame

I have 2 dataframes : df1 and df2 and I am left joining both of them on id column and saving it to another dataframe named df3. Below is the code that I am using, which works fine as expected.
val df3 = df1.alias("tab1").join(df2.alias("tab2"),Seq("id"),"left_outer").select("tab1.*","tab2.name","tab2.dept","tab2.descr");
I would like to rename the tab2.descr column to dept_full_description within the above statement.
I am aware that I could create a Seq like the one below and use the toDF method:
val columnsRenamed = Seq("id", "empl_name", "name", "dept", "dept_full_description")
val df4 = df3.toDF(columnsRenamed: _*)
Is there any other way to do the aliasing in the first statement itself? My end goal is to avoid listing about 30-40 columns explicitly.
I'd rename before join:
df1.alias("tab1").join(
df2.withColumnRenamed("descr", "dept_full_description").alias("tab2"),
Seq("id"), "left_outer")
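Another option is to alias the column inside the select itself by passing Column objects instead of strings. The sketch below assumes made-up sample rows, a local SparkSession, and that df1 has the columns id and empl_name, as the question's column list suggests.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("alias-in-select-sketch").getOrCreate()
import spark.implicits._

// Made-up sample data shaped like the question's DataFrames.
val df1 = Seq((1, "alice")).toDF("id", "empl_name")
val df2 = Seq((1, "n1", "d1", "long description")).toDF("id", "name", "dept", "descr")

val df3 = df1.alias("tab1")
  .join(df2.alias("tab2"), Seq("id"), "left_outer")
  .select(
    col("tab1.*"),
    col("tab2.name"),
    col("tab2.dept"),
    // Rename only this column; the others keep their names.
    col("tab2.descr").as("dept_full_description")
  )
df3.printSchema()
```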

Check a condition across two columns in two different DataFrames in Spark

Suppose one DataFrame has a column, and another DataFrame has a column with a similar schema. How can I check whether the values in the two columns are the same without joining them, since there is no common attribute?
DF1
serial_nm
abc
mnc
pqr
DF2
ser_nm
hgf
mnc
uio
pqr
lok
And I want a third DataFrame, DF3, as output:
DF3
mnc
pqr
I tried this
val DF3 = DF1.filter(DF1("serial_nm") === DF2("ser_nm"))
But it's not working.
Please help.
Thanks!
I believe you can use a join. Consider using it like this:
val DF3 = DF1.join(DF2, DF1("serial_nm") === DF2("ser_nm"))
or
val DF3 = DF1.join(DF2).where(DF1("serial_nm") === DF2("ser_nm"))
Both approaches are equivalent.
Note: To avoid problems with ambiguous column names, one option is to rename the columns before the join:
val df2_renamed = DF2
  .withColumnRenamed("ser_nm", "df2_ser_nm")
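If only the matching rows of DF1 are wanted, as in the expected DF3, a left semi join is one way to get them, because it returns only the left side's columns. This is a sketch using the question's sample values; the local SparkSession is an assumption for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("semi-join-sketch").getOrCreate()
import spark.implicits._

val DF1 = Seq("abc", "mnc", "pqr").toDF("serial_nm")
val DF2 = Seq("hgf", "mnc", "uio", "pqr", "lok").toDF("ser_nm")

// Keep the rows of DF1 whose serial_nm has a match in DF2's ser_nm;
// no columns from DF2 are added to the result.
val DF3 = DF1.join(DF2, DF1("serial_nm") === DF2("ser_nm"), "left_semi")
DF3.show()
```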