I've been trying to append one DataFrame to another in Scala (Spark). The append in this case simply adds a new column of the same size alongside the existing column - no key matching is involved. Both DataFrames have the same shape (5 rows and 1 column each).
scala> val coefficients = lrModel.coefficients.toArray.toSeq.toDF("coefficients")
coefficients: org.apache.spark.sql.DataFrame = [coefficients: double]
scala> coefficients.show()
+--------------------+
| coefficients|
+--------------------+
| -59525.0697785032|
| 6957.836000531959|
| 314.2998010755629|
|-0.37884289844065666|
| -1758.154438149325|
+--------------------+
scala> val tvalues = trainingSummary.tValues.toArray.drop(1).toSeq.toDF("t-values")
tvalues: org.apache.spark.sql.DataFrame = [t-values: double]
scala> tvalues.show()
+-------------------+
| t-values|
+-------------------+
| 1.8267249911295418|
| 100.35507390273406|
| -8.768588605222108|
|-0.4656738230173362|
| 10.48091833711012|
+-------------------+
The join() call runs and I can even print the schema, but when I try to display the values of the new DF I get this error:
scala> val outputModelDF1 = coefficients.join(tvalues)
outputModelDF1: org.apache.spark.sql.DataFrame = [coefficients: double, t-values: double]
scala> outputModelDF1.printSchema()
root
|-- coefficients: double (nullable = false)
|-- t-values: double (nullable = false)
scala> outputModelDF1.show()
org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join between logical plans
Project [value#359 AS coefficients#361]
+- LocalRelation [value#359]
and
Project [value#368 AS t-values#370]
+- LocalRelation [value#368]
Join condition is missing or trivial.
Use the CROSS JOIN syntax to allow cartesian products between these relations.;
at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1080)
at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1077)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts.apply(Optimizer.scala:1077)
at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts.apply(Optimizer.scala:1062)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2832)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2153)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2366)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:245)
at org.apache.spark.sql.Dataset.show(Dataset.scala:644)
at org.apache.spark.sql.Dataset.show(Dataset.scala:603)
at org.apache.spark.sql.Dataset.show(Dataset.scala:612)
... 52 elided
Any idea how to deal with this and simply merge these two DFs?
UPDATE 1
I should have stated the desired format of the output I want to achieve. Please see below:
+--------------------+--------------------+
| coefficients| t-values|
+--------------------+--------------------+
| -59525.0697785032| 1.8267249911295418|
| 6957.836000531959| 100.35507390273406|
| 314.2998010755629| -8.768588605222108|
|-0.37884289844065666| -0.4656738230173362|
| -1758.154438149325|  10.48091833711012|
+--------------------+--------------------+
UPDATE 2
Unfortunately, the following approach using withColumn() didn't work.
scala> val outputModelDF1 = coefficients.withColumn("t-values", tvalues("t-values"))
org.apache.spark.sql.AnalysisException: resolved attribute(s) t-values#119 missing from coefficients#113 in operator !Project [coefficients#113, t-values#119 AS t-values#130];;
!Project [coefficients#113, t-values#119 AS t-values#130]
+- Project [value#111 AS coefficients#113]
+- LocalRelation [value#111]
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:347)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:66)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2872)
at org.apache.spark.sql.Dataset.select(Dataset.scala:1153)
at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:1908)
... 52 elided
One approach would be to create key columns in the dataframes for the join using monotonicallyIncreasingId:
import org.apache.spark.sql.functions.monotonicallyIncreasingId

val df1 = Seq(
  -59525.0697785032, 6957.836000531959, 314.2998010755629, -0.37884289844065666, -1758.154438149325
).toDF("coefficients")

val df2 = Seq(
  1.8267249911295418, 100.35507390273406, -8.768588605222108, -0.4656738230173362, 10.48091833711012
).toDF("t-values")

val df1R = df1.withColumn("rowid", monotonicallyIncreasingId)
val df2R = df2.withColumn("rowid", monotonicallyIncreasingId)

val dfJoined = df1R.join(df2R, Seq("rowid"))
dfJoined.show
+-----+--------------------+-------------------+
|rowid| coefficients| t-values|
+-----+--------------------+-------------------+
| 0| -59525.0697785032| 1.8267249911295418|
| 1| 6957.836000531959| 100.35507390273406|
| 2| 314.2998010755629| -8.768588605222108|
| 3|-0.37884289844065666|-0.4656738230173362|
| 4| -1758.154438149325| 10.48091833711012|
+-----+--------------------+-------------------+
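Note that the generated IDs are only guaranteed to be increasing and unique, not consecutive, so the two rowid columns are only guaranteed to line up when both DataFrames are built and partitioned the same way. A more defensive sketch (reusing df1 and df2 from above; withRowIndex is just an illustrative helper name and spark is the usual SparkSession) zips each DataFrame with an explicit index via the RDD API, joins on it, and drops the helper column afterwards:

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Hypothetical helper: append a consecutive row index to any DataFrame
def withRowIndex(df: DataFrame): DataFrame = {
  val rows = df.rdd.zipWithIndex().map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
  spark.createDataFrame(rows, StructType(df.schema.fields :+ StructField("rowid", LongType, nullable = false)))
}

// Join on the explicit index and drop it once the columns are side by side
val dfZipped = withRowIndex(df1)
  .join(withRowIndex(df2), Seq("rowid"))
  .drop("rowid")

dfZipped.show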
Related
In the code below I'm trying to merge a dataframe into a delta table.
I join the new dataframe with the delta table, transform the joined data to match the delta table's schema, and then merge the result into the delta table.
But I'm getting an AnalysisException:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved attribute(s) id#514 missing from _file_name_#872,age#516,id#879,name#636,age#881,name#880,city#882,id#631,_row_id_#866L,city#641 in operator !Join Inner, (id#514 = id#631). Attribute(s) with the same name appear in the operation: id. Please check if the right attribute(s) are used.;;
!Join Inner, (id#514 = id#631)
:- SubqueryAlias deltaData
: +- Project [id#631, name#636, age#516, city#641]
: +- Project [age#516, id#631, name#636, new_city#510 AS city#641]
: +- Project [age#516, id#631, new_name#509 AS name#636, new_city#510]
: +- Project [age#516, new_id#508 AS id#631, new_name#509, new_city#510]
: +- Project [age#516, new_id#508, new_name#509, new_city#510]
: +- Join Inner, (id#514 = new_id#508)
: :- Relation[id#514,name#515,age#516,city#517] parquet
: +- LocalRelation [new_id#508, new_name#509, new_city#510]
+- Project [id#879, name#880, age#881, city#882, _row_id_#866L, input_file_name() AS _file_name_#872]
+- Project [id#879, name#880, age#881, city#882, monotonically_increasing_id() AS _row_id_#866L]
+- Project [id#854 AS id#879, name#855 AS name#880, age#856 AS age#881, city#857 AS city#882]
+- Relation[id#854,name#855,age#856,city#857] parquet
My setup is Spark 3.0.0, Delta Lake 0.7.0, and Hadoop 2.7.4.
The same code runs fine on the Databricks 7.4 runtime, where the new dataframe is merged into the delta table as expected.
Code Snippet:
import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.{SaveMode, SparkSession}
object CodePen extends App {
  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  val deltaPath = "<delta-path>"

  val oldEmployee = Seq(
    Employee(10, "Django", 22, "Bangalore"),
    Employee(11, "Stephen", 30, "Bangalore"),
    Employee(12, "Calvin", 25, "Hyderabad"))
  val newEmployee = Seq(EmployeeNew(10, "Django", "Bangkok"))

  spark.createDataFrame(oldEmployee).write.format("delta").mode(SaveMode.Overwrite).save(deltaPath) // Saving the data to a delta table
  val newDf = spark.createDataFrame(newEmployee)
  val deltaTable = DeltaTable.forPath(deltaPath)

  val joinedDf = deltaTable.toDF.join(newDf, col("id") === col("new_id"), "inner")
  joinedDf.show()

  val cols = newDf.columns

  // Transforming the joined dataframe to match the schema of the delta table
  var intDf = joinedDf.drop(cols.map(removePrefix): _*)
  for (column <- newDf.columns)
    intDf = intDf.withColumnRenamed(column, removePrefix(column))
  intDf = intDf.select(deltaTable.toDF.columns.map(col): _*)

  deltaTable.toDF.show()
  intDf.show()

  deltaTable.as("oldData")
    .merge(
      intDf.as("deltaData"),
      col("oldData.id") === col("deltaData.id"))
    .whenMatched()
    .updateAll()
    .execute()

  deltaTable.toDF.show()

  def removePrefix(column: String) = {
    column.replace("new_", "")
  }
}
case class Employee(id: Int, name: String, age: Int, city: String)
case class EmployeeNew(new_id: Int, new_name: String, new_city: String)
Below are the outputs of the dataframes.
New Dataframe:
+---+------+-------+
| id| name| city|
+---+------+-------+
| 10|Django|Bangkok|
+---+------+-------+
Joined Dataframe:
+---+------+---+---------+------+--------+--------+
| id| name|age| city|new_id|new_name|new_city|
+---+------+---+---------+------+--------+--------+
| 10|Django| 22|Bangalore| 10| Django| Bangkok|
+---+------+---+---------+------+--------+--------+
Delta Table Data:
+---+-------+---+---------+
| id| name|age| city|
+---+-------+---+---------+
| 11|Stephen| 30|Bangalore|
| 12| Calvin| 25|Hyderabad|
| 10| Django| 22|Bangalore|
+---+-------+---+---------+
Transformed New Dataframe:
+---+------+---+-------+
| id| name|age| city|
+---+------+---+-------+
| 10|Django| 22|Bangkok|
+---+------+---+-------+
You are getting this AnalysisException because the schemas of deltaTable and intDf are slightly different:
deltaTable.toDF.printSchema()
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
|-- city: string (nullable = true)
intDf.printSchema()
root
|-- id: integer (nullable = false)
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
|-- city: string (nullable = true)
Because intDf is the result of a join in which the column "id" is used as a key, the join condition column is forced to be non-nullable.
If you change the nullable property as explained here, you will get the desired output:
+---+-------+---+---------+
| id| name|age| city|
+---+-------+---+---------+
| 11|Stephen| 30|Bangalore|
| 12| Calvin| 25|Hyderabad|
| 10| Django| 22| Bangkok|
+---+-------+---+---------+
Tested with Spark 3.0.1 and Delta 0.7.0.
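For completeness, one common way to relax the nullability (a sketch only; intDf and spark refer to the objects in the snippet above, and relaxedSchema/intDfNullable are just illustrative names) is to rebuild the DataFrame with a copy of its schema in which every field is marked nullable, and use that in the merge instead:

import org.apache.spark.sql.types.StructType

// Copy the schema with every field forced to nullable = true, then rebuild the DataFrame
val relaxedSchema = StructType(intDf.schema.map(_.copy(nullable = true)))
val intDfNullable = spark.createDataFrame(intDf.rdd, relaxedSchema)
intDfNullable.printSchema()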
This is a follow-up question to my previous question.
scala> val map1 = spark.sql("select map('s1', 'p1', 's2', 'p2', 's3', 'p3') as lookup")
map1: org.apache.spark.sql.DataFrame = [lookup: map<string,string>]
scala> val ds1 = spark.sql("select 'p1' as p, Array('s2','s3') as c")
ds1: org.apache.spark.sql.DataFrame = [p: string, c: array<string>]
scala> ds1.createOrReplaceTempView("ds1")
scala> map1.createOrReplaceTempView("map1")
scala> map1.show()
+--------------------+
| lookup|
+--------------------+
|[p1 -> s1, p2 -> ...|
+--------------------+
scala> ds1.show()
+---+--------+
| p| c|
+---+--------+
| p1|[s2, s3]|
+---+--------+
scala> map1.selectExpr("element_at(`lookup`, 's2')").first()
res50: org.apache.spark.sql.Row = [p2]
scala> spark.sql("select element_at(`lookup`, 's1') from map1").show()
+----------------------+
|element_at(lookup, s1)|
+----------------------+
| p1|
+----------------------+
So far so good. In my next two steps I am hitting some issues:
scala> ds1.selectExpr("p", "c", "transform(c, cs -> map1.selectExpr('element_at(`lookup`, cs)')) as cs").show()
20/09/28 19:44:59 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
20/09/28 19:44:59 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
20/09/28 19:45:03 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
20/09/28 19:45:03 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore root#10.1.21.76
20/09/28 19:45:03 WARN ObjectStore: Failed to get database map1, returning NoSuchObjectException
org.apache.spark.sql.AnalysisException: Undefined function: 'selectExpr'. This function is neither a registered temporary function nor a permanent function registered in the database 'map1'.; line 1 pos 19
scala> spark.sql("""select p, c, transform(c, cs -> (select element_at(`lookup`, cs) from map1)) cc from ds1""").show()
org.apache.spark.sql.AnalysisException: cannot resolve 'cs' given input columns: [map1.lookup]; line 1 pos 61;
'Project [p#329, c#330, transform(c#330, lambdafunction(scalar-subquery#713 [], lambda cs#715, false)) AS cc#714]
:  +- 'Project [unresolvedalias('element_at(lookup#327, 'cs), None)]
:     +- SubqueryAlias map1
:        +- Project [map(s1, p1, s2, p2, s3, p3) AS lookup#327]
:           +- OneRowRelation
+- SubqueryAlias ds1
   +- Project [p1 AS p#329, array(s2, s3) AS c#330]
      +- OneRowRelation
How can I solve these issues?
Simply add both table names to the from clause:
spark.sql("""select p, c, transform(c, cs -> element_at(`lookup`, cs)) cc from ds1 a, map1 b""").show()
+---+--------+--------+
| p| c| cc|
+---+--------+--------+
| p1|[s2, s3]|[p2, p3]|
+---+--------+--------+
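If you prefer the DataFrame API over a SQL string, the same cross-join can be written roughly like this (a sketch using the ds1 and map1 DataFrames defined above; transform and element_at require Spark 2.4+):

import org.apache.spark.sql.functions.expr

// Cross join the one-row lookup onto ds1, then apply the same higher-order transform
ds1.crossJoin(map1)
  .select(expr("p"), expr("c"), expr("transform(c, cs -> element_at(lookup, cs))").as("cc"))
  .show()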
If map1 doesn't have too many rows, you could do a cross join with the set of all values extracted from the array(s) in the c column:
spark.sql("select col as value, element_at(map1.lookup, col) as key " +
  "from (select explode(ds1.c) from ds1) as v cross join map1")
Result (assigning the above to a value of type DataFrame, and calling .show):
+-----+---+
|value|key|
+-----+---+
| s2| p2|
| s3| p3|
+-----+---+
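Another option, assuming the lookup really is a single small row, is to collect it to the driver and inline it as a map literal, which avoids the join entirely (a sketch; lookupMap is just an illustrative name):

import org.apache.spark.sql.functions.typedLit

// Pull the one-row map to the driver and attach it as a literal column
val lookupMap = map1.first().getMap[String, String](0).toMap
ds1.withColumn("lookup", typedLit(lookupMap))
  .selectExpr("p", "c", "transform(c, cs -> element_at(lookup, cs)) as cc")
  .show()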
I have written a Scala function to join two dataframes with the same schema, say df1 and df2. For every key in df1, if the key also exists in df2, we take the values from df2 for that key; otherwise we keep df1's values. It is supposed to return a dataframe with the same number of rows as df1 but with different values, yet the function doesn't work and returns the same df as df1.
def joinDFwithConditions(df1: DataFrame, df2: DataFrame, key_seq: Seq[String]) = {
  var final_df = df1.as("a").join(df2.as("b"), key_seq, "left_outer")
  // set of non-key columns
  val col_str = df1.columns.toSet -- key_seq.toSet
  for (c <- col_str) { // for every matched record, check values from both dataframes
    final_df = final_df
      .withColumn(s"$c",
        when(col(s"b.$c").isNull || col(s"b.$c").isNaN, col(s"a.$c"))
          .otherwise(col(s"b.$c")))
    // I used to re-assign the value with the reference "t.$c",
    // but it returns an error saying no t.col is found in the schema
  }
  final_df.show()
  final_df.select(df1.columns.map(x => df1(x)): _*)
}
def main(args: Array[String]) {
  val sparkSession = SparkSession.builder().appName(this.getClass.getName)
    .config("spark.hadoop.validateOutputSpecs", "false")
    .enableHiveSupport()
    .getOrCreate()

  import sparkSession.implicits._

  val df1 = List(("key1", 1), ("key2", 2), ("key3", 3)).toDF("x", "y")
  val df2 = List(("key1", 9), ("key2", 8)).toDF("x", "y")

  joinDFwithConditions(df1, df2, Seq("x")).show()
  sparkSession.stop()
}
df1 sample
+----+---+
|   x|  y|
+----+---+
|key1|  1|
|key2|  2|
|key3|  3|
+----+---+
df2 sample
+----+---+
|   x|  y|
+----+---+
|key1|  9|
|key2|  8|
+----+---+
expected results:
+----+---+
|   x|  y|
+----+---+
|key1|  9|
|key2|  8|
|key3|  3|
+----+---+
what it actually shows:
+-------+---+---+
| x | y| y|
+-------+---+---+
| key1 | 9| 9|
| key2 | 8| 8|
| key3 | 3| 3|
+-------+---+---+
error message
ERROR ApplicationMaster: User class threw exception: org.apache.spark.sql.AnalysisException: Resolved attribute(s) y#6 missing from x#5,y#21,y#22 in operator !Project [x#5, y#6]. Attribute(s) with the same name appear in the operation: y. Please check if the right attribute(s) are used.;;
!Project [x#5, y#6]
+- Project [x#5, CASE WHEN (isnull(y#15) || isnan(cast(y#15 as double))) THEN y#6 ELSE y#15 END AS y#21, CASE WHEN (isnull(y#15) || isnan(cast(y#15 as double))) THEN y#6 ELSE y#15 END AS y#22]
+- Project [x#5, y#6, y#15]
+- Join LeftOuter, (x#5 = x#14)
:- SubqueryAlias `a`
: +- Project [_1#2 AS x#5, _2#3 AS y#6]
: +- LocalRelation [_1#2, _2#3]
+- SubqueryAlias `b`
+- Project [_1#11 AS x#14, _2#12 AS y#15]
+- LocalRelation [_1#11, _2#12]
When you do df.as("a"), you do not rename the columns of the dataframe. You simply make them accessible as a.columnName in order to lift an ambiguity. Therefore, your when works because you use the aliases, but you end up with multiple y columns. I am, by the way, quite surprised that it manages to replace one of the y columns...
When you then try to access a column by its bare name y (without the prefix), Spark does not know which one you want and throws an error.
To avoid errors, you could simply do everything you need with one select like this:
df1.as("a").join(df2.as("b"), key_cols, "left_outer")
  .select(key_cols.map(col) ++
    df1.columns
      .diff(key_cols)
      .map(c => when(col(s"b.$c").isNull || col(s"b.$c").isNaN, col(s"a.$c"))
        .otherwise(col(s"b.$c"))
        .alias(c)) : _*)
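Folded back into the original function signature, that could look roughly as follows (a sketch; joinDFwithConditions and its key_seq parameter come from the question, and the body is just the select above with the imports it needs):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, when}

def joinDFwithConditions(df1: DataFrame, df2: DataFrame, key_seq: Seq[String]): DataFrame =
  df1.as("a").join(df2.as("b"), key_seq, "left_outer")
    .select(key_seq.map(col) ++
      df1.columns
        .diff(key_seq)
        .map(c => when(col(s"b.$c").isNull || col(s"b.$c").isNaN, col(s"a.$c"))
          .otherwise(col(s"b.$c"))
          .alias(c)): _*)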
I have two Spark DataFrames:
df1 with 80 columns
CO01...CO80
+----+----+
|CO01|CO02|
+----+----+
|2.06|0.56|
|1.96|0.72|
|1.70|0.87|
|1.90|0.64|
+----+----+
and df2 with 80 columns
avg(CO01)...avg(CO80)
which holds the mean of each column
+------------------+------------------+
| avg(CO01)| avg(CO02)|
+------------------+------------------+
|2.6185106382978716|1.0080985915492937|
+------------------+------------------+
How can I subtract df2 from df1 for the corresponding columns?
I'm looking for a solution that does not require listing all the columns.
P.S.
In pandas this could simply be done by:
df2 = df1 - df1.mean()
Here is what you can do
scala> val df = spark.sparkContext.parallelize(List(
| (2.06,0.56),
| (1.96,0.72),
| (1.70,0.87),
| (1.90,0.64))).toDF("c1","c2")
df: org.apache.spark.sql.DataFrame = [c1: double, c2: double]
scala>
scala> def subMean(mean: Double) = udf[Double, Double]((value: Double) => value - mean)
subMean: (mean: Double)org.apache.spark.sql.expressions.UserDefinedFunction
scala>
scala> val result = df.columns.foldLeft(df)( (df, col) =>
| { val avg = df.select(mean(col)).first().getAs[Double](0);
| df.withColumn(col, subMean(avg)(df(col)))
| })
result: org.apache.spark.sql.DataFrame = [c1: double, c2: double]
scala>
scala> result.show(10, false)
+---------------------+---------------------+
|c1 |c2 |
+---------------------+---------------------+
|0.15500000000000025 |-0.13749999999999996 |
|0.05500000000000016 |0.022499999999999964 |
|-0.20499999999999985 |0.1725 |
|-0.004999999999999893|-0.057499999999999996|
+---------------------+---------------------+
Hope this helps!
Note that this will work for any number of columns, as long as all columns in the dataframe are of a numeric type.
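If you would rather avoid a UDF, an equivalent sketch computes all the means in a single aggregation and subtracts them with built-in column arithmetic (df is the DataFrame from the snippet above; means and centered are illustrative names):

import org.apache.spark.sql.functions.{avg, col}

// One pass to collect every column mean, then subtract it column by column
val means = df.select(df.columns.map(c => avg(col(c)).alias(c)): _*).first()
val centered = df.select(df.columns.zipWithIndex.map { case (c, i) =>
  (col(c) - means.getDouble(i)).alias(c)
}: _*)
centered.show(false)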
I have two dataframes:
dataframe1
+----------+
|     DATE1|
+----------+
|2017-01-08|
|2017-10-10|
|2017-05-01|
+----------+
dataframe2
+------+----+----------+----------+----+--------+
|  NAME| SID|     DATE1|     DATE2|ROLL|  SCHOOL|
+------+----+----------+----------+----+--------+
| Sayam|22.0|  8/1/2017|  7 1 2017|3223|  BHABHA|
|ADARSH| 2.0|10-10-2017|10.03.2017| 222|SUNSHINE|
| SADIM| 1.0|  1.5.2017|  1/2/2017| 111|     DAV|
+------+----+----------+----------+----+--------+
Expected output
+------+----+----------+----------+----+--------+
|  NAME| SID|     DATE1|     DATE2|ROLL|  SCHOOL|
+------+----+----------+----------+----+--------+
| Sayam|22.0|2017-01-08|  7 1 2017|3223|  BHABHA|
|ADARSH| 2.0|2017-10-10|10.03.2017| 222|SUNSHINE|
| SADIM| 1.0|2017-05-01|  1/2/2017| 111|     DAV|
+------+----+----------+----------+----+--------+
I want to replace the DATE1 column in dataframe2 with the DATE1 column of dataframe1. I need a generic solution.
Any help will be appreciated.
I have tried the withColumn method as follows:
dataframe2.withColumn(newColumnTransformInfo._1, dataframe1.col("DATE1").cast(DateType))
But I'm getting an error:
org.apache.spark.sql.AnalysisException: resolved attribute(s)
You cannot directly add a column from another dataframe.
What you could do is join the two dataframes and keep the columns you want. Both dataframes must have a common join column; if you do not have one and the data is in order, you can assign an increasing id to both dataframes and then join on it.
Here is a simple example for your case:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Dummy data
val df1 = Seq(
  ("2017-01-08"),
  ("2017-10-10"),
  ("2017-05-01")
).toDF("DATE1")

val df2 = Seq(
  ("Sayam", 22.0, "2017-01-08", "7 1 2017", 3223, "BHABHA"),
  ("ADARSH", 2.0, "2017-10-10", "10.03.2017", 222, "SUNSHINE"),
  ("SADIM", 1.0, "2017-05-01", "1/2/2017", 111, "DAV")
).toDF("NAME", "SID", "DATE1", "DATE2", "ROLL", "SCHOOL")

// Create a new dataframe1 with an added id column
val rows1 = df1.rdd.zipWithIndex().map {
  case (r: Row, id: Long) => Row.fromSeq(id +: r.toSeq)
}
val dataframe1 = spark.createDataFrame(rows1,
  StructType(StructField("id", LongType, false) +: df1.schema.fields))

// Create a new dataframe2 with an added id column
val rows2 = df2.rdd.zipWithIndex().map {
  case (r: Row, id: Long) => Row.fromSeq(id +: r.toSeq)
}
val dataframe2 = spark.createDataFrame(rows2,
  StructType(StructField("id", LongType, false) +: df2.schema.fields))

dataframe2.drop("DATE1")
  .join(dataframe1, "id")
  .drop("id")
  .show()
Output:
+------+----+----------+----+--------+----------+
| NAME| SID| DATE2|ROLL| SCHOOL| DATE1|
+------+----+----------+----+--------+----------+
| Sayam|22.0| 7 1 2017|3223| BHABHA|2017-01-08|
|ADARSH| 2.0|10.03.2017| 222|SUNSHINE|2017-10-10|
| SADIM| 1.0| 1/2/2017| 111| DAV|2017-05-01|
+------+----+----------+----+--------+----------+
Hope this helps!
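As a shorter alternative, and assuming both DataFrames are small and their rows are already in matching order, the monotonically_increasing_id trick from earlier in this thread also works here, with the caveat that the generated ids only line up when both DataFrames share the same partitioning (a sketch reusing df1 and df2 from above; rid and result are illustrative names):

import org.apache.spark.sql.functions.monotonically_increasing_id

// Attach a generated id to both sides, join on it, then drop the helper column
val result = df2.drop("DATE1")
  .withColumn("rid", monotonically_increasing_id())
  .join(df1.withColumn("rid", monotonically_increasing_id()), "rid")
  .drop("rid")

result.show()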