AnalysisException: Failure when resolving conflicting references in Join: 'Join Inner - scala

I have this simple code:
var count = event_stream
  .groupBy("value").count()
event_stream.join(count, "value").printSchema() // get error on this line
The event_stream and count schemas are as follows:
root
|-- key: binary (nullable = true)
|-- value: binary (nullable = true)
|-- topic: string (nullable = true)
|-- partition: integer (nullable = true)
|-- offset: long (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- timestampType: integer (nullable = true)
root
|-- value: binary (nullable = true)
|-- count: long (nullable = false)
Two questions:
Why do I get this error and how do I fix it?
Why does groupBy.count drop all other columns?
The error is as follows:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Failure when resolving conflicting references in Join:
'Join Inner
:- AnalysisBarrier
: +- StreamingRelationV2 org.apache.spark.sql.kafka010.KafkaSourceProvider#7f2c57fe, kafka, Map(startingOffsets -> latest, failOnDataLoss -> false, subscribe -> events-identification-carrier, kafka.bootstrap.servers -> svc-kafka-pre-c1-01.jamba.net:9092), [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13], StreamingRelation DataSource(org.apache.spark.sql.SparkSession#3dbd7107,kafka,List(),None,List(),None,Map(startingOffsets -> latest, failOnDataLoss -> false, subscribe -> events-identification-carrier, kafka.bootstrap.servers -> svc-kafka-pre-c1-01.jamba.net:9092),None), kafka, [key#0, value#1, topic#2, partition#3, offset#4L, timestamp#5, timestampType#6]
+- AnalysisBarrier
+- Aggregate [value#8], [value#8, count(1) AS count#46L]
+- StreamingRelationV2 org.apache.spark.sql.kafka010.KafkaSourceProvider#7f2c57fe, kafka, Map(startingOffsets -> latest, failOnDataLoss -> false, subscribe -> events-identification-carrier, kafka.bootstrap.servers -> svc-kafka-pre-c1-01.jamba.net:9092), [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13], StreamingRelation DataSource(org.apache.spark.sql.SparkSession#3dbd7107,kafka,List(),None,List(),None,Map(startingOffsets -> latest, failOnDataLoss -> false, subscribe -> events-identification-carrier, kafka.bootstrap.servers -> svc-kafka-pre-c1-01.jamba.net:9092),None), kafka, [key#0, value#1, topic#2, partition#3, offset#4L, timestamp#5, timestampType#6]
Conflicting attributes: value#8
EDIT: yes! Changing the name of the columns works.
But now, if I use the join, I have to use OutputMode.Append, and for that I need to add watermarks to the stream.
What I want is to extract the count and topic (from the schema printed above) into the resulting DF and write that to some sink.
Two questions:
Is there any other/better way to do this?
Can I do multiple aggregations like count() and also add another column of String type, i.e. topic in this case?

Why do I get this error and how do I fix it?
I think you are getting the error because the final joined schema contains two value fields, one from each side of the join. To fix this, rename the "value" field on one of the two joined streams, like this:
var count = event_stream
  .groupBy("value").count()
  .withColumnRenamed("value", "join_id")

event_stream.join(count, $"value" === $"join_id")
  .drop("join_id")
  .printSchema()
Why does groupBy.count drop all other columns?
groupBy operations basically divide your fields into two lists: a list of fields to use as the key, and a list of fields to aggregate. The key fields show up as-is in the final result, but any field not in the key list needs an aggregate operation defined for it to show up in the result. Otherwise Spark has no way to know how you want to combine multiple values of that field! Did you want to just count it? Did you want the max value? Did you want to see all distinct values? To specify how to roll up a field, you define it in an .agg(..) call.
Example:
val input = Seq(
  (1, "Bob", 4),
  (1, "John", 5)
).toDF("key", "name", "number")

input.groupBy("key")
  .agg(collect_set("name") as "names",
       max("number") as "maxnum")
  .show
+---+-----------+------+
|key|      names|maxnum|
+---+-----------+------+
|  1|[Bob, John]|     5|
+---+-----------+------+
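Applied to the Kafka stream above, a hedged sketch for keeping topic alongside the count (assuming you are happy to group by both columns, so no stream-to-stream join is needed afterwards):
// Sketch only: include topic in the grouping key so it survives the aggregation
val countsWithTopic = event_stream
  .groupBy("value", "topic")
  .count()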

The reason for the error is the column name used for joining. You can express the join with a Seq of column names instead, like this:
var count = event_stream
  .groupBy("value").count()

event_stream.join(count, Seq("value"))

Related

How to filter a mongo collection based spark dataframe with a list of ObjectId

I am trying to read MongoDB collections using mongo-spark-connector.jar.
I have 2 collections, Agents and Users, and my tasks are:
From the Agents collection, pick all agentIds, which are of type String.
Fetch the documents from the Users collection based on the agent ids obtained from the Agents collection; agentId in Agents needs to be matched with userId in Users.
UserId in the Users collection is of type ObjectId:
|-- userId: struct (nullable = true)
| |-- oid: string (nullable = true)
I am able to pull the list of agentIds from the Agents collection:
val agentIdsList = agentIdsDF.withColumnRenamed("agentId","oid").as[ObjectId].distinct().collect().toList
The issue is that when I try to filter the Users collection with the list of agentIds, I get an error:
val usersDF = usersCollDF.filter(col("user").isin(agentIdsList:_*))
Error:
Exception in thread "main" java.lang.RuntimeException: Unsupported literal type class com.mongodb.spark.sql.fieldTypes.ObjectId
Could someone please help with how to filter Mongo data based on an ObjectId field?
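A hedged sketch of one possible workaround (not from the original post, and assuming agentId is a plain String column): compare the nested userId.oid string instead of an ObjectId value, so the filter only uses supported string literals:
import spark.implicits._

// Collect the agent ids as plain strings (hypothetical column name "agentId")
val agentIdStrings = agentIdsDF
  .select("agentId")
  .distinct()
  .as[String]
  .collect()

// Filter on the nested oid string field instead of the ObjectId type
val usersDF = usersCollDF.filter(col("userId.oid").isin(agentIdStrings: _*))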

Filter out rows by accessing a nested element in a row

I'm working with some Yelp data. Here is what a row from the provided .json file looks like:
{"business_id":"f9NumwFMBDn751xgFiRbNA",
"name":"The Range At Lake Norman",
"address":"10913 Bailey Rd",
"city":"Cornelius",
"state":"NC",
"postal_code":"28031",
"latitude":35.4627242,
"longitude":-80.8526119,
"stars":3.5,
"review_count":36,
"is_open":1,
"attributes":{"BusinessAcceptsCreditCards":"True","BikeParking":"True","GoodForKids":"False","BusinessParking":"{'garage': False, 'street': False, 'validated': False, 'lot': True, 'valet': False}",
"ByAppointmentOnly":"False","RestaurantsPriceRange2":"3"},
"categories":"Active Life, Gun\/Rifle Ranges, Guns & Ammo, Shopping",
"hours":{"Monday":"10:0-18:0","Tuesday":"11:0-20:0","Wednesday":"10:0-18:0","Thursday":"11:0-20:0","Friday":"11:0-20:0","Saturday":"11:0-20:0","Sunday":"13:0-18:0"}}
What I'd like to do is access the "BikeParking" attribute within the "attributes" column and filter based on its value. Right now I have something like:
df.filter(functions.explode(df['attributes']).BikeParking == False)
This, however, returns the following error:
"pyspark.sql.utils.AnalysisException: Generators are not supported outside the SELECT clause, but got: 'Filter (explode(attributes#8)[GoodForKids] = false);"
How would I be able to achieve this?
The attributes column is of struct type, and explode only works with array or map types. Instead, access the struct field directly with dot notation (attributes.<field_name>), or expand all of its fields with attributes.*.
Example:
from pyspark.sql.functions import *
from pyspark.sql.types import *
spark.\
read.\
option("multiline","true").\
json("yelp.json").\
filter(col("attributes.BikeParking").cast("boolean")).\
show()
#+---------------+--------------------+--------------------+--------------------+---------+--------------------+-------+----------+-----------+--------------------+-----------+------------+-----+-----+
#| address| attributes| business_id| categories| city| hours|is_open| latitude| longitude| name|postal_code|review_count|stars|state|
#+---------------+--------------------+--------------------+--------------------+---------+--------------------+-------+----------+-----------+--------------------+-----------+------------+-----+-----+
#|10913 Bailey Rd|[True, True, {'ga...|f9NumwFMBDn751xgF...|Active Life, Gun/...|Cornelius|[11:0-20:0, 10:0-...| 1|35.4627242|-80.8526119|The Range At Lake...| 28031| 36| 3.5| NC|
#+---------------+--------------------+--------------------+--------------------+---------+--------------------+-------+----------+-----------+--------------------+-----------+------------+-----+-----+
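For reference, a rough Scala equivalent of the same approach (a sketch, assuming the same file name and that the nested field casts cleanly to boolean):
import org.apache.spark.sql.functions.col

// Read the multiline JSON and keep rows where attributes.BikeParking is true
spark.read
  .option("multiline", "true")
  .json("yelp.json")
  .filter(col("attributes.BikeParking").cast("boolean"))
  .show()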

How do I efficiently map keys from one dataset based on values from other dataset

Assuming data frame 1 represents a target country and its list of source countries, and data frame 2 represents the availability of all countries, find all pairs from data frame 1 where the target country's mapping is TRUE and the source country's mapping is FALSE:
Dataframe 1 (targetId, sourceId):
USA: China, Russia, India, Japan
China: USA, Russia, India
Russia: USA, Japan
Dataframe 2 (id, available):
USA: true
China: false
Russia: true
India: false
Japan: true
Result Dataset should look like:
(USA, China),
(USA, India)
My idea is to first explode data set 1 into a new data frame (say, tempDF), add 2 new columns to it, targetAvailable and sourceAvailable, and finally filter for targetAvailable = true and sourceAvailable = false to get the desired result data frame.
Below is the snippet of my code:
val sourceDF = sourceData.toDF("targetId", "sourceId")
val mappingDF = mappingData.toDF("id", "available")
val tempDF = sourceDF.select(col("targetId"),
explode(col("sourceId")).as("source_id_split"))
val resultDF = tempDF.select("targetId")
.withColumn("targetAvailable", isAvailable(tempDF.col("targetId")))
.withColumn("sourceAvailable", isAvailable(tempDF.col("source_id_split")))
/*resultDF.select("targetId", "sourceId").
filter(col("targetAvailable") === "true" and col("sourceAvailable")
=== "false").show()*/
// udf to find the availability value for the given id from the mapping table
val isAvailable = udf((searchId: String) => {
val rows = mappingDF.select("available")
.filter(col("id") === searchId).collect()
if (rows(0)(0).toString.equals("true")) "true" else "false" })
Calling the isAvailable UDF while calculating resultDF throws a weird exception. Am I doing something wrong? Is there a better/simpler way to do this?
In your UDF, you are referencing another dataframe, which is not possible, hence the "weird" exception you get.
You want to filter one dataframe based on values contained in another. What you need is a join on the id columns; two joins, actually, in your case: one for the targets and one for the sources.
The idea of using explode, however, is very good. Here is a way to achieve what you want:
// generating data, please provide this code next time ;-)
val sourceDF = Seq(
  "USA"    -> Seq("China", "Russia", "India", "Japan"),
  "China"  -> Seq("USA", "Russia", "India"),
  "Russia" -> Seq("USA", "Japan"))
  .toDF("targetId", "sourceId")

val mappingDF = Seq(
  "USA" -> true, "China" -> false,
  "Russia" -> true, "India" -> false,
  "Japan" -> true)
  .toDF("id", "available")

sourceDF
  // we can filter available targets before exploding.
  // let's do it to be more efficient.
  .join(mappingDF.withColumnRenamed("id", "targetId"), Seq("targetId"))
  .where('available)
  // exploding the sources
  .select('targetId, explode('sourceId) as "sourceId")
  // then we keep only non available sources
  .join(mappingDF.withColumnRenamed("id", "sourceId"), Seq("sourceId"))
  .where(! 'available)
  .select("targetId", "sourceId")
  .show(false)
which yields
+--------+--------+
|targetId|sourceId|
+--------+--------+
|USA |China |
|USA |India |
+--------+--------+

Error "Invalid call to qualifier on unresolved object" when trying to write a Spark DF into a Hive table

I am getting the error "Invalid call to qualifier on unresolved object, tree: 'date1" when trying to write a specific Spark DataFrame into a Hive table.
I am using Spark 2.4.0, but I also tested with Spark 2.4.3 and got the same result.
I know how to avoid the error, but none of these methods is the expected solution, because they all modify the table somehow:
Deleting 1 column from PARTITIONED BY in the create-table code.
Moving the position of column text3 to just after text2.
In the df, creating the boolean1 column as string type instead of boolean (I don't need to change the type in the Hive table).
Deleting 1 of the columns in both df and table (df and table must have the same number of columns):
Obviously, deleting the date1 column, as it is the one named in the error message.
Deleting column text3.
Apart from deleting the date1 column, none of the other methods makes sense to me. I don't understand why applying any of those options fixes the problem.
This is a Scala sample code in order to reproduce the error:
// Create a sample dataframe
import spark.implicits._
val df = Seq(("",
"",
false,
"",
"2019-07-01"))
.toDF("text1",
"text2",
"boolean1",
"text3",
"date1")
// df schema:
df.printSchema
root
|-- text1: string (nullable = true)
|-- text2: string (nullable = true)
|-- boolean1: boolean (nullable = false)
|-- text3: string (nullable = true)
|-- date1: string (nullable = true)
// Create the related hive table
spark.sql("drop table if exists table_sample")
spark.sql("""CREATE TABLE `table_sample` (
`text1` STRING,
`text2` STRING,
`boolean1` BOOLEAN,
`text3` STRING,
`date1` DATE
)
USING ORC
PARTITIONED BY (text1, text2)
""")
// Write the sample dataframe to the hive table
df.write
.mode("overwrite")
.format("orc")
.insertInto("table_sample")
I expect the write to succeed with no error and without needing to change the table schema (no renaming columns, changing column types, or moving columns to another position).
This is the full stacktrace:
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to qualifier on unresolved object, tree: 'trx_business_day_date
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifier(unresolved.scala:107)
at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq$$anonfun$3.apply(package.scala:155)
at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq$$anonfun$3.apply(package.scala:155)
at scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
at scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
at scala.collection.AbstractTraversable.filter(Traversable.scala:104)
at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.<init>(package.scala:155)
at org.apache.spark.sql.catalyst.expressions.package$.AttributeSeq(package.scala:98)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.outputAttributes$lzycompute(LogicalPlan.scala:93)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.outputAttributes(LogicalPlan.scala:93)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:113)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:81)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:80)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at org.apache.spark.sql.types.StructType.map(StructType.scala:99)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:80)
at org.apache.spark.sql.execution.datasources.DataSourceAnalysis$$anonfun$apply$1.applyOrElse(DataSourceStrategy.scala:198)
at org.apache.spark.sql.execution.datasources.DataSourceAnalysis$$anonfun$apply$1.applyOrElse(DataSourceStrategy.scala:136)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1$$anonfun$2.apply(AnalysisHelper.scala:108)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1$$anonfun$2.apply(AnalysisHelper.scala:108)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1.apply(AnalysisHelper.scala:107)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1.apply(AnalysisHelper.scala:106)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsDown(AnalysisHelper.scala:106)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperators(AnalysisHelper.scala:73)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:29)
at org.apache.spark.sql.execution.datasources.DataSourceAnalysis.apply(DataSourceStrategy.scala:136)
at org.apache.spark.sql.execution.datasources.DataSourceAnalysis.apply(DataSourceStrategy.scala:54)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:105)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:102)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:102)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:94)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:94)
at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:136)
at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:130)
at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:102)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$executeAndTrack$1.apply(RuleExecutor.scala:80)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$executeAndTrack$1.apply(RuleExecutor.scala:80)
at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:79)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:114)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:113)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:113)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$analyzed$1.apply(QueryExecution.scala:82)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$analyzed$1.apply(QueryExecution.scala:80)
at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$output$1(QueryExecution.scala:249)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$3.apply(QueryExecution.scala:251)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$3.apply(QueryExecution.scala:251)
at org.apache.spark.sql.execution.QueryExecution.stringOrError(QueryExecution.scala:144)
at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:251)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withCustomExecutionEnv$1.apply(SQLExecution.scala:90)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:228)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:85)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:158)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:690)
at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:339)
at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:325)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:9)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:134)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:136)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:138)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:140)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:142)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:144)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:146)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:148)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:150)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:152)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:154)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:156)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:158)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:160)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:162)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:164)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:166)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:168)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:170)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:172)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:174)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:176)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:178)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:180)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:182)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:184)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:186)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:188)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:190)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:192)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:194)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:196)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:198)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:200)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:202)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:204)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw.<init>(command-1106698602130938:206)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw.<init>(command-1106698602130938:208)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw.<init>(command-1106698602130938:210)
at line694971ab74ea4d1382bf12d864e23292142.$read.<init>(command-1106698602130938:212)
at line694971ab74ea4d1382bf12d864e23292142.$read$.<init>(command-1106698602130938:216)
at line694971ab74ea4d1382bf12d864e23292142.$read$.<clinit>(command-1106698602130938)
at line694971ab74ea4d1382bf12d864e23292142.$eval$.$print$lzycompute(<notebook>:7)
at line694971ab74ea4d1382bf12d864e23292142.$eval$.$print(<notebook>:6)
at line694971ab74ea4d1382bf12d864e23292142.$eval.$print(<notebook>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
at com.databricks.backend.daemon.driver.DriverILoop.execute(DriverILoop.scala:199)
at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply$mcV$sp(ScalaDriverLocal.scala:189)
at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply(ScalaDriverLocal.scala:189)
at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply(ScalaDriverLocal.scala:189)
at com.databricks.backend.daemon.driver.DriverLocal$TrapExitInternal$.trapExit(DriverLocal.scala:587)
at com.databricks.backend.daemon.driver.DriverLocal$TrapExit$.apply(DriverLocal.scala:542)
at com.databricks.backend.daemon.driver.ScalaDriverLocal.repl(ScalaDriverLocal.scala:189)
at com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$7.apply(DriverLocal.scala:324)
at com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$7.apply(DriverLocal.scala:304)
at com.databricks.logging.UsageLogging$$anonfun$withAttributionContext$1.apply(UsageLogging.scala:235)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at com.databricks.logging.UsageLogging$class.withAttributionContext(UsageLogging.scala:230)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:45)
at com.databricks.logging.UsageLogging$class.withAttributionTags(UsageLogging.scala:268)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:45)
at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:304)
at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:589)
at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:589)
at scala.util.Try$.apply(Try.scala:192)
at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:584)
at com.databricks.backend.daemon.driver.DriverWrapper.getCommandOutputAndError(DriverWrapper.scala:475)
at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:542)
at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:381)
at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:328)
at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:215)
at java.lang.Thread.run(Thread.java:748)
Does anyone know how I can fix the problem without changing the df/table schema? Could it be a Spark bug?
Thanks in advance!
The real issue is that your code is trying to insert the boolean value false into the date1 column, which is of DATE type.
The solution to your problem is to move the partition columns so they are the last two columns of the dataframe df.
Spark's .insertInto() method treats the last columns as the partition columns. It does not match columns by name between the dataframe and the table; instead it uses position-based column mapping.
The df below works without changing the table structure:
val df = Seq((false, "", "2019-07-01", "", ""))
  .toDF("boolean1", "text3", "date1", "text1", "text2")
Running SHOW CREATE TABLE table_sample gives:
CREATE TABLE `table_sample` (`boolean1` BOOLEAN, `text3` STRING, `date1` DATE, `text1` STRING, `text2` STRING)
USING ORC
OPTIONS (
`serialization.format` '1'
)
PARTITIONED BY (text1, text2)
So, according to the table structure above and your original dataframe df, the position-based mapping done by Spark will be:
`boolean1` BOOLEAN = ""
`text3` STRING = ""
`date1` DATE = false --> Issue is this mapping
`text1` STRING = ""
`text2` STRING = "2019-07-01"
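An alternative sketch (not from the original answer, reusing the sample column names): keep the original df as it is and just reorder its columns to the table's positional layout at write time, data columns first and the partition columns text1/text2 last:
// Reorder to match the table's physical column order before the
// position-based insertInto; the "2019-07-01" string gets cast to DATE.
df.select("boolean1", "text3", "date1", "text1", "text2")
  .write
  .mode("overwrite")
  .insertInto("table_sample")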
It could also be considered a Spark bug, as arguably it should insert null into the date field instead of throwing the error.
I had this exact error when trying to create the table while there was a pre-existing table with the same name but slightly different schema types. I had to drop the pre-existing table to correct this error.

Incorrect conversion of Date (data type) to TimeStamp (data type) while reading from Oracle DB

We are trying to read data from Oracle tables, and DATE data types are being converted into Timestamp data types.
e.g. the table in Oracle:
desc hr.employees;
Name            Null?    Type
--------------- -------- ------------
EMPLOYEE_ID     NOT NULL NUMBER(6)
FIRST_NAME               VARCHAR2(20)
LAST_NAME       NOT NULL VARCHAR2(25)
EMAIL           NOT NULL VARCHAR2(25)
PHONE_NUMBER             VARCHAR2(20)
HIRE_DATE       NOT NULL DATE
JOB_ID          NOT NULL VARCHAR2(10)
SALARY                   NUMBER(8,2)
COMMISSION_PCT           NUMBER(2,2)
MANAGER_ID               NUMBER(6)
DEPARTMENT_ID            NUMBER(4)
SSN                      VARCHAR2(55)
and the schema read into the DataFrame in Scala:
|-- EMPLOYEE_ID: decimal(6,0) (nullable = false)
|-- FIRST_NAME: string (nullable = true)
|-- LAST_NAME: string (nullable = false)
|-- EMAIL: string (nullable = false)
|-- PHONE_NUMBER: string (nullable = true)
|-- HIRE_DATE: timestamp (nullable = false) (Incorrect data type read here)
|-- JOB_ID: string (nullable = false)
|-- SALARY: decimal(8,2) (nullable = true)
|-- COMMISSION_PCT: decimal(2,2) (nullable = true)
|-- MANAGER_ID: decimal(6,0) (nullable = true)
|-- DEPARTMENT_ID: decimal(4,0) (nullable = true)
|-- SSN: string (nullable = true)
HIRE_DATE is read incorrectly as timestamp; is there a way to correct this?
Data is being read from Oracle on the fly, and the application has no upfront knowledge of the data types, so it can't convert them after they are read.
Analysis:
As per Oracle:
Oracle Database 8i and earlier versions did not support TIMESTAMP
data, but Oracle DATE data used to have a time component as an
extension to the SQL standard. So, Oracle Database 8i and earlier
versions of JDBC drivers mapped oracle.sql.DATE to java.sql.Timestamp
to preserve the time component. Starting with Oracle Database 9.0.1,
TIMESTAMP support was included and 9i JDBC drivers started mapping
oracle.sql.DATE to java.sql.Date. This mapping was incorrect as it
truncated the time component of Oracle DATE data. To overcome this
problem, Oracle Database 11.1 introduced a new flag
mapDateToTimestamp. The default value of this flag is true, which
means that by default the drivers will correctly map oracle.sql.DATE
to java.sql.Timestamp, retaining the time information. If you still
want the incorrect but 10g compatible oracle.sql.DATE to java.sql.Date
mapping, then you can get it by setting the value of
mapDateToTimestamp flag to false.
Ref link is here.
Solution:
So, as instructed by Oracle, set the connection property oracle.jdbc.mapDateToTimestamp to false:
Class.forName("oracle.jdbc.driver.OracleDriver")

val info = new java.util.Properties()
info.put("user", user)
info.put("password", password)
info.put("oracle.jdbc.mapDateToTimestamp", "false")

val jdbcDF = spark.read.jdbc(jdbcURL, tableFullName, info)
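The same flag can also be passed through the option-style reader API; this is only a sketch, under the assumption that Spark forwards JDBC options it does not recognize to the driver as connection properties:
// Equivalent read using DataFrameReader options (hypothetical variable names)
val jdbcDF2 = spark.read
  .format("jdbc")
  .option("url", jdbcURL)
  .option("dbtable", tableFullName)
  .option("user", user)
  .option("password", password)
  .option("oracle.jdbc.mapDateToTimestamp", "false")
  .load()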
Add an Oracle database connector jar that supports the "oracle.jdbc.mapDateToTimestamp" flag, e.g. ojdbc14.jar.
Hope it helps!