I have a table with dates and comments.
dob | comment
---------------------------
1960-12-01 | this is useful
And I want a new column with this type:
value_type = T.StructType(
    [
        T.StructField("extra", T.MapType(T.StringType(), T.StringType(), True), True),
        T.StructField("date", T.StringType(), True),
        T.StructField("from_date", T.StringType(), True),
        T.StructField("to_date", T.StringType(), True),
        T.StructField("value", T.StringType(), True),
    ]
)
I need to:
put the date column (df.dob) into the date field of the struct, and
put df.comment into the extra map of the struct.
Thanks to blackbishop I figured out how to do the first part here, and I tried to use .withField() to update the map, but it throws an error.
I tried:
(df
 .withColumn("new_col",
             F.struct(*[F.lit(None).cast(f.dataType).alias(f.name)
                        for f in value_type.fields]))
 .withColumn("new_col", (F.col("new_col")
                         .withField("date", F.col("dob"))
                         .withField("extra.value", F.col("comment")))))
But I get the following error:
AnalysisException: cannot resolve 'update_fields(update_fields(new_col, WithField(dob), WithField(dob)).extra, WithField(dob))' due to data type mismatch: struct argument should be struct type, got: map<string,string>;
I am confused as to why it does not work with the map inside the struct.
Thanks :)
I figured it out!
(df
 .withColumn("new_col",
             F.struct(*[F.lit(None).cast(f.dataType).alias(f.name)
                        for f in value_type.fields]))
 .withColumn("new_col", (F.col("new_col")
                         .withField("date", F.col("dob"))
                         .withField("extra",
                                    F.create_map(F.lit("my_key"), F.col("comment"))))))
The problem was that I was not actually passing a map to a map type!
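As a side note, the same struct can also be built in a single F.struct call instead of creating null placeholders and patching them with withField afterwards. A minimal sketch, assuming the same df and the value_type field order above (my_key is still just a placeholder key):

from pyspark.sql import functions as F

new_df = df.withColumn(
    "new_col",
    F.struct(
        F.create_map(F.lit("my_key"), F.col("comment")).alias("extra"),  # comment goes into the map
        F.col("dob").cast("string").alias("date"),                       # dob goes into the date field
        F.lit(None).cast("string").alias("from_date"),
        F.lit(None).cast("string").alias("to_date"),
        F.lit(None).cast("string").alias("value"),
    ),
)
new_df.select("new_col").printSchema()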
I want to verify the schema of a Spark dataframe against schema information I get from another source (a dashboard tool). The information I get about the table is the field name and field type (nullability is not important at this point).
However, for DecimalType columns I do not get the information about precision and scale (the two parameters of DecimalType), so I have to ignore these values in the comparison.
I currently rewrite the schema so that the Decimal columns become Float columns, but is there a more elegant way to do that?
Basically, I want to write a function is_schema_valid() that works as follows:
from pyspark.sql import types as T

df_schema = T.StructType([
    T.StructField('column_1', T.StringType(), True),
    T.StructField('column_2', T.DecimalType(20, 5), True),  # values in DecimalType are random
])

schema_info = [('column_1', 'String'), ('column_2', 'Decimal')]

is_schema_valid(schema_info, df_schema)
# Output: True
The best approach would probably be to compare similar objects. You can transform a schema into a JSON object (or a Python dict).
import json

_df_schema_dict = json.loads(df_schema.json())
df_schema_dict = {
    field["name"]: field["type"]
    for field in _df_schema_dict["fields"]
}

df_schema_dict
> {'column_1': 'string', 'column_2': 'decimal(20,5)'}
You can work with this object to compare against schema_info. Here is a very basic test you can do (I changed the content of schema_info a bit):
import json

def is_schema_valid(schema_info, df_schema):
    df_schema_dict = {
        field["name"]: field["type"] for field in json.loads(df_schema.json())["fields"]
    }
    schema_info_dict = {elt[0]: elt[1] for elt in schema_info}
    return schema_info_dict == df_schema_dict

df_schema = T.StructType(
    [
        T.StructField("column_1", T.StringType(), True),
        T.StructField("column_2", T.DecimalType(20, 5), True),
    ]
)

schema_info = [("column_1", "string"), ("column_2", "decimal(20,5)")]

is_schema_valid(schema_info, df_schema)
# True
If you want to ignore decimal precision, you can always tweak the dataframe schema a little. For example, replace field["type"] with field["type"] if "decimal" not in field["type"] else "decimal".
import json

def is_schema_valid(schema_info, df_schema):
    df_schema_dict = {
        field["name"]: field["type"] if "decimal" not in field["type"] else "decimal"
        for field in json.loads(df_schema.json())["fields"]
    }
    schema_info_dict = {elt[0]: elt[1] for elt in schema_info}
    return schema_info_dict == df_schema_dict

df_schema = T.StructType(
    [
        T.StructField("column_1", T.StringType(), True),
        T.StructField("column_2", T.DecimalType(20, 5), True),
    ]
)

schema_info = [("column_1", "string"), ("column_2", "decimal")]

is_schema_valid(schema_info, df_schema)
# True
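As a side note, if the external tool reports type names with different capitalization, as in the original schema_info example ('String', 'Decimal'), you can normalize both sides before comparing. A minimal sketch along the same lines (lower-casing the reported names is an assumption about how the tool formats types):

import json
from pyspark.sql import types as T

def is_schema_valid(schema_info, df_schema):
    # Spark's JSON type names are already lower-case; collapse decimal(p,s) to plain "decimal"
    df_schema_dict = {
        field["name"]: "decimal" if field["type"].startswith("decimal") else field["type"]
        for field in json.loads(df_schema.json())["fields"]
    }
    # lower-case the type names reported by the external tool
    schema_info_dict = {name: type_name.lower() for name, type_name in schema_info}
    return schema_info_dict == df_schema_dict

df_schema = T.StructType([
    T.StructField("column_1", T.StringType(), True),
    T.StructField("column_2", T.DecimalType(20, 5), True),
])

schema_info = [("column_1", "String"), ("column_2", "Decimal")]
is_schema_valid(schema_info, df_schema)
# True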
I'm working with some Yelp data. Here is what a row from the provided .json file looks like:
{"business_id":"f9NumwFMBDn751xgFiRbNA",
"name":"The Range At Lake Norman",
"address":"10913 Bailey Rd",
"city":"Cornelius",
"state":"NC",
"postal_code":"28031",
"latitude":35.4627242,
"longitude":-80.8526119,
"stars":3.5,
"review_count":36,
"is_open":1,
"attributes":{"BusinessAcceptsCreditCards":"True","BikeParking":"True","GoodForKids":"False","BusinessParking":"{'garage': False, 'street': False, 'validated': False, 'lot': True, 'valet': False}",
"ByAppointmentOnly":"False","RestaurantsPriceRange2":"3"},
"categories":"Active Life, Gun\/Rifle Ranges, Guns & Ammo, Shopping",
"hours":{"Monday":"10:0-18:0","Tuesday":"11:0-20:0","Wednesday":"10:0-18:0","Thursday":"11:0-20:0","Friday":"11:0-20:0","Saturday":"11:0-20:0","Sunday":"13:0-18:0"}}
What I'd like to do is access the "BikeParking" attribute within the "attributes" column and filter based on its value. Right now I have something like:
df.filter(functions.explode(df['attributes']).BikeParking == False)
This, however, returns the following error:
"pyspark.sql.utils.AnalysisException: Generators are not supported outside the SELECT clause, but got: 'Filter (explode(attributes#8)[GoodForKids] = false);"
How would I be able to achieve this?
The attributes column is of struct type, and explode only works with array or map types.
Try accessing the struct fields directly instead, i.e. attributes.<field_name>.
Example:
from pyspark.sql.functions import *
from pyspark.sql.types import *
spark.\
read.\
option("multiline","true").\
json("yelp.json").\
filter(col("attributes.BikeParking").cast("boolean")).\
show()
#+---------------+--------------------+--------------------+--------------------+---------+--------------------+-------+----------+-----------+--------------------+-----------+------------+-----+-----+
#| address| attributes| business_id| categories| city| hours|is_open| latitude| longitude| name|postal_code|review_count|stars|state|
#+---------------+--------------------+--------------------+--------------------+---------+--------------------+-------+----------+-----------+--------------------+-----------+------------+-----+-----+
#|10913 Bailey Rd|[True, True, {'ga...|f9NumwFMBDn751xgF...|Active Life, Gun/...|Cornelius|[11:0-20:0, 10:0-...| 1|35.4627242|-80.8526119|The Range At Lake...| 28031| 36| 3.5| NC|
#+---------------+--------------------+--------------------+--------------------+---------+--------------------+-------+----------+-----------+--------------------+-----------+------------+-----+-----+
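To filter on the BikeParking value itself, as in the original question, the same dot notation works. Note that the attribute values in this dataset are strings ("True"/"False"), not booleans, so you either compare against the string or cast first. A minimal sketch, assuming the same yelp.json file as above:

from pyspark.sql.functions import col

df = spark.read.option("multiline", "true").json("yelp.json")

# keep rows where BikeParking is explicitly the string "False"
df.filter(col("attributes.BikeParking") == "False").show()

# or cast the string to boolean and negate it
df.filter(~col("attributes.BikeParking").cast("boolean")).show()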
Assuming data frame 1 represents a target country and its list of source countries, and data frame 2 represents the availability of each country, find all pairs from data frame 1 where the target country's availability is TRUE and the source country's availability is FALSE:
Dataframe 1 (targetId, sourceId):
USA: China, Russia, India, Japan
China: USA, Russia, India
Russia: USA, Japan
Dataframe 2 (id, available):
USA: true
China: false
Russia: true
India: false
Japan: true
Result Dataset should look like:
(USA, China),
(USA, India)
My idea is to first explode data set 1 into a new data frame (say, tempDF), add 2 new columns to it, targetAvailable and sourceAvailable, and finally filter for targetAvailable = true and sourceAvailable = false to get the desired result data frame.
Below is the snippet of my code:
val sourceDF = sourceData.toDF("targetId", "sourceId")
val mappingDF = mappingData.toDF("id", "available")
val tempDF = sourceDF.select(col("targetId"),
explode(col("sourceId")).as("source_id_split"))
val resultDF = tempDF.select("targetId")
.withColumn("targetAvailable", isAvailable(tempDF.col("targetId")))
.withColumn("sourceAvailable", isAvailable(tempDF.col("source_id_split")))
/*resultDF.select("targetId", "sourceId").
filter(col("targetAvailable") === "true" and col("sourceAvailable")
=== "false").show()*/
// udf to find the availability value for the given id from the mapping table
val isAvailable = udf((searchId: String) => {
val rows = mappingDF.select("available")
.filter(col("id") === searchId).collect()
if (rows(0)(0).toString.equals("true")) "true" else "false" })
Calling the isAvailable UDF while calculating resultDF throws a weird exception. Am I doing something wrong? Is there a better / simpler way to do this?
In your UDF you are referencing another dataframe, which is not possible, hence the "weird" exception you get.
You want to filter one dataframe based on values contained in another. What you need is a join on the id columns. Two joins, actually, in your case: one for the targets, one for the sources.
The idea of using explode, however, is very good. Here is a way to achieve what you want:
// generating data, please provide this code next time ;-)
val sourceDF = Seq("USA" -> Seq("China", "Russia", "India", "Japan"),
"China" -> Seq("USA", "Russia", "India"),
"Russia" -> Seq("USA", "Japan"))
.toDF("targetId", "sourceId")
val mappingDF = Seq("USA" -> true, "China" -> false,
"Russia" -> true, "India" -> false,
"Japan" -> true)
.toDF("id", "available")
sourceDF
// we can filter available targets before exploding.
// let's do it to be more efficient.
.join(mappingDF.withColumnRenamed("id", "targetId"), Seq("targetId"))
.where('available)
// exploding the sources
.select('targetId, explode('sourceId) as "sourceId")
// then we keep only non available sources
.join(mappingDF.withColumnRenamed("id", "sourceId"), Seq("sourceId"))
.where(! 'available)
.select("targetId", "sourceId")
.show(false)
which yields
+--------+--------+
|targetId|sourceId|
+--------+--------+
|USA |China |
|USA |India |
+--------+--------+
I am getting the error "Invalid call to qualifier on unresolved object, tree: 'date1" when trying to write a specific Spark DataFrame into a Hive table.
I am using Spark 2.4.0, but also tested in Spark 2.4.3 with the same result.
I know how to avoid the error, but none of these methods is the expected solution, because each of them modifies the table somehow:
Deleting 1 column from PARTITIONED BY in the create table code.
Moving the position of column text3 just after text2.
In the df, creating the boolean1 column as a string type instead of a boolean one (I don't need to change the type in the Hive table).
Deleting 1 of the columns in df and table (df and table must have the same number of columns):
Obviously, deleting the date1 column, as it is the one given by the error message.
Deleting column text3.
Except for deleting the date1 column, none of the other methods makes sense to me. I don't understand why applying any of those options fixes the problem.
This is a Scala sample code in order to reproduce the error:
// Create a sample dataframe
import spark.implicits._
val df = Seq(("",
"",
false,
"",
"2019-07-01"))
.toDF("text1",
"text2",
"boolean1",
"text3",
"date1")
// df schema:
df.printSchema
root
|-- text1: string (nullable = true)
|-- text2: string (nullable = true)
|-- boolean1: boolean (nullable = false)
|-- text3: string (nullable = true)
|-- date1: string (nullable = true)
// Create the related hive table
spark.sql("drop table if exists table_sample")
spark.sql("""CREATE TABLE `table_sample` (
`text1` STRING,
`text2` STRING,
`boolean1` BOOLEAN,
`text3` STRING,
`date1` DATE
)
USING ORC
PARTITIONED BY (text1, text2)
""")
// Write the sample dataframe to the hive table
df.write
.mode("overwrite")
.format("orc")
.insertInto("table_sample")
I expect the write to succeed with no error and without needing to change the table schema (no renaming columns, changing column types, or moving columns to another position).
This is the full stacktrace:
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to qualifier on unresolved object, tree: 'trx_business_day_date
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifier(unresolved.scala:107)
at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq$$anonfun$3.apply(package.scala:155)
at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq$$anonfun$3.apply(package.scala:155)
at scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
at scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
at scala.collection.AbstractTraversable.filter(Traversable.scala:104)
at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.<init>(package.scala:155)
at org.apache.spark.sql.catalyst.expressions.package$.AttributeSeq(package.scala:98)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.outputAttributes$lzycompute(LogicalPlan.scala:93)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.outputAttributes(LogicalPlan.scala:93)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:113)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:81)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:80)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at org.apache.spark.sql.types.StructType.map(StructType.scala:99)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:80)
at org.apache.spark.sql.execution.datasources.DataSourceAnalysis$$anonfun$apply$1.applyOrElse(DataSourceStrategy.scala:198)
at org.apache.spark.sql.execution.datasources.DataSourceAnalysis$$anonfun$apply$1.applyOrElse(DataSourceStrategy.scala:136)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1$$anonfun$2.apply(AnalysisHelper.scala:108)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1$$anonfun$2.apply(AnalysisHelper.scala:108)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1.apply(AnalysisHelper.scala:107)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1.apply(AnalysisHelper.scala:106)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsDown(AnalysisHelper.scala:106)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperators(AnalysisHelper.scala:73)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:29)
at org.apache.spark.sql.execution.datasources.DataSourceAnalysis.apply(DataSourceStrategy.scala:136)
at org.apache.spark.sql.execution.datasources.DataSourceAnalysis.apply(DataSourceStrategy.scala:54)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:105)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:102)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:102)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:94)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:94)
at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:136)
at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:130)
at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:102)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$executeAndTrack$1.apply(RuleExecutor.scala:80)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$executeAndTrack$1.apply(RuleExecutor.scala:80)
at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:79)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:114)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:113)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:113)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$analyzed$1.apply(QueryExecution.scala:82)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$analyzed$1.apply(QueryExecution.scala:80)
at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$output$1(QueryExecution.scala:249)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$3.apply(QueryExecution.scala:251)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$3.apply(QueryExecution.scala:251)
at org.apache.spark.sql.execution.QueryExecution.stringOrError(QueryExecution.scala:144)
at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:251)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withCustomExecutionEnv$1.apply(SQLExecution.scala:90)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:228)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:85)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:158)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:690)
at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:339)
at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:325)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:9)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:134)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:136)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:138)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:140)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:142)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:144)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:146)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:148)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:150)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:152)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:154)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:156)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:158)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:160)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:162)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:164)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:166)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:168)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:170)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:172)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:174)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:176)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:178)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:180)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:182)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:184)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:186)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:188)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:190)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:192)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:194)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:196)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:198)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:200)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:202)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw$$iw.<init>(command-1106698602130938:204)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw$$iw.<init>(command-1106698602130938:206)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw$$iw.<init>(command-1106698602130938:208)
at line694971ab74ea4d1382bf12d864e23292142.$read$$iw.<init>(command-1106698602130938:210)
at line694971ab74ea4d1382bf12d864e23292142.$read.<init>(command-1106698602130938:212)
at line694971ab74ea4d1382bf12d864e23292142.$read$.<init>(command-1106698602130938:216)
at line694971ab74ea4d1382bf12d864e23292142.$read$.<clinit>(command-1106698602130938)
at line694971ab74ea4d1382bf12d864e23292142.$eval$.$print$lzycompute(<notebook>:7)
at line694971ab74ea4d1382bf12d864e23292142.$eval$.$print(<notebook>:6)
at line694971ab74ea4d1382bf12d864e23292142.$eval.$print(<notebook>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
at com.databricks.backend.daemon.driver.DriverILoop.execute(DriverILoop.scala:199)
at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply$mcV$sp(ScalaDriverLocal.scala:189)
at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply(ScalaDriverLocal.scala:189)
at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply(ScalaDriverLocal.scala:189)
at com.databricks.backend.daemon.driver.DriverLocal$TrapExitInternal$.trapExit(DriverLocal.scala:587)
at com.databricks.backend.daemon.driver.DriverLocal$TrapExit$.apply(DriverLocal.scala:542)
at com.databricks.backend.daemon.driver.ScalaDriverLocal.repl(ScalaDriverLocal.scala:189)
at com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$7.apply(DriverLocal.scala:324)
at com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$7.apply(DriverLocal.scala:304)
at com.databricks.logging.UsageLogging$$anonfun$withAttributionContext$1.apply(UsageLogging.scala:235)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at com.databricks.logging.UsageLogging$class.withAttributionContext(UsageLogging.scala:230)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:45)
at com.databricks.logging.UsageLogging$class.withAttributionTags(UsageLogging.scala:268)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:45)
at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:304)
at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:589)
at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:589)
at scala.util.Try$.apply(Try.scala:192)
at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:584)
at com.databricks.backend.daemon.driver.DriverWrapper.getCommandOutputAndError(DriverWrapper.scala:475)
at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:542)
at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:381)
at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:328)
at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:215)
at java.lang.Thread.run(Thread.java:748)
Does anyone know how I can fix the problem without changing the df/table schema? Could it be a Spark bug?
Thanks in advance!
The real issue is that your code is trying to insert the boolean value 'false' into the date1 column, where date1 is of 'Date' type.
The solution to your problem is to move the partition columns so that they are the last two columns of the dataframe df.
Spark's .insertInto() method treats the last columns as the partition columns. It does not match on the dataframe's column names; instead it uses index-based (positional) column mapping.
The df below works perfectly without changing the table structure.
val df = Seq((false, "", "2019-07-01", "", ""))
  .toDF("boolean1", "text3", "date1", "text1", "text2")
Query: "Show create table table_sample" resulted as
CREATE TABLE `table_sample` (`boolean1` BOOLEAN, `text3` STRING, `date1` DATE, `text1` STRING, `text2` STRING)
USING ORC
OPTIONS (
`serialization.format` '1'
)
PARTITIONED BY (text1, text2)
So, according to the above table structure and your dataframe df, the mapping done by Spark will be:
`boolean1` BOOLEAN = ""
`text3` STRING = ""
`date1` DATE = false --> Issue is this mapping
`text1` STRING = ""
`text2` STRING = "2019-07-01"
It could also be a Spark bug, as it should arguably insert null into the date field instead of throwing the error.
I had this exact error when I was trying to create a table while there was a pre-existing table with the same name but slightly different schema types. I had to drop the pre-existing table to correct the error.
I have this simple code:
var count = event_stream
.groupBy("value").count()
event_stream.join(count,"value").printSchema() //get error on this line
The event_stream and count schemas are as follows:
root
|-- key: binary (nullable = true)
|-- value: binary (nullable = true)
|-- topic: string (nullable = true)
|-- partition: integer (nullable = true)
|-- offset: long (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- timestampType: integer (nullable = true)
root
|-- value: binary (nullable = true)
|-- count: long (nullable = false)
Two questions:
Why do I get this error, and how do I fix it?
Why does groupBy.count drop all other columns?
The error is as follows:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Failure when resolving conflicting references in Join:
'Join Inner
:- AnalysisBarrier
: +- StreamingRelationV2 org.apache.spark.sql.kafka010.KafkaSourceProvider#7f2c57fe, kafka, Map(startingOffsets -> latest, failOnDataLoss -> false, subscribe -> events-identification-carrier, kafka.bootstrap.servers -> svc-kafka-pre-c1-01.jamba.net:9092), [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13], StreamingRelation DataSource(org.apache.spark.sql.SparkSession#3dbd7107,kafka,List(),None,List(),None,Map(startingOffsets -> latest, failOnDataLoss -> false, subscribe -> events-identification-carrier, kafka.bootstrap.servers -> svc-kafka-pre-c1-01.jamba.net:9092),None), kafka, [key#0, value#1, topic#2, partition#3, offset#4L, timestamp#5, timestampType#6]
+- AnalysisBarrier
+- Aggregate [value#8], [value#8, count(1) AS count#46L]
+- StreamingRelationV2 org.apache.spark.sql.kafka010.KafkaSourceProvider#7f2c57fe, kafka, Map(startingOffsets -> latest, failOnDataLoss -> false, subscribe -> events-identification-carrier, kafka.bootstrap.servers -> svc-kafka-pre-c1-01.jamba.net:9092), [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13], StreamingRelation DataSource(org.apache.spark.sql.SparkSession#3dbd7107,kafka,List(),None,List(),None,Map(startingOffsets -> latest, failOnDataLoss -> false, subscribe -> events-identification-carrier, kafka.bootstrap.servers -> svc-kafka-pre-c1-01.jamba.net:9092),None), kafka, [key#0, value#1, topic#2, partition#3, offset#4L, timestamp#5, timestampType#6]
Conflicting attributes: value#8
EDIT: yes! Changing the name of the columns works.
But now, if I use the join, I have to use OutputMode.Append, and for that I need to add watermarks to the stream.
What I want is to extract the count and the topic (from the schema printed above) in the resulting DF and write that to some sink.
Two questions:
Is there any other / better way to do this?
Can I do multiple aggs like count() and then also add another column of String type, i.e. topic in this case?
Why do I get this error, and how do I fix it?
I think you are getting the error because the final joined schema contains two value fields, one from each side of the join. To fix this, rename the "value" field on one of the two joined streams, like this:
var count = event_stream.
groupBy("value").count().
withColumnRenamed("value", "join_id")
event_stream.join(count, $"value" === $"join_id").
drop("join_id").
printSchema()
Why does groupBy.count drop all other columns?
groupBy operations basically divide your fields into two lists: a list of fields to use as the key, and a list of fields to aggregate. The key fields show up as-is in the final result, but any field not in the key list needs an aggregate operation defined for it to appear in the result. Otherwise Spark has no way of knowing how you want to combine multiple values of that field. Did you want to just count it? Did you want the max value? Did you want to see all distinct values? To specify how to roll up a field, you define it in an .agg(..) call.
Example:
val input = Seq(
  (1, "Bob", 4),
  (1, "John", 5)
).toDF("key", "name", "number")

input.groupBy("key").
  agg(collect_set("name") as "names",
      max("number") as "maxnum").
  show

+---+-----------+------+
|key|      names|maxnum|
+---+-----------+------+
|  1|[Bob, John]|     5|
+---+-----------+------+
The reason for the error is the column name used for joining.
You can use an operation like this instead:
var count = event_stream
  .groupBy("value").count()

event_stream.join(count, Seq("value"))