I have the following code that constructs a VectorAssembler:
val allColsExceptOceanProximity: Array[String] = dfRaw.drop("ocean_proximity").columns
val assembler = new VectorAssembler()
//.setInputCols(Array("longitude", "latitude", "housing_median_age", "total_rooms", "population", "households", "population", "median_income", "median_house_value"))
.setInputCols(allColsExceptOceanProximity)
.setOutputCol("features")
If I use the Array with the columns listed statically, it works. When I pass the Array in dynamically, it fails with the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1763.0 failed 1 times, most recent failure: Lost task 2.0 in stage 1763.0 (TID 100292) (192.168.0.35 executor driver): org.apache.spark.SparkException: Failed to execute user defined function(VectorAssembler$$Lambda$4540/1415205968: (struct<longitude:double,latitude:double,housing_median_age:double,total_rooms:double,total_bedrooms:double,population:double,households:double,median_income:double,median_house_value:double>) => struct<type:tinyint,size:int,indices:array<int>,values:array<double>>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithKeysOutput_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator.foreach(Iterator.scala:941)
Why is this? Any ideas? Is it expecting a varargs type? I'm on my notebook and do not have the benefit of an IDE.
EDIT: Added the schema:
root
|-- longitude: double (nullable = true)
|-- latitude: double (nullable = true)
|-- housing_median_age: double (nullable = true)
|-- total_rooms: double (nullable = true)
|-- total_bedrooms: double (nullable = true)
|-- population: double (nullable = true)
|-- households: double (nullable = true)
|-- median_income: double (nullable = true)
|-- median_house_value: double (nullable = true)
|-- ocean_proximity: string (nullable = true)
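One difference between the two arrays worth noting: the commented-out static list omits total_bedrooms, while the dynamically derived list includes it. A minimal diagnostic sketch (an assumption, not a confirmed cause) is to check that column, and the others, for nulls, which VectorAssembler cannot handle by default:
// Count nulls per candidate feature column.
allColsExceptOceanProximity.foreach { c =>
  println(s"$c: " + dfRaw.filter(dfRaw(c).isNull).count() + " nulls")
}
// If a column (e.g. total_bedrooms) does contain nulls, either drop those rows...
val dfClean = dfRaw.na.drop(allColsExceptOceanProximity)
// ...or, on Spark 2.4+, let the assembler skip invalid rows.
val safeAssembler = new VectorAssembler()
  .setInputCols(allColsExceptOceanProximity)
  .setOutputCol("features")
  .setHandleInvalid("skip")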
I am trying to read data from Elasticsearch via Spark Scala:
Scala 2.11.8, Spark 2.3.0, Elasticsearch 5.6.8
To connect: spark2-shell --jars elasticsearch-spark-20_2.11-5.6.8.jar
val df = spark.read.format("org.elasticsearch.spark.sql")
  .option("es.nodes", "xxxxxxx")
  .option("es.port", "xxxx")
  .option("es.net.http.auth.user", "xxxxx")
  .option("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .option("es.net.http.auth.pass", "xxxxxx")
  .option("es.net.ssl", "true")
  .option("es.nodes.wan.only", "true")
  .option("es.net.ssl.cert.allow.self.signed", "true")
  .option("es.net.ssl.truststore.location", "xxxxx")
  .option("es.net.ssl.truststore.pass", "xxxxx")
  .option("es.read.field.as.array.include", "true")
  .option("pushdown", "true")
  .option("es.read.field.as.array.include", "a4,a4.a41,a4.a42,a4.a43,a4.a43.a431,a4.a43.a432,a4.a44,a4.a45")
  .load("<index_name>")
Schema as below
|-- a1: string (nullable = true)
|-- a2: string (nullable = true)
|-- a3: struct (nullable = true)
| |-- a31: integer (nullable = true)
| |-- a32: struct (nullable = true)
|-- a4: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a41: string (nullable = true)
| | |-- a42: string (nullable = true)
| | |-- a43: struct (nullable = true)
| | | |-- a431: string (nullable = true)
| | | |-- a432: string (nullable = true)
| | |-- a44: string (nullable = true)
| | |-- a45: string (nullable = true)
|-- a8: string (nullable = true)
|-- a9: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a91: string (nullable = true)
| | |-- a92: string (nullable = true)
|-- a10: string (nullable = true)
|-- a11: timestamp (nullable = true)
I am able to read data from the direct columns and from the level-1 nested fields (i.e. the a9 or a3 columns) via this command:
df.select(explode($"a9").as("exploded")).select("exploded.*").show
However, the problem occurs when I try to read the a4 elements, as it throws the error below:
[Stage 18:> (0 + 1) / 1]20/02/28 02:43:23 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 18.0 (TID 54, xxxxxxx, executor 12): scala.MatchError: Buffer() (of class scala.collection.convert.Wrappers$JListWrapper)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:276)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:275)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:241)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:231)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$2.apply(CatalystTypeConverters.scala:164)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:164)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:154)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:379)
at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:60)
at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:57)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:836)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:836)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:381)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
20/02/28 02:43:23 ERROR scheduler.TaskSetManager: Task 0 in stage 18.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 18.0 failed 4 times, most recent failure: Lost task 0.3 in stage 18.0 (TID 57, xxxxxxx, executor 12): scala.MatchError: Buffer() (of class scala.collection.convert.Wrappers$JListWrapper)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:276)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:275)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:241)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:231)
Is there anything I am doing wrong, or any steps I am missing? Please help.
Off the top of my head, this error occurs when the schema guessed by the Spark/Elasticsearch connector is not actually compatible with the data being read.
Keep in mind that ES is schemaless and Spark SQL has a "hard" schema. Bridging this gap is not always possible, so it's all just a best effort.
When connecting the two, the connector samples the documents and tries to guess a schema: "field A is a string, field B is an object structure with two subfields: B.1 being a date, and B.2 being an array of strings, ... whatever".
If it guessed wrong (typically: a given column or subcolumn is guessed as a string, but in some documents it is in fact an array or a number), then the JSON to Spark SQL conversion emits these kinds of errors.
In the words of the documentation:
Elasticsearch treats fields with single or multi-values the same; in fact, the mapping provides no information about this. As a client, it means one cannot tell whether a field is single-valued or not until is actually being read. In most cases this is not an issue and elasticsearch-hadoop automatically creates the necessary list/array on the fly. However in environments with strict schema such as Spark SQL, changing a field actual value from its declared type is not allowed. Worse yet, this information needs to be available even before reading the data. Since the mapping is not conclusive enough, elasticsearch-hadoop allows the user to specify the extra information through field information, specifically es.read.field.as.array.include and es.read.field.as.array.exclude.
So I'd advise you to check whether the schema you reported in your question (the schema guessed by Spark) is actually valid against all your documents.
If it's not, you have a few options going forward:
Correct the mapping individually. If the problem is linked to an array type not being recognized as such, you can use configuration options: the es.read.field.as.array.include (resp. .exclude) option actively tells Spark which properties in the documents are arrays (resp. not arrays). If a field is unused, es.read.field.exclude will exclude it from Spark altogether, bypassing possible schema issues for it (see the sketch after this list).
If there is no way to provide a schema that is valid for all documents (e.g. some field is sometimes a number, sometimes a string, and there is no way to tell), then basically you're stuck going back to the RDD level (and, if need be, going back to a Dataset/DataFrame once the schema is well defined).
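As an illustration, a minimal sketch of the first option, assuming (hypothetically) that a4 and a4.a43 are the properties that are sometimes multi-valued and that some unused field should be dropped; adjust the field lists to your actual mapping:
val df = spark.read.format("org.elasticsearch.spark.sql")
  // ... same connection/auth options as in the question ...
  // Fields listed here are always exposed as arrays, even for single-valued documents.
  .option("es.read.field.as.array.include", "a4,a4.a43")
  // Hypothetical: drop an unused field entirely to bypass its schema issues.
  .option("es.read.field.exclude", "some_unused_field")
  .load("<index_name>")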
I am trying to apply the k-means algorithm.
Code
val dfJoin_products_items = df_products.join(df_items, "product_id")
dfJoin_products_items.createGlobalTempView("products_items")
val weightFreight = spark.sql("SELECT cast(product_weight_g as double) weight, cast(freight_value as double) freight FROM global_temp.products_items")
case class Rows(weight:Double, freight:Double)
val rows = weightFreight.as[Rows]
val assembler = new VectorAssembler().setInputCols(Array("weight", "freight")).setOutputCol("features")
val data = assembler.transform(rows)
val kmeans = new KMeans().setK(4)
val model = kmeans.fit(data)
Values
dfJoin_products_items
scala> dfJoin_products_items.printSchema
root
|-- product_id: string (nullable = true)
|-- product_category_name: string (nullable = true)
|-- product_name_lenght: string (nullable = true)
|-- product_description_lenght: string (nullable = true)
|-- product_photos_qty: string (nullable = true)
|-- product_weight_g: string (nullable = true)
|-- product_length_cm: string (nullable = true)
|-- product_height_cm: string (nullable = true)
|-- product_width_cm: string (nullable = true)
|-- order_id: string (nullable = true)
|-- order_item_id: string (nullable = true)
|-- seller_id: string (nullable = true)
|-- shipping_limit_date: string (nullable = true)
|-- price: string (nullable = true)
|-- freight_value: string (nullable = true)
weightFreight
scala> weightFreight.printSchema
root
|-- weight: double (nullable = true)
|-- freight: double (nullable = true)
Error
2019-02-03 20:51:41 WARN BlockManager:66 - Putting block rdd_126_1 failed due to exception org.apache.spark.SparkException: Failed to execute user defined function($anonfun$4: (struct<weight:double,freight:double>) => struct<type:tinyint,size:int,indices:array<int>,values:array<double>>).
2019-02-03 20:51:41 WARN BlockManager:66 - Block rdd_126_1 could not be removed as it was not found on disk or in memory
2019-02-03 20:51:41 WARN BlockManager:66 - Putting block rdd_126_2 failed due to exception org.apache.spark.SparkException: Failed to execute user defined function($anonfun$4: (struct<weight:double,freight:double>) => struct<type:tinyint,size:int,indices:array<int>,values:array<double>>).
2019-02-03 20:51:41 ERROR Executor:91 - Exception in task 1.0 in stage 16.0 (TID 23)
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$4: (struct<weight:double,freight:double>) => struct<type:tinyint,size:int,indices:array<int>,values:array<double>>)
I don't understand this error; can someone explain it to me, please?
Thanks a lot!
UPDATE 1: Full stack trace
The stack trace is huge, so you can find it here: https://pastebin.com/PhmZPtDk
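The failing function in the log is the VectorAssembler UDF, not k-means itself. A hedged sketch of one common fix, assuming the error comes from null values produced when product_weight_g or freight_value cannot be cast to double:
// Rows whose strings failed to cast become null, and VectorAssembler
// cannot assemble null inputs by default.
val cleaned = weightFreight.na.drop(Array("weight", "freight"))
val assembler = new VectorAssembler()
  .setInputCols(Array("weight", "freight"))
  .setOutputCol("features")
// On Spark 2.4+, .setHandleInvalid("skip") on the assembler is an alternative.
val data = assembler.transform(cleaned)
val model = new KMeans().setK(4).fit(data)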
I am getting a ClassCastException in Spark 2 when dealing with a table having complex datatype columns such as arrays of structs and arrays of strings.
The action I tried is a simple one: count.
df=spark.sql("select * from <tablename>")
df.count
but getting below Error when running the spark application
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 5, sandbox.hortonworks.com, executor 1): java.lang.ClassCastException: java.util.ArrayList cannot be cast to org.apache.hadoop.io.Text
at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41)
at org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$23.apply(HiveInspectors.scala:529)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:435)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
The weird thing is that the same action on the dataframe works fine in spark-shell.
The table has the following complex columns:
|-- sku_product: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- sku_id: string (nullable = true)
| | |-- qty: string (nullable = true)
| | |-- price: string (nullable = true)
| | |-- display_name: string (nullable = true)
| | |-- sku_displ_clr_desc: string (nullable = true)
| | |-- sku_sz_desc: string (nullable = true)
| | |-- parent_product_id: string (nullable = true)
| | |-- delivery_mthd: string (nullable = true)
| | |-- pick_up_store_id: string (nullable = true)
| | |-- delivery: string (nullable = true)
|-- hitid_low: string (nullable = true)
|-- evar7: array (nullable = true)
| |-- element: string (containsNull = true)
|-- hitid_high: string (nullable = true)
|-- evar60: array (nullable = true)
| |-- element: string (containsNull = true)
Let me know if any further information is needed.
I had a similar problem. I was using Spark 2.1 with Parquet files.
I discovered that one of the Parquet files had a different schema than the others, so when I tried to read them all at once, I got a cast error.
To resolve it, I just checked the files one by one.
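A minimal sketch of that file-by-file check, assuming a hypothetical local directory of Parquet files; reading and counting each file individually surfaces the one with the mismatched schema:
// Hypothetical path; list the Parquet files and probe them one at a time.
val files = new java.io.File("/data/parquet_dir")
  .listFiles()
  .filter(_.getName.endsWith(".parquet"))
  .map(_.getPath)
files.foreach { path =>
  try {
    val part = spark.read.parquet(path)
    part.printSchema()   // compare schemas across files
    part.count()         // a full read fails on the offending file
  } catch {
    case e: Exception => println(s"Problem reading $path: ${e.getMessage}")
  }
}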
I'm trying to query two tables from Cassandra into two dataframes, then join these two dataframes into one dataframe (result).
I get the correct result and the Spark job finishes normally in Eclipse on my computer.
But when I submit it to the Spark server (local mode), the job just hangs without any exception or error message and doesn't finish even after an hour, until I press Ctrl+C to stop it.
I have no idea why the job cannot work on the Spark server, or what the difference is between Eclipse and the Spark server. If the reason is an OutOfMemory problem, is it possible that Spark wouldn't throw any exception and would just hang?
Any advice?
Thanks~
Submit command
/usr/bin/spark-submit --class com.test.c2c --jars file:///home/iotcloud/Documents/grace/spark/spark-cassandra-connector-1.6.3-s_2.10.jar file:///home/iotcloud/Documents/grace/spark/C2C_1205.jar
Here is my scala code:
package com.test
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import com.datastax.spark.connector.cql._;
import com.datastax.spark.connector._;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql._;
import org.apache.spark.sql.cassandra._;
object c2c {
  def main(args: Array[String]) {
    println("Start...")
    val conf = new SparkConf(true)
      .set("spark.cassandra.connection.host", "10.2.1.67")
      .setAppName("ConnectToCassandra")
      .setMaster("local")
    val sc = new SparkContext(conf)
    println("Cassandra setting done...")
    println("================================================1")
    println("Start to save to cassandra...")
    val cc = new CassandraSQLContext(sc)
    cc.setKeyspace("iot_test")
    val df_info = cc.sql("select gatewaymac,sensormac,sensorid,sensorfrequency,status from tsensor_info where gatewaymac != 'failed'")
    val df_loc = cc.sql("select sensorlocationid,sensorlocationname,company,plant,department,building,floor,sensorid from tsensorlocation_info where sensorid != 'NULL'")
    println("================================================2")
    println("registerTmepTable...")
    df_info.registerTempTable("info")
    df_loc.registerTempTable("loc")
    println("================================================4")
    println("mapping table...")
    println("===info===")
    df_info.printSchema()
    df_info.take(5).foreach(println)
    println("===location===")
    df_loc.printSchema()
    df_loc.take(5).foreach(println)
    println("================================================5")
    println("print mapping result")
    val result = df_info.join(df_loc, "sensorid")
    result.registerTempTable("ref")
    result.printSchema()
    result.take(5).foreach(println)
    println("====Finish====")
    sc.stop()
  }
}
Normal result on Eclipse
Cassandra setting done...
================================================1
Start to save to cassandra...
================================================2
registerTmepTable...
================================================4
mapping table...
===info===
root
|-- gatewaymac: string (nullable = true)
|-- sensormac: string (nullable = true)
|-- sensorid: string (nullable = true)
|-- sensorfrequency: string (nullable = true)
|-- status: string (nullable = true)
[0000aaaaaaaaat7f,e9d050f0ebc25000 ,0000aaaaaaaaat7f3242,null,N]
[000000000000b219,c879b4f921c25000 ,000000000000b2193590,00:01,N]
[0000aaaaaaaaaabb,2c153cf9f0c25000 ,0000aaaaaaaaaabba353,null,Y]
[000000000000a412,17da712795c25000 ,000000000000a4126156,00:05,Y]
[000000000000a104,b2a4b8b7a6c25000 ,000000000000a1046340,00:01,N]
===location===
root
|-- sensorlocationid: string (nullable = true)
|-- sensorlocationname: string (nullable = true)
|-- company: string (nullable = true)
|-- plant: string (nullable = true)
|-- department: string (nullable = true)
|-- building: string (nullable = true)
|-- floor: string (nullable = true)
|-- sensorid: string (nullable = true)
[JA092,A1F-G-L00-S066,IAC,IACJ,MT,A,1,000000000000a108a19f]
[JA044,A2F-I-L00-S037,IAC,IACJ,MT,A,2,000000000000a2024246]
[JA111,A2F-C-L00-S076,IAC,IACJ,MPA,A,2,000000000000a210c710]
[PA041,A1F-SMT-S03,IAC,IACP,SMT,A,1,000000000000a10354c1]
[PC010,C3F-IQC-S03,IAC,IACP,IQC,C,3,000000000000c3269786]
================================================5
print mapping result
root
|-- sensorid: string (nullable = true)
|-- gatewaymac: string (nullable = true)
|-- sensormac: string (nullable = true)
|-- sensorfrequency: string (nullable = true)
|-- status: string (nullable = true)
|-- sensorlocationid: string (nullable = true)
|-- sensorlocationname: string (nullable = true)
|-- company: string (nullable = true)
|-- plant: string (nullable = true)
|-- department: string (nullable = true)
|-- building: string (nullable = true)
|-- floor: string (nullable = true)
[000000000000a10275bc,000000000000a102,e85ce9b9d2c25000 ,00:05,Y,PA030,A1F-WM-S02,IAC,IACP,WM,A,1]
[000000000000b117160c,000000000000b117,33915a79e5c25000 ,00:05,Y,PB011,B1F-WM-S01,IAC,IACP,WM,B,1]
[000000000000a309024b,000000000000a309,afdab2efbbc25000 ,00:00,N,PA101,A3F-MP6-R01,IAC,IACP,MP6,A,3]
[000000000000c6294109,000000000000c629,383cca8e45c25000 ,00:05,Y,PC017,C6F-WM-S01,IAC,IACP,WM,C,6]
[000000000000a205e52e,000000000000a205,8d83303cf4c25000 ,00:00,N,PA063,A2F-MP6-R04,IAC,IACP,MP6,A,2]
====Finish====
I finally found the answer: I forgot to set the master to the Spark standalone cluster.
I had submitted the Spark job to the local Spark server.
After setting the master to the Spark standalone cluster, the job works fine. Maybe it's because the local Spark server doesn't have enough cores to execute the task. (It's an old machine with 2 cores. By the way, the Spark standalone cluster has 4 nodes, and they are all old machines.)
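For reference, a hedged sketch of the change, with a hypothetical standalone master URL (spark://master-host:7077). The master can be set either in code or on the spark-submit command line (if both are set, the value in code wins):
// In code: point the SparkConf at the standalone master instead of "local".
val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "10.2.1.67")
  .setAppName("ConnectToCassandra")
  .setMaster("spark://master-host:7077") // hypothetical master URL
Or on the command line, leaving setMaster out of the code:
/usr/bin/spark-submit --class com.test.c2c --master spark://master-host:7077 --jars file:///home/iotcloud/Documents/grace/spark/spark-cassandra-connector-1.6.3-s_2.10.jar file:///home/iotcloud/Documents/grace/spark/C2C_1205.jar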