I have the following code that constructs a VectorAssembler:
val allColsExceptOceanProximity: Array[String] = dfRaw.drop("ocean_proximity").columns
val assembler = new VectorAssembler()
//.setInputCols(Array("longitude", "latitude", "housing_median_age", "total_rooms", "population", "households", "population", "median_income", "median_house_value"))
.setInputCols(allColsExceptOceanProximity)
.setOutputCol("features")
If I use the Array with the columns statically typed, it works. When I pass in the Array built dynamically, it fails with the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1763.0 failed 1 times, most recent failure: Lost task 2.0 in stage 1763.0 (TID 100292) (192.168.0.35 executor driver): org.apache.spark.SparkException: Failed to execute user defined function(VectorAssembler$$Lambda$4540/1415205968: (struct<longitude:double,latitude:double,housing_median_age:double,total_rooms:double,total_bedrooms:double,population:double,households:double,median_income:double,median_house_value:double>) => struct<type:tinyint,size:int,indices:array<int>,values:array<double>>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithKeysOutput_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator.foreach(Iterator.scala:941)
Why is this? Any ideas? Is it expecting a varargs type? I'm working in a notebook and do not have the benefits of an IDE.
EDIT: Added the schema:
root
|-- longitude: double (nullable = true)
|-- latitude: double (nullable = true)
|-- housing_median_age: double (nullable = true)
|-- total_rooms: double (nullable = true)
|-- total_bedrooms: double (nullable = true)
|-- population: double (nullable = true)
|-- households: double (nullable = true)
|-- median_income: double (nullable = true)
|-- median_house_value: double (nullable = true)
|-- ocean_proximity: string (nullable = true)
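One thing worth checking, as an assumption only because the root-cause line of the stack trace is not shown: the commented-out static list does not contain total_bedrooms, while the dynamic list does (it appears in the struct in the error message). VectorAssembler rejects null values by default, so nulls in that one column would explain why only the dynamic version fails. A minimal sketch, assuming Spark 2.4 or later where setHandleInvalid is available on VectorAssembler; the object and method names are made up for illustration:

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object AssembleFeatures {
  // Number of null values per column, to confirm or rule out the suspicion.
  def nullCounts(df: DataFrame, cols: Array[String]): Map[String, Long] =
    cols.map(c => c -> df.filter(col(c).isNull).count()).toMap

  // Same assembler as in the question, but rows containing nulls are skipped
  // instead of aborting the job ("keep" would emit NaN values instead).
  def assemble(dfRaw: DataFrame): DataFrame = {
    val inputCols = dfRaw.drop("ocean_proximity").columns
    new VectorAssembler()
      .setInputCols(inputCols)
      .setOutputCol("features")
      .setHandleInvalid("skip")
      .transform(dfRaw)
  }
}
```

If nullCounts reports zero everywhere, this is not the cause and the full "Caused by" section of the trace would be needed.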
I am trying to use SVMWithSGD to train my model, but I get a ClassCastException when the training data is accessed.
My train_data dataframe schema looks like this:
train_data.printSchema
root
|-- label: string (nullable = true)
|-- features: vector (nullable = true)
|-- label_index: double (nullable = false)
I created a LabeledPoint RDD to use with SVMWithSGD:
val targetInd = train_data.columns.indexOf("label_index")
val featInd = Array("features").map(train_data.columns.indexOf(_))
val train_lp = train_data.rdd.map(r => LabeledPoint( r.getDouble(targetInd),
Vectors.dense(featInd.map(r.getDouble(_)).toArray)))
But when I call
SVMWithSGD.train(train_lp, numIterations)
it gives me:
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1364)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.take(RDD.scala:1337)
at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1378)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.first(RDD.scala:1377)
at org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.generateInitialWeights(GeneralizedLinearAlgorithm.scala:204)
at org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:234)
at org.apache.spark.mllib.classification.SVMWithSGD$.train(SVM.scala:217)
at org.apache.spark.mllib.classification.SVMWithSGD$.train(SVM.scala:255)
... 55 elided
Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to org.apache.spark.mllib.linalg.Vector
My train_data was created based on label (file_name) and features (json file representing images features).
Try this.
Schema:
train_data.printSchema
root
|-- label: string (nullable = true)
|-- features: vector (nullable = true)
|-- label_index: double (nullable = false)
Modify your code as follows:
val train_lp = train_data.rdd.map(r => LabeledPoint(r.getAs("label_index"), r.getAs("features")))
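If the features column was produced by the spark.ml API, its values are org.apache.spark.ml.linalg vectors, while the RDD-based LabeledPoint and SVMWithSGD expect org.apache.spark.mllib.linalg vectors. Whether that applies here is an assumption, since the printed schema does not make it obvious which of the two it is. A sketch with explicit types that handles both cases, assuming Spark 2.0 or later for Vectors.fromML; the object name ToLabeledPoints is made up for illustration:

```scala
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.linalg.{Vector => OldVector, Vectors => OldVectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

object ToLabeledPoints {
  def apply(trainData: DataFrame): RDD[LabeledPoint] =
    trainData.rdd.map { r =>
      val label = r.getAs[Double]("label_index")
      // Read the column untyped first, then convert depending on what it holds.
      val features: OldVector = r.getAs[Any]("features") match {
        case v: MLVector  => OldVectors.fromML(v) // spark.ml vector -> mllib vector
        case v: OldVector => v                    // already an mllib vector
        case other =>
          throw new IllegalArgumentException(s"Unexpected features value: $other")
      }
      LabeledPoint(label, features)
    }
}
```

The resulting RDD can then be passed to SVMWithSGD.train in place of train_lp.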
I have a dataframe with this schema:
root
|-- QUERY: string (nullable = true)
|-- TYPE: string (nullable = true)
|-- DEVICE: string (nullable = true)
|-- PURCHASE_UNITS_SUM: double (nullable = true)
|-- CLICK_SUM: decimal(38,18) (nullable = true)
|-- IMPRESSION_COUNT: long (nullable = false)
|-- CLICK_THROUGH_RATE: decimal(38,2) (nullable = true)
|-- PURCHASE_RATE: double (nullable = true)
I am trying to convert some columns to maps (device -> column value):
val result = df.withColumn("CLICK_THROUGH_RATE_MAP",
map(col("DEVICE"), col("CLICK_THROUGH_RATE")))
.withColumn("PURCHASE_RATE_MAP",
map(col("DEVICE"), col("PURCHASE_RATE")))
.withColumn("PURCHASE_SUM_MAP",
map(col("DEVICE"), col("PURCHASE_UNITS_SUM")))
.withColumn("CLICK_SUM_MAP",
map(col("DEVICE"), col("CLICK_SUM")))
.withColumn("IMPRESSION_SUM_MAP",
map(col("DEVICE"), col("IMPRESSION_COUNT")))
.groupBy("QUERY", "TYPE")
.agg(collect_list("CLICK_THROUGH_RATE_MAP"),
collect_list("PURCHASE_RATE_MAP"),
collect_list("PURCHASE_SUM_MAP"),
collect_list("CLICK_SUM_MAP"),
collect_list("IMPRESSION_SUM_MAP"))
.as[(String, String,
Seq[Map[String, Double]],
Seq[Map[String, Double]],
Seq[Map[String, Double]],
Seq[Map[String, Double]],
Seq[Map[String, Double]])]
.map {
// `type` is a reserved word in Scala, so the binding is named tpe
case (query, tpe, list1, list2, list3, list4, list5) =>
(query, tpe,
list1.reduce(_ ++ _),
list2.reduce(_ ++ _),
list3.reduce(_ ++ _),
list4.reduce(_ ++ _),
list5.reduce(_ ++ _))
}.
toDF("QUERY",
"TYPE",
"CLICK_THROUGH_RATE",
"PURCHASE_RATE",
"PURCHASE_UNITS",
"CLICKS",
"IMPRESSIONS")
}
This gives me:
root
|-- QUERY: string (nullable = true)
|-- TYPE: string (nullable = true)
|-- CLICK_THROUGH_RATE: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- PURCHASE_RATE: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- PURCHASE_UNITS: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- CLICKS: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- IMPRESSIONS: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
But when I do result.count, I get this exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 63.0 failed 4 times, most recent failure: Lost task 0.3 in stage 63.0 (TID 62365, ip-10-0-1-52.ec2.internal, executor 2): java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2287)
at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1417)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2347)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2265)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2123)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1624)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2341)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2265)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2123)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1624)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:464)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
at scala.collection.immutable.List$SerializationProxy.readObject(List.scala:490)
at sun.reflect.GeneratedMethodAccessor232.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1170)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2232)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2123)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1624)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2341)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2265)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2123)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1624)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2341)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2265)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2123)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1624)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:464)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:83)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2041)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2029)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2028)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2028)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:966)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2262)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2211)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2200)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:777)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:401)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3389)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2550)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2764)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:254)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:291)
at org.apache.spark.sql.Dataset.show(Dataset.scala:753)
at org.apache.spark.sql.Dataset.show(Dataset.scala:730)
... 53 elided
Caused by: java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2287)
at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1417)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2347)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2265)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2123)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1624)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2341)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2265)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2123)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1624)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:464)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
at scala.collection.immutable.List$SerializationProxy.readObject(List.scala:490)
at sun.reflect.GeneratedMethodAccessor232.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1170)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2232)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2123)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1624)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2341)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2265)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2123)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1624)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2341)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2265)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2123)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1624)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:464)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:83)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
... 3 more
Am I doing something wrong?
The same problem occurs with HashMap.
I found the solution here: https://gist.github.com/ramn/5566596
You have to replace ObjectInputStream in your code with a new class, ObjectInputStreamWithCustomClassLoader:
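import java.io.{FileInputStream, ObjectInputStream, ObjectStreamClass}
// Resolves classes through this class's own class loader first and
// falls back to the default ObjectInputStream lookup if that fails.
class ObjectInputStreamWithCustomClassLoader(
fileInputStream: FileInputStream
) extends ObjectInputStream(fileInputStream) {
override def resolveClass(desc: ObjectStreamClass): Class[_] = {
try { Class.forName(desc.getName, false, getClass.getClassLoader) }
catch { case ex: ClassNotFoundException => super.resolveClass(desc) }
}
}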
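For illustration, here is how such a stream might be used to read an object back from a file. This is a self-contained sketch in plain Scala, not the code from the gist: the class is repeated so the block compiles on its own, and RoundTrip and its temp-file handling are made up for the example.

```scala
import java.io.{File, FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream, ObjectStreamClass}

// Same idea as the class above, repeated so this sketch is self-contained.
class ObjectInputStreamWithCustomClassLoader(in: FileInputStream)
    extends ObjectInputStream(in) {
  override def resolveClass(desc: ObjectStreamClass): Class[_] =
    try Class.forName(desc.getName, false, getClass.getClassLoader)
    catch { case _: ClassNotFoundException => super.resolveClass(desc) }
}

object RoundTrip {
  // Writes `value` to a temporary file and reads it back through the custom stream.
  def roundTrip(value: java.io.Serializable): AnyRef = {
    val file = File.createTempFile("roundtrip", ".bin")
    file.deleteOnExit()

    val out = new ObjectOutputStream(new FileOutputStream(file))
    try out.writeObject(value) finally out.close()

    val in = new ObjectInputStreamWithCustomClassLoader(new FileInputStream(file))
    try in.readObject() finally in.close()
  }
}
```

Calling RoundTrip.roundTrip(List(1, 2, 3)) should give back an equal list, which is the kind of object the SerializationProxy error above is about.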
I changed your code a little and I get a result.
I created a dataframe with a single record and the same column names as yours (all columns are strings in this sample):
val df = Seq(("select * from test", "type1", "device1", "10.0", "20.0", "1234", "23.4567", "10.98")).toDF.selectExpr("_1 as QUERY", "_2 as TYPE", "_3 as DEVICE", "_4 as PURCHASE_UNITS_SUM", "_5 as CLICK_SUM", "_6 as IMPRESSION_COUNT", "_7 as CLICK_THROUGH_RATE", "_8 as PURCHASE_RATE")
Below are the schema and the sample row:
root
|-- QUERY: string (nullable = true)
|-- TYPE: string (nullable = true)
|-- DEVICE: string (nullable = true)
|-- PURCHASE_UNITS_SUM: string (nullable = true)
|-- CLICK_SUM: string (nullable = true)
|-- IMPRESSION_COUNT: string (nullable = true)
|-- CLICK_THROUGH_RATE: string (nullable = true)
|-- PURCHASE_RATE: string (nullable = true)
+------------------+-----+-------+------------------+---------+----------------+------------------+-------------+
| QUERY| TYPE| DEVICE|PURCHASE_UNITS_SUM|CLICK_SUM|IMPRESSION_COUNT|CLICK_THROUGH_RATE|PURCHASE_RATE|
+------------------+-----+-------+------------------+---------+----------------+------------------+-------------+
|select * from test|type1|device1| 10.0| 20.0| 1234| 23.4567| 10.98|
+------------------+-----+-------+------------------+---------+----------------+------------------+-------------+
val result = df.withColumn("CLICK_THROUGH_RATE_MAP", map(col("DEVICE"), col("CLICK_THROUGH_RATE"))).
withColumn("PURCHASE_RATE_MAP", map(col("DEVICE"), col("PURCHASE_RATE"))).
withColumn("PURCHASE_SUM_MAP", map(col("DEVICE"), col("PURCHASE_UNITS_SUM"))).
withColumn("CLICK_SUM_MAP", map(col("DEVICE"), col("CLICK_SUM"))).
withColumn("IMPRESSION_SUM_MAP", map(col("DEVICE"), col("IMPRESSION_COUNT"))).
groupBy("QUERY", "TYPE").
agg(collect_list("CLICK_THROUGH_RATE_MAP"), collect_list("PURCHASE_RATE_MAP"), collect_list("PURCHASE_SUM_MAP"), collect_list("CLICK_SUM_MAP"), collect_list("IMPRESSION_SUM_MAP")).
as[(String, String, Seq[scala.collection.immutable.Map[String, Double]], Seq[scala.collection.immutable.Map[String, Double]], Seq[scala.collection.immutable.Map[String, Double]], Seq[scala.collection.immutable.Map[String, Double]], Seq[scala.collection.immutable.Map[String, Double]])]
result.show
+------------------+-----+------------------------------------+-------------------------------+------------------------------+---------------------------+--------------------------------+
| QUERY| TYPE|collect_list(CLICK_THROUGH_RATE_MAP)|collect_list(PURCHASE_RATE_MAP)|collect_list(PURCHASE_SUM_MAP)|collect_list(CLICK_SUM_MAP)|collect_list(IMPRESSION_SUM_MAP)|
+------------------+-----+------------------------------------+-------------------------------+------------------------------+---------------------------+--------------------------------+
|select * from test|type1| [Map(device1 -> 2...| [Map(device1 -> 1...| [Map(device1 -> 1...| [Map(device1 -> 2...| [Map(device1 -> 1...|
+------------------+-----+------------------------------------+-------------------------------+------------------------------+---------------------------+--------------------------------+
I changed the map function as follows
val finalresultdf = result.map { f => (f._1, f._2, f._3.reduce(_ ++ _), f._4.reduce(_ ++ _), f._5.reduce(_ ++ _), f._6.reduce(_ ++ _), f._7.reduce(_ ++ _)) }.
toDF("QUERY", "TYPE", "CLICK_THROUGH_RATE", "PURCHASE_RATE", "PURCHASE_UNITS", "CLICKS", "IMPRESSIONS")
finalresultdf.show
+------------------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+
| QUERY| TYPE| CLICK_THROUGH_RATE| PURCHASE_RATE| PURCHASE_UNITS| CLICKS| IMPRESSIONS|
+------------------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+
|select * from test|type1|Map(device1 -> 23...|Map(device1 -> 10...|Map(device1 -> 10.0)|Map(device1 -> 20.0)|Map(device1 -> 12...|
+------------------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+
scala> finalresultdf.count
res34: Long = 1
Hope this helps!!!
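One caveat about merging with reduce(_ ++ _): reduce throws on an empty sequence, and with ++ the right-hand map wins whenever the same key occurs twice. A small sketch in plain Scala (no Spark needed; MergeMaps is a name made up here) that uses foldLeft so an empty list yields an empty map:

```scala
object MergeMaps {
  // Merges all maps left to right; for duplicate keys the later map's value wins.
  def mergeAll(maps: Seq[Map[String, Double]]): Map[String, Double] =
    maps.foldLeft(Map.empty[String, Double])(_ ++ _)
}
```

Inside the Dataset map above, f._3.reduce(_ ++ _) could be written as MergeMaps.mergeAll(f._3), and likewise for the other columns.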
Getting ClassCastException in Spark 2 when dealing with a table having complex datatype columns like array<struct> and array<string>
The action I tried is a simple one, count:
val df = spark.sql("select * from <tablename>")
df.count
but I get the error below when running the Spark application:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 5, sandbox.hortonworks.com, executor 1): java.lang.ClassCastException: java.util.ArrayList cannot be cast to org.apache.hadoop.io.Text
at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41)
at org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$23.apply(HiveInspectors.scala:529)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:435)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
The weird thing is that the same action on the dataframe works fine in spark-shell.
The table has the following complex columns:
|-- sku_product: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- sku_id: string (nullable = true)
| | |-- qty: string (nullable = true)
| | |-- price: string (nullable = true)
| | |-- display_name: string (nullable = true)
| | |-- sku_displ_clr_desc: string (nullable = true)
| | |-- sku_sz_desc: string (nullable = true)
| | |-- parent_product_id: string (nullable = true)
| | |-- delivery_mthd: string (nullable = true)
| | |-- pick_up_store_id: string (nullable = true)
| | |-- delivery: string (nullable = true)
|-- hitid_low: string (nullable = true)
|-- evar7: array (nullable = true)
| |-- element: string (containsNull = true)
|-- hitid_high: string (nullable = true)
|-- evar60: array (nullable = true)
| |-- element: string (containsNull = true)
Let me know if any further information is needed.
I had a similar problem. I was using Spark 2.1 with Parquet files.
I discovered that one of the Parquet files had a different schema from the others, so when I tried to read them all together, I got a cast error.
To resolve it, I checked the files one by one.
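Checking file by file can be automated by grouping the paths by their schema. The sketch below is plain Scala and leaves the actual schema lookup to a function you pass in; SchemaGroups and its method names are made up for illustration. With Spark, and assuming a SparkSession named spark is in scope, the lookup could be p => spark.read.parquet(p).schema.

```scala
object SchemaGroups {
  // Groups paths by whatever `readSchema` returns for each of them.
  def groupBySchema[S](paths: Seq[String])(readSchema: String => S): Map[S, Seq[String]] =
    paths.groupBy(readSchema)

  // Paths whose schema differs from the most common schema.
  def outliers[S](paths: Seq[String])(readSchema: String => S): Seq[String] = {
    val groups = groupBySchema(paths)(readSchema)
    if (groups.isEmpty) Seq.empty
    else {
      val majority = groups.maxBy(_._2.size)._1
      (groups - majority).values.flatten.toSeq
    }
  }
}
```

Any path returned by outliers is a candidate for the file with the mismatching schema.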
I'm having an issue trying to show the data that's in my DF (or, more generally, I want to visualize the data, preferably with the SQL interpreter).
I have a flat text file (a log excerpt) that I'm formatting through a case class and then creating a DF from the result.
The data itself looks like this (single-line excerpt):
2017-02-09 15:56:15,411 INFO [pool-2-thread-1] n.t.f.c.LoggingConsumer [LoggingConsumer.java:27] [Floor=b827eb074534, tile=00051] Data received: '[false, false, false, false, false, false, false, true]', batch size= 0
Here's my code:
import org.apache.commons.io.IOUtils
import java.net.URL
import java.nio.charset.Charset
import org.apache.spark.sql.SparkSession
import sqlContext.implicits._
val realdata = sc.textFile("/root/application.txt")
case class testClass(date: String, time: String, level: String, unknown1: String, unknownConsumer: String, unknownConsumer2: String, vloer: String, tegel: String, msg: String, sensor1: String, sensor2: String, sensor3: String, sensor4: String, sensor5: String, sensor6: String, sensor7: String, sensor8: String, batchsize: String, troepje1: String, troepje2: String)
val mapData = realdata
.filter(line => line.contains("data") && line.contains("INFO"))
.map(s => s.split(" ").toList)
.map(
s => testClass(s(0),
s(1).split(",")(0),
s(1).split(",")(1),
s(3),
s(4),
s(5),
s(6),
s(7),
s(8),
s(15),
s(16),
s(17),
s(18),
s(19),
s(20),
s(21),
s(22),
"",
"",
""
)
).toDF
When I print the resulting schema, it seems to do what I want:
root
|-- date: string (nullable = true)
|-- time: string (nullable = true)
|-- level: string (nullable = true)
|-- unknown1: string (nullable = true)
|-- unknownConsumer: string (nullable = true)
|-- unknownConsumer2: string (nullable = true)
|-- vloer: string (nullable = true)
|-- tegel: string (nullable = true)
|-- msg: string (nullable = true)
|-- sensor1: string (nullable = true)
|-- sensor2: string (nullable = true)
|-- sensor3: string (nullable = true)
|-- sensor4: string (nullable = true)
|-- sensor5: string (nullable = true)
|-- sensor6: string (nullable = true)
|-- sensor7: string (nullable = true)
|-- sensor8: string (nullable = true)
|-- batchsize: string (nullable = true)
|-- troepje1: string (nullable = true)
|-- troepje2: string (nullable = true)
Now granted, it's not very pretty (and I do apologise for the Dutch), but that seems to work.
But when I do a mapData.show or anything like that, I get:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost, executor driver): java.io.IOException: No such file or directory
at java.io.UnixFileSystem.createFileExclusively(Native Method)
at java.io.File.createTempFile(File.java:2024)
at org.apache.spark.util.Utils$.downloadFile(Utils.scala:503)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:639)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:463)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:508)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:500)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:500)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:257)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:333)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2377)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2113)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2112)
at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2795)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2112)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2327)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
at org.apache.spark.sql.Dataset.show(Dataset.scala:636)
at org.apache.spark.sql.Dataset.show(Dataset.scala:595)
... 56 elided
Caused by: java.io.IOException: No such file or directory
at java.io.UnixFileSystem.createFileExclusively(Native Method)
at java.io.File.createTempFile(File.java:2024)
at org.apache.spark.util.Utils$.downloadFile(Utils.scala:503)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:639)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:463)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:508)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:500)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:500)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:257)
... 3 more
Can anybody tell me what I'm doing wrong, or whether there's a better way to visualize the data?
Thank you so much for any help or guidance.
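On the parsing side, and separately from the IOException (which the trace shows is raised in Executor.updateDependencies while a file is fetched, before any row is parsed), two things stand out: line.contains("data") is case-sensitive while the sample line says "Data received", and indexing into a whitespace split breaks as soon as the spacing changes. A regex-based sketch in plain Scala that can be tried without Spark; TileReading, LogLineParser and the field names are made up for illustration and are not the testClass from above:

```scala
import scala.util.Try

final case class TileReading(
  date: String,
  time: String, // still contains the milliseconds, e.g. "15:56:15,411"
  level: String,
  floor: String,
  tile: String,
  sensors: Vector[Boolean],
  batchSize: Int
)

object LogLineParser {
  // Thread name, logger and source location are matched but not captured.
  private val Pattern =
    """^(\S+) (\S+) (\w+)\s+\[[^\]]*\] \S+ \[[^\]]*\] \[Floor=(\w+), tile=(\w+)\] Data received: '\[([^\]]*)\]', batch size=\s*(\d+)$""".r

  // Returns None for lines that do not match or contain unparsable values.
  def parse(line: String): Option[TileReading] = line match {
    case Pattern(date, time, level, floor, tile, sensors, batch) =>
      Try(
        TileReading(date, time, level, floor, tile,
          sensors.split(",").map(_.trim.toBoolean).toVector, batch.toInt)
      ).toOption
    case _ => None
  }
}
```

In the Spark job this could replace the filter and the two map steps with realdata.flatMap(line => LogLineParser.parse(line)).toDF, which also drops non-matching lines instead of failing on them.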