I am attempting to save a Spark DataFrame as a CSV. I have looked up numerous different posts and guides and, for some reason, am still running into an error. The code I am using to do this is:
endframe.coalesce(1)
  .write
  .mode("append")
  .csv("file:///home/X/Code/output/output.csv")
I have also tried this by including .format("com.databricks.spark.csv"), as well as by changing the .csv() to a .save(), and strangely none of these work. The most unusual part is that running this code creates an empty folder called "output.csv" inside the output folder.
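For reference, the variants I mentioned look roughly like this (a sketch of the calls, not my exact code):
// Adding the databricks CSV format explicitly
endframe.coalesce(1)
  .write
  .format("com.databricks.spark.csv")
  .mode("append")
  .save("file:///home/X/Code/output/output.csv")
// Swapping .csv() for .save()
endframe.coalesce(1)
  .write
  .mode("append")
  .save("file:///home/X/Code/output/output.csv")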
The error message that Spark gives is:
Job aborted due to stage failure:
Task 0 in stage 281.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 281.0 (TID 22683, X.x.local, executor 4): org.apache.spark.SparkException:
Task failed while writing rows.
I have verified that the dataframe schema is properly initialized. However, when I use the .format variant, I do not import com.databricks.spark.csv, but I do not think that is the problem. Any advice on this would be appreciated.
The schema is as follows:
|-- kwh: double (nullable = true)
|-- qh_end: double (nullable = true)
|-- cdh70: double (nullable = true)
|-- norm_hbu: double (nullable = true)
|-- precool_counterprecoolevent_id6: double (nullable = true)
|-- precool_counterprecoolevent_id7: double (nullable = true)
|-- precool_counterprecoolevent_id8: double (nullable = true)
|-- precool_counterprecoolevent_id9: double (nullable = true)
|-- event_id10event_counterevent: double (nullable = true)
|-- event_id2event_counterevent: double (nullable = true)
|-- event_id3event_counterevent: double (nullable = true)
|-- event_id4event_counterevent: double (nullable = true)
|-- event_id5event_counterevent: double (nullable = true)
|-- event_id6event_counterevent: double (nullable = true)
|-- event_id7event_counterevent: double (nullable = true)
|-- event_id8event_counterevent: double (nullable = true)
|-- event_id9event_counterevent: double (nullable = true)
|-- event_idTestevent_counterevent: double (nullable = true)
|-- event_id10snapback_countersnapback: double (nullable = true)
|-- event_id2snapback_countersnapback: double (nullable = true)
|-- event_id3snapback_countersnapback: double (nullable = true)
|-- event_id4snapback_countersnapback: double (nullable = true)
|-- event_id5snapback_countersnapback: double (nullable = true)
|-- event_id6snapback_countersnapback: double (nullable = true)
|-- event_id7snapback_countersnapback: double (nullable = true)
|-- event_id8snapback_countersnapback: double (nullable = true)
|-- event_id9snapback_countersnapback: double (nullable = true)
|-- event_idTestsnapback_countersnapback: double (nullable = true)
I am trying to apply the k-means algorithm.
Code
// imports needed for the snippet below (spark.ml VectorAssembler and KMeans)
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.clustering.KMeans
val dfJoin_products_items = df_products.join(df_items, "product_id")
dfJoin_products_items.createGlobalTempView("products_items")
val weightFreight = spark.sql("SELECT cast(product_weight_g as double) weight, cast(freight_value as double) freight FROM global_temp.products_items")
case class Rows(weight:Double, freight:Double)
val rows = weightFreight.as[Rows]
val assembler = new VectorAssembler().setInputCols(Array("weight", "freight")).setOutputCol("features")
val data = assembler.transform(rows)
val kmeans = new KMeans().setK(4)
val model = kmeans.fit(data)
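For context, I'm following the standard spark.ml flow of a VectorAssembler feeding KMeans; a stripped-down sketch of the same shape, with made-up numbers and assuming the spark-shell's spark session and implicits, would be:
// dummy weight/freight values, just to illustrate the flow (not my data)
import spark.implicits._
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.clustering.KMeans
val sample = Seq((1000.0, 15.2), (250.0, 7.8), (4000.0, 32.1), (120.0, 5.0))
  .toDF("weight", "freight")
// assemble the two columns into a single "features" vector column
val sampleFeatures = new VectorAssembler()
  .setInputCols(Array("weight", "freight"))
  .setOutputCol("features")
  .transform(sample)
val sampleModel = new KMeans().setK(2).fit(sampleFeatures)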
Values
dfJoin_products_items
scala> dfJoin_products_items.printSchema
root
|-- product_id: string (nullable = true)
|-- product_category_name: string (nullable = true)
|-- product_name_lenght: string (nullable = true)
|-- product_description_lenght: string (nullable = true)
|-- product_photos_qty: string (nullable = true)
|-- product_weight_g: string (nullable = true)
|-- product_length_cm: string (nullable = true)
|-- product_height_cm: string (nullable = true)
|-- product_width_cm: string (nullable = true)
|-- order_id: string (nullable = true)
|-- order_item_id: string (nullable = true)
|-- seller_id: string (nullable = true)
|-- shipping_limit_date: string (nullable = true)
|-- price: string (nullable = true)
|-- freight_value: string (nullable = true)
weightFreight
scala> weightFreight.printSchema
root
|-- weight: double (nullable = true)
|-- freight: double (nullable = true)
Error
2019-02-03 20:51:41 WARN BlockManager:66 - Putting block rdd_126_1 failed due to exception org.apache.spark.SparkException: Failed to execute user defined function($anonfun$4: (struct<weight:double,freight:double>) => struct<type:tinyint,size:int,indices:array<int>,values:array<double>>).
2019-02-03 20:51:41 WARN BlockManager:66 - Block rdd_126_1 could not be removed as it was not found on disk or in memory
2019-02-03 20:51:41 WARN BlockManager:66 - Putting block rdd_126_2 failed due to exception org.apache.spark.SparkException: Failed to execute user defined function($anonfun$4: (struct<weight:double,freight:double>) => struct<type:tinyint,size:int,indices:array<int>,values:array<double>>).
2019-02-03 20:51:41 ERROR Executor:91 - Exception in task 1.0 in stage 16.0 (TID 23)
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$4: (struct<weight:double,freight:double>) => struct<type:tinyint,size:int,indices:array<int>,values:array<double>>)
I don't understand this error; can someone please explain it to me?
Thanks a lot!
UPDATE 1: Full stacktrace
The stacktrace is huge, so you can find it here: https://pastebin.com/PhmZPtDk
Getting a ClassCastException in Spark 2 when dealing with a table that has complex datatype columns like array<struct> and array<string>
The action I tried is a simple one: count
df=spark.sql("select * from <tablename>")
df.count
but I get the error below when running the Spark application:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 5, sandbox.hortonworks.com, executor 1): java.lang.ClassCastException: java.util.ArrayList cannot be cast to org.apache.hadoop.io.Text
at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41)
at org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$23.apply(HiveInspectors.scala:529)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:435)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
The weird thing is that the same action on the dataframe works fine in spark-shell.
The table has the complex columns below:
|-- sku_product: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- sku_id: string (nullable = true)
| | |-- qty: string (nullable = true)
| | |-- price: string (nullable = true)
| | |-- display_name: string (nullable = true)
| | |-- sku_displ_clr_desc: string (nullable = true)
| | |-- sku_sz_desc: string (nullable = true)
| | |-- parent_product_id: string (nullable = true)
| | |-- delivery_mthd: string (nullable = true)
| | |-- pick_up_store_id: string (nullable = true)
| | |-- delivery: string (nullable = true)
|-- hitid_low: string (nullable = true)
|-- evar7: array (nullable = true)
| |-- element: string (containsNull = true)
|-- hitid_high: string (nullable = true)
|-- evar60: array (nullable = true)
| |-- element: string (containsNull = true)
Let me know if any further information is needed.
I had a similar problem. I was using Spark 2.1 with parquet files.
I discovered that one of the parquet files had a different schema than the others, so when I tried to read them all together, I got the cast error.
To resolve it, I just checked the files one by one.
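A rough sketch of one way to do that file-by-file check (the paths below are placeholders, not my actual files):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
// hypothetical list of the individual parquet files to inspect
val files = Seq("/data/part-00000.parquet", "/data/part-00001.parquet", "/data/part-00002.parquet")
// take the first file's schema as the reference and flag any file that differs
val reference = spark.read.parquet(files.head).schema
files.foreach { path =>
  val schema = spark.read.parquet(path).schema
  if (schema != reference) println(s"Schema mismatch in $path:\n$schema")
}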
I'm having an issue trying to show the data that's in my DF (or, more generally put, I want to visualize the data, preferably with the SQL interpreter).
I've got a flat text file (a log excerpt) that I'm formatting through a class and then creating a DF based on the result.
The data itself looks like this (single-line excerpt):
2017-02-09 15:56:15,411 INFO [pool-2-thread-1] n.t.f.c.LoggingConsumer [LoggingConsumer.java:27] [Floor=b827eb074534, tile=00051] Data received: '[false, false, false, false, false, false, false, true]', batch size= 0
Here's my code:
import org.apache.commons.io.IOUtils
import java.net.URL
import java.nio.charset.Charset
import org.apache.spark.sql.SparkSession
import sqlContext.implicits._
val realdata = sc.textFile("/root/application.txt")
case class testClass(date: String, time: String, level: String, unknown1: String, unknownConsumer: String, unknownConsumer2: String, vloer: String, tegel: String, msg: String, sensor1: String, sensor2: String, sensor3: String, sensor4: String, sensor5: String, sensor6: String, sensor7: String, sensor8: String, batchsize: String, troepje1: String, troepje2: String)
val mapData = realdata
.filter(line => line.contains("data") && line.contains("INFO"))
.map(s => s.split(" ").toList)
.map(
s => testClass(s(0),
s(1).split(",")(0),
s(1).split(",")(1),
s(3),
s(4),
s(5),
s(6),
s(7),
s(8),
s(15),
s(16),
s(17),
s(18),
s(19),
s(20),
s(21),
s(22),
"",
"",
""
)
).toDF
When I print the resulting schema it seems to do what I want:
root
|-- date: string (nullable = true)
|-- time: string (nullable = true)
|-- level: string (nullable = true)
|-- unknown1: string (nullable = true)
|-- unknownConsumer: string (nullable = true)
|-- unknownConsumer2: string (nullable = true)
|-- vloer: string (nullable = true)
|-- tegel: string (nullable = true)
|-- msg: string (nullable = true)
|-- sensor1: string (nullable = true)
|-- sensor2: string (nullable = true)
|-- sensor3: string (nullable = true)
|-- sensor4: string (nullable = true)
|-- sensor5: string (nullable = true)
|-- sensor6: string (nullable = true)
|-- sensor7: string (nullable = true)
|-- sensor8: string (nullable = true)
|-- batchsize: string (nullable = true)
|-- troepje1: string (nullable = true)
|-- troepje2: string (nullable = true)
Now granted, it's not very pretty (and I do apologise for the Dutch), but that seems to work.
But when I do a mapData.show or anything like that, I get:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost, executor driver): java.io.IOException: No such file or directory
at java.io.UnixFileSystem.createFileExclusively(Native Method)
at java.io.File.createTempFile(File.java:2024)
at org.apache.spark.util.Utils$.downloadFile(Utils.scala:503)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:639)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:463)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:508)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:500)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:500)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:257)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:333)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2377)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2113)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2112)
at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2795)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2112)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2327)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
at org.apache.spark.sql.Dataset.show(Dataset.scala:636)
at org.apache.spark.sql.Dataset.show(Dataset.scala:595)
... 56 elided
Caused by: java.io.IOException: No such file or directory
at java.io.UnixFileSystem.createFileExclusively(Native Method)
at java.io.File.createTempFile(File.java:2024)
at org.apache.spark.util.Utils$.downloadFile(Utils.scala:503)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:639)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:463)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:508)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:500)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:500)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:257)
... 3 more
Is there anybody who can tell me what I'm doing wrong, or whether there's a better way to visualize the data?
Thank you so much for any help or guidance.
I'm trying to query two tables from Cassandra into two dataframes, then join these two dataframes into one dataframe (result).
I can get the correct result, and the Spark job finishes normally when run from Eclipse on my computer.
But when I submit it to the Spark server (local mode), the job just hangs without any exception or error message and still hasn't finished after an hour, until I press Ctrl+C to stop it.
I have no idea why the job won't work on the Spark server, or what the difference is between Eclipse and the Spark server. If the cause is an OutOfMemory problem, is it possible that Spark didn't throw any exception and just hung?
Any advice?
Thanks~
Submit command
/usr/bin/spark-submit --class com.test.c2c --jars file:///home/iotcloud/Documents/grace/spark/spark-cassandra-connector-1.6.3-s_2.10.jar file:///home/iotcloud/Documents/grace/spark/C2C_1205.jar
Here is my scala code:
package com.test
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import com.datastax.spark.connector.cql._;
import com.datastax.spark.connector._;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql._;
import org.apache.spark.sql.cassandra._;
object c2c {
def main(args: Array[String]) {
println("Start...")
val conf = new SparkConf(true)
.set("spark.cassandra.connection.host", "10.2.1.67")
.setAppName("ConnectToCassandra")
.setMaster("local")
val sc = new SparkContext(conf)
println("Cassandra setting done...")
println("================================================1")
println("Start to save to cassandra...")
val cc = new CassandraSQLContext(sc)
cc.setKeyspace("iot_test")
val df_info = cc.sql("select gatewaymac,sensormac,sensorid,sensorfrequency,status from tsensor_info where gatewaymac != 'failed'")
val df_loc = cc.sql("select sensorlocationid,sensorlocationname,company,plant,department,building,floor,sensorid from tsensorlocation_info where sensorid != 'NULL'")
println("================================================2")
println("registerTmepTable...")
df_info.registerTempTable("info")
df_loc.registerTempTable("loc")
println("================================================4")
println("mapping table...")
println("===info===")
df_info.printSchema()
df_info.take(5).foreach(println)
println("===location===")
df_loc.printSchema()
df_loc.take(5).foreach(println)
println("================================================5")
println("print mapping result")
val result = df_info.join(df_loc, "sensorid")
result.registerTempTable("ref")
result.printSchema()
result.take(5).foreach(println)
println("====Finish====")
sc.stop()
}
}
Normal result in Eclipse
Cassandra setting done...
================================================1
Start to save to cassandra...
================================================2
registerTempTable...
================================================4
mapping table...
===info===
root
|-- gatewaymac: string (nullable = true)
|-- sensormac: string (nullable = true)
|-- sensorid: string (nullable = true)
|-- sensorfrequency: string (nullable = true)
|-- status: string (nullable = true)
[0000aaaaaaaaat7f,e9d050f0ebc25000 ,0000aaaaaaaaat7f3242,null,N]
[000000000000b219,c879b4f921c25000 ,000000000000b2193590,00:01,N]
[0000aaaaaaaaaabb,2c153cf9f0c25000 ,0000aaaaaaaaaabba353,null,Y]
[000000000000a412,17da712795c25000 ,000000000000a4126156,00:05,Y]
[000000000000a104,b2a4b8b7a6c25000 ,000000000000a1046340,00:01,N]
===location===
root
|-- sensorlocationid: string (nullable = true)
|-- sensorlocationname: string (nullable = true)
|-- company: string (nullable = true)
|-- plant: string (nullable = true)
|-- department: string (nullable = true)
|-- building: string (nullable = true)
|-- floor: string (nullable = true)
|-- sensorid: string (nullable = true)
[JA092,A1F-G-L00-S066,IAC,IACJ,MT,A,1,000000000000a108a19f]
[JA044,A2F-I-L00-S037,IAC,IACJ,MT,A,2,000000000000a2024246]
[JA111,A2F-C-L00-S076,IAC,IACJ,MPA,A,2,000000000000a210c710]
[PA041,A1F-SMT-S03,IAC,IACP,SMT,A,1,000000000000a10354c1]
[PC010,C3F-IQC-S03,IAC,IACP,IQC,C,3,000000000000c3269786]
================================================5
print mapping result
root
|-- sensorid: string (nullable = true)
|-- gatewaymac: string (nullable = true)
|-- sensormac: string (nullable = true)
|-- sensorfrequency: string (nullable = true)
|-- status: string (nullable = true)
|-- sensorlocationid: string (nullable = true)
|-- sensorlocationname: string (nullable = true)
|-- company: string (nullable = true)
|-- plant: string (nullable = true)
|-- department: string (nullable = true)
|-- building: string (nullable = true)
|-- floor: string (nullable = true)
[000000000000a10275bc,000000000000a102,e85ce9b9d2c25000 ,00:05,Y,PA030,A1F-WM-S02,IAC,IACP,WM,A,1]
[000000000000b117160c,000000000000b117,33915a79e5c25000 ,00:05,Y,PB011,B1F-WM-S01,IAC,IACP,WM,B,1]
[000000000000a309024b,000000000000a309,afdab2efbbc25000 ,00:00,N,PA101,A3F-MP6-R01,IAC,IACP,MP6,A,3]
[000000000000c6294109,000000000000c629,383cca8e45c25000 ,00:05,Y,PC017,C6F-WM-S01,IAC,IACP,WM,C,6]
[000000000000a205e52e,000000000000a205,8d83303cf4c25000 ,00:00,N,PA063,A2F-MP6-R04,IAC,IACP,MP6,A,2]
====Finish====
I finally found the answer: I had forgotten to set the master to the Spark standalone cluster.
I had been submitting the Spark job to the local Spark server.
After setting the master to the Spark standalone cluster, the job works fine. Maybe it's because the local Spark server doesn't have enough cores to execute the task (it's an old machine with 2 cores; the standalone cluster, by the way, has 4 nodes, all of them old machines as well).
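For completeness, the change was essentially just the master setting; roughly like this (the master URL is a placeholder for my cluster's real host and port):
val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "10.2.1.67")
  .setAppName("ConnectToCassandra")
  // was .setMaster("local"); point at the standalone cluster's master instead
  .setMaster("spark://<master-host>:7077")
Alternatively, drop setMaster from the code and pass --master spark://<master-host>:7077 to spark-submit.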