I am not able to write a Delta table to MinIO.
I am running Spark as master and worker pods in Kubernetes, using a Jupyter notebook as the driver and MinIO for storage.
Writing the Delta table fails:
df1.write.partitionBy(['asset_id']).format("delta").mode("append").option("mergeSchema", "true").save("s3a://test/asset-table")
Python version: 3.7
PySpark: 3.2.2
Java JDK: 8
Error:
23/01/04 07:37:12 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 12) (10.244.28.3 executor 0): java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of org.apache.spark.rdd.MapPartitionsRDD
I am able to write Parquet files to MinIO, but not the Delta table. The same write fails with:
Py4JJavaError: An error occurred while calling o195.save.
: org.apache.spark.SparkException: Job aborted.
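The `SerializedLambda` `ClassCastException` is a common symptom of the driver and the executor pods running different Spark/Scala builds, or a Delta jar built for a different Spark release. A minimal sketch of the driver-side configuration, assuming MinIO at a placeholder endpoint and Delta 2.0.x (the release line built against Spark 3.2.x / Scala 2.12); every endpoint and credential value below is a placeholder, and the versions must be adapted to the cluster's actual build:

```python
# Sketch of the session config for Delta on MinIO. The key point for the
# SerializedLambda ClassCastException: the Delta jar's Scala suffix (_2.12)
# and release line must match the Spark build running on the executor pods,
# not just the pip-installed pyspark on the driver.
S3A_DELTA_CONF = {
    # Delta 2.0.x is the release line built against Spark 3.2.x / Scala 2.12
    "spark.jars.packages": "io.delta:delta-core_2.12:2.0.2,"
                           "org.apache.hadoop:hadoop-aws:3.3.1",
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog":
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    "spark.hadoop.fs.s3a.endpoint": "http://minio:9000",    # placeholder
    "spark.hadoop.fs.s3a.access.key": "ACCESS_KEY",         # placeholder
    "spark.hadoop.fs.s3a.secret.key": "SECRET_KEY",         # placeholder
    "spark.hadoop.fs.s3a.path.style.access": "true",
}

def apply_conf(builder, conf=S3A_DELTA_CONF):
    """Fold the settings into a SparkSession.Builder before getOrCreate()."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

Usage on the driver would then be `spark = apply_conf(SparkSession.builder.appName("delta-minio")).getOrCreate()`, with `SparkSession` imported from `pyspark.sql`.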
Related
I'm attempting to run the following commands using the "%spark" interpreter in Apache Zeppelin:
val data = spark.range(0, 5)
data.write.format("delta").save("/tmp/delta-table")
Which yields this output (truncated to omit repeat output):
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 7, 192.168.64.3, executor 2): java.io.FileNotFoundException: File file:/tmp/delta-table/_delta_log/00000000000000000000.json does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
...
I'm unable to figure out why this is happening, as I'm unfamiliar with Spark. Any tips? Thanks for your help.
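One common cause of this error (a guess, not confirmed by the post): a bare `/tmp/...` path resolves to each machine's local filesystem, so the executors look for the `_delta_log` directory on their own disks and find nothing. A small illustrative check of which paths are safe in cluster mode, where the heuristic and the example hostnames are assumptions:

```python
from urllib.parse import urlparse

def is_cluster_visible(path: str) -> bool:
    """Heuristic: paths without a scheme (or with file:) resolve to each
    node's local disk; hdfs://, s3a://, etc. point at shared storage."""
    return urlparse(path).scheme not in ("", "file")

# The Zeppelin snippet saved to a bare local path, so executors on other
# nodes cannot see the _delta_log the driver wrote:
assert not is_cluster_visible("/tmp/delta-table")
assert not is_cluster_visible("file:/tmp/delta-table")
# A URI on storage every node can reach (names are placeholders) avoids
# the FileNotFoundException:
assert is_cluster_visible("hdfs://namenode:8020/tmp/delta-table")
assert is_cluster_visible("s3a://bucket/delta-table")
```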
I have created a partitioned ORC table in Hive. The data is loaded into HDFS using Apache Pig in ORC format, and the Hive table is created on top of it. The partition columns are year, month, and day. When I try to read that table using Spark SQL, I get an array-out-of-bounds exception. Please find the code and error message below.
Code:
myTable = spark.table("testDB.employee")
myTable.count()
Error:
ERROR Executor: Exception in task 8.0 in stage 10.0 (TID 66)
java.lang.IndexOutOfBoundsException: toIndex = 47
The data types in this table are string, timestamp, and double. When I try to select all the columns with a Spark SQL query, I get the class cast exception shown below.
py4j.protocol.Py4JJavaError: An error occurred while calling
o536.showString. : org.apache.spark.SparkException: Job aborted due to
stage failure: Task 0 in stage 12.0 failed 1 times, most recent
failure: Lost task 0.0 in stage 12.0 (TID 84, localhost, executor
driver): java.lang.ClassCastException: org.apache.hadoop.io.Text
cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable
After this I tried to cast the column to a timestamp using the snippet below, but I still get the array-out-of-bounds exception.
from pyspark.sql.functions import unix_timestamp
from pyspark.sql.types import TimestampType

df2 = df.select('dt', unix_timestamp('dt', "yyyy-MM-dd HH:mm:ss").cast(TimestampType()).alias("timestamp"))
If you don't specify a partition filter, it can cause this problem. On my side, specifying a date-between filter on the partition columns resolved the out-of-bounds exception.
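As an illustration of the answer above, the read can be restricted to specific partitions so Spark never opens files from partitions whose data it cannot decode. The table and partition column names come from the question; the year/month values and the helper are hypothetical:

```python
# Hypothetical helper building the partition-pruned query the answer
# recommends, instead of a full-table spark.table(...).count():
def partition_filtered_query(table: str, year: int, month: int) -> str:
    return f"SELECT * FROM {table} WHERE year = {year} AND month = {month}"

query = partition_filtered_query("testDB.employee", 2018, 1)
# Then run: spark.sql(query).count()
```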
I'm trying to join an RDD with a Cassandra table using the Cassandra Spark Connector:
samplerdd.joinWithCassandraTable(keyspace, CassandraParams.table)
  .on(SomeColumns(t.date as a.date,
                  t.key as a.key))
It works in standalone mode but when I execute in cluster mode I get this error:
Job aborted due to stage failure: Task 6 in stage 0.0 failed 4 times, most recent failure: Lost task 6.3 in stage 0.0 (TID 20, 10.10.10.51): java.io.InvalidClassException: com.datastax.spark.connector.rdd.CassandraJoinRDD; local class incompatible: stream classdesc serialVersionUID = 6155891939893411978, local class serialVersionUID = 1245204129865863681
I have already checked the jars on the master and the slaves, and they seem to be the same versions.
I'm using Spark 2.0.0, Cassandra 3.7, Cassandra-Spark Connector 2.0.0-M2, Cassandra Driver Core 3.1.0, and Scala 2.11.8.
What could be happening?
Finally solved: updating the cassandra-driver-core dependency to 3.0.0 made it work. – Manuel Valero
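For an sbt build, the fix above might look like the fragment below. This is a hedged sketch: the artifact coordinates are the standard ones on Maven Central, with the connector version from the question and the driver version from the fix.

```scala
// build.sbt fragment (illustrative): pin the connector and the downgraded
// cassandra-driver-core so the classes serialized by the driver match the
// ones deserialized on the executors (the InvalidClassException means the
// two sides hold different builds of the same class).
libraryDependencies ++= Seq(
  "com.datastax.spark" %% "spark-cassandra-connector" % "2.0.0-M2",
  "com.datastax.cassandra" % "cassandra-driver-core" % "3.0.0"
)
```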
I'm using Zeppelin-Sandbox 0.5.6 with Spark 1.6.1 on Amazon EMR.
I am reading a CSV file located on S3.
The problem is that sometimes I get an error reading the file and need to restart the interpreter several times until it works. Nothing in my code changes; I can't reliably reproduce it and can't tell when it will happen.
My code goes as following:
defining dependencies:
%dep
z.reset()
z.addRepo("Spark Packages Repo").url("http://dl.bintray.com/spark-packages/maven")
z.load("com.databricks:spark-csv_2.10:1.4.0")
using spark-csv:
%pyspark
import pyspark.sql.functions as func
df = sqlc.read.format("com.databricks.spark.csv").option("header", "true").load("s3://some_location/some_csv.csv")
error msg:
Py4JJavaError: An error occurred while calling o61.load. :
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3
in stage 0.0 (TID 3, ip-172-22-2-187.ec2.internal):
java.io.InvalidClassException: com.databricks.spark.csv.CsvRelation;
local class incompatible: stream classdesc serialVersionUID =
2004612352657595167, local class serialVersionUID =
6879416841002809418
...
Caused by: java.io.InvalidClassException:
com.databricks.spark.csv.CsvRelation; local class incompatible
Once the CSV is read into the DataFrame, the rest of the code works fine.
Any advice?
Thanks!
You need to launch Spark with the spark-csv package added, like this:
$ pyspark --packages com.databricks:spark-csv_2.10:1.2.0
Now spark-csv will be on your classpath.
Any Spark job I run that involves HBase access results in the errors below. My own jobs are in Scala, but the supplied Python examples end the same way. The cluster is Cloudera, running CDH 5.4.4. The same jobs run fine on a different cluster with CDH 5.3.1.
Any help is greatly appreciated!
...
15/08/15 21:46:30 WARN TableInputFormatBase: initializeTable called multiple times. Overwriting connection and table reference; TableInputFormatBase will not close these old references when done.
...
15/08/15 21:46:32 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, some.server.name): java.io.IOException: Cannot create a record reader because of a previous error. Please look at the previous logs lines from the task's full log for more details.
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:163)
...
Caused by: java.lang.IllegalStateException: The input format instance has not been properly initialized. Ensure you call initializeTable either in your constructor or initialize method
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getTable(TableInputFormatBase.java:389)
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:158)
... 14 more
Run spark-shell with these parameters:
--driver-class-path .../cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar --driver-java-options "-Dspark.executor.extraClassPath=.../cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar"
Why it works is described here.