I'm trying to join an RDD with a Cassandra table using the Spark Cassandra Connector:
samplerdd.joinWithCassandraTable(keyspace, CassandraParams.table)
  .on(SomeColumns(t.date as a.date,
    t.key as a.key))
It works in standalone mode but when I execute in cluster mode I get this error:
Job aborted due to stage failure: Task 6 in stage 0.0 failed 4 times, most recent failure: Lost task 6.3 in stage 0.0 (TID 20, 10.10.10.51): java.io.InvalidClassException: com.datastax.spark.connector.rdd.CassandraJoinRDD; local class incompatible: stream classdesc serialVersionUID = 6155891939893411978, local class serialVersionUID = 1245204129865863681
I have already checked the jars on the master and the slaves, and they appear to be the same versions.
I'm using Spark 2.0.0, Cassandra 3.7, Cassandra-Spark Connector 2.0.0 M2,
Cassandra Driver Core 3.1.0 and Scala 2.11.8.
What could be happening?
Finally solved: updating the cassandra-driver-core dependency to 3.0.0 made it work. – Manuel Valero
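A serialVersionUID mismatch like this generally means the driver and an executor loaded different builds of the same class, so comparing jar contents is a stricter check than comparing version numbers in file names. A minimal sketch, assuming the jars on each node sit in one directory; the function names jar_checksums and diff_checksums are hypothetical, and only the Python standard library is used:

```python
import hashlib
from pathlib import Path


def jar_checksums(directory):
    """Map each .jar file name in `directory` to its SHA-256 hex digest."""
    sums = {}
    for jar in sorted(Path(directory).glob("*.jar")):
        digest = hashlib.sha256()
        with jar.open("rb") as handle:
            # Read in chunks so large assembly jars are not loaded whole.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        sums[jar.name] = digest.hexdigest()
    return sums


def diff_checksums(reference, other):
    """Return jar names missing on one side or whose digests differ."""
    names = set(reference) | set(other)
    return sorted(n for n in names if reference.get(n) != other.get(n))
```

Running jar_checksums on the master and on each worker, then passing the two results to diff_checksums, would list any jar that is not byte-identical between the nodes.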
I am not able to write a Delta table to MinIO.
I am running Spark as master and worker pods in Kubernetes, using a Jupyter notebook as the driver and MinIO for storage.
Writing the Delta table fails:
df1.write.partitionBy(['asset_id']).format("delta").mode("append").option("mergeSchema", "true").save("s3a://test/asset-table")
python version: 3.7
pyspark: 3.2.2
java JDK : 8
error:
23/01/04 07:37:12 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 12) (10.244.28.3 executor 0): java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of org.apache.spark.rdd.MapPartitionsRDD
I am able to write Parquet files to MinIO, but not a Delta table:
df1.write.partitionBy(['asset_id']).format("delta").mode("append").option("mergeSchema", "true").save("s3a://test/asset-table")
Py4JJavaError: An error occurred while calling o195.save.
: org.apache.spark.SparkException: Job aborted.
I'm attempting to run the following commands using the "%spark" interpreter in Apache Zeppelin:
val data = spark.range(0, 5)
data.write.format("delta").save("/tmp/delta-table")
Which yields this output (truncated to omit repeat output):
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 7, 192.168.64.3, executor 2): java.io.FileNotFoundException: File file:/tmp/delta-table/_delta_log/00000000000000000000.json does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
...
I'm unable to figure out why this is happening at all as I'm too unfamiliar with Spark. Any tips? Thanks for your help.
My problem is about connecting from Apache Spark to MongoDB using the official connector.
Stack versions are as follows:
Apache Spark 2.2.0 (HDP build: 2.2.0.2.6.3.0-235)
MongoDB 3.4.10 (2-node replica set with authentication)
I use these jars:
mongo-spark-connector-assembly-2.2.0.jar, which I tried both downloading from the Maven repo and building myself with a proper Mongo driver version
mongo-java-driver.jar downloaded from Maven Repo
The issue is about version correspondence, as mentioned here and here.
They all say that the method was renamed in Spark 2.2.0, so I need to use the connector built for 2.2.0. That is indeed the case: here is the method in Spark connector 2.1.1, and here is the renamed one in 2.2.0.
But I am sure that I am using the proper one. I followed these steps:
git clone https://github.com/mongodb/mongo-spark.git
cd mongo-spark
git checkout tags/2.2.0
sbt check
sbt assembly
scp target/scala-2.11/mongo-spark-connector_2.11-2.2.0.jar user@remote-spark-server:/opt/jars
All tests passed. After that, I used pyspark and Zeppelin (so the deploy mode is client) to read some data from MongoDB:
df = sqlc.read.format("com.mongodb.spark.sql.DefaultSource") \
.option('spark.mongodb.input.uri', 'mongodb://user:password@172.22.100.231:27017,172.22.100.234:27017/dmp?authMechanism=SCRAM-SHA-1&authSource=admin&replicaSet=rs0&connectTimeoutMS=300&readPreference=nearest') \
.option('collection', 'verification') \
.option('sampleSize', '10') \
.load()
df.show()
And got this error:
Py4JJavaError: An error occurred while calling o86.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 0.0 (TID 3, worker01, executor 1): java.lang.NoSuchMethodError:org.apache.spark.sql.catalyst.analysis.TypeCoercion$.findTightestCommonTypeOfTwo()Lscala/Function2;
I am fairly sure this method is not in the jar: TypeCoercion$.findTightestCommonTypeOfTwo()
I go to the SparkUI and look at the environment:
spark.repl.local.jars file:file:/opt/jars/mongo-spark-connector-assembly-2.2.0.jar,file:/opt/jars/mongo-java-driver-3.6.1.jar
There are no other MongoDB-related files anywhere.
Please help, what am I doing wrong? Thanks in advance.
It was the file cache. This advice helped: https://community.hortonworks.com/content/supportkb/150578/how-to-clear-local-file-cache-and-user-cache-for-y-1.html
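When a stale copy of a jar is suspected, it can help to check which jar on a node still refers to the old method name. A minimal sketch, assuming the jar is readable as an ordinary zip archive and that method names appear as plain strings inside compiled class files; the function name classes_referencing is hypothetical and only Python's zipfile module is used:

```python
import zipfile


def classes_referencing(jar_path, name):
    """List the .class entries in a jar whose bytes contain `name`."""
    needle = name.encode("utf-8")
    hits = []
    with zipfile.ZipFile(jar_path) as jar:
        for entry in jar.namelist():
            # jar.read returns the decompressed bytes of the entry.
            if entry.endswith(".class") and needle in jar.read(entry):
                hits.append(entry)
    return hits
```

For example, calling classes_referencing on each connector jar found on a worker with the name "findTightestCommonTypeOfTwo" would show whether that copy was compiled against the older Spark API. A non-empty result only indicates that the string occurs in the class file, so it is a hint rather than proof.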
I'm using Zeppelin-Sandbox 0.5.6 with Spark 1.6.1 on Amazon EMR.
I am reading csv file located on s3.
The problem is that I sometimes get an error reading the file. I need to restart the interpreter several times until it works. Nothing in my code changes. I can't reproduce it reliably, and I can't tell when it will happen.
My code is as follows:
defining dependencies:
%dep
z.reset()
z.addRepo("Spark Packages Repo").url("http://dl.bintray.com/spark-packages/maven")
z.load("com.databricks:spark-csv_2.10:1.4.0")
using spark-csv:
%pyspark
import pyspark.sql.functions as func
df = sqlc.read.format("com.databricks.spark.csv").option("header", "true").load("s3://some_location/some_csv.csv")
error msg:
Py4JJavaError: An error occurred while calling o61.load. :
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3
in stage 0.0 (TID 3, ip-172-22-2-187.ec2.internal):
java.io.InvalidClassException: com.databricks.spark.csv.CsvRelation;
local class incompatible: stream classdesc serialVersionUID =
2004612352657595167, local class serialVersionUID =
6879416841002809418
...
Caused by: java.io.InvalidClassException:
com.databricks.spark.csv.CsvRelation; local class incompatible
Once I'm reading the csv into the dataframe, the rest of the code works fine.
Any advice?
Thanks!
You need to start Spark with the spark-csv package added, like this:
$ pyspark --packages com.databricks:spark-csv_2.10:1.2.0
spark-csv will then be on your classpath.
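To confirm which class is affected when sifting through long executor logs, the mismatch message can be extracted mechanically. A minimal sketch, assuming the log uses the message format shown in the question (possibly wrapped across lines); the function name find_uid_mismatches is hypothetical:

```python
import re

# Matches the message Java prints for a serialVersionUID mismatch,
# tolerating line wraps between the words.
_PATTERN = re.compile(
    r"java\.io\.InvalidClassException:\s*(?P<cls>[\w.$]+);\s*"
    r"local\s+class\s+incompatible:\s*"
    r"stream\s+classdesc\s+serialVersionUID\s*=\s*(?P<stream>-?\d+),\s*"
    r"local\s+class\s+serialVersionUID\s*=\s*(?P<local>-?\d+)"
)


def find_uid_mismatches(log_text):
    """Return (class_name, stream_uid, local_uid) for each mismatch found."""
    return [
        (m.group("cls"), int(m.group("stream")), int(m.group("local")))
        for m in _PATTERN.finditer(log_text)
    ]
```

The class name in each result points at the library whose versions differ between the driver and the executors, which here would be spark-csv.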
Any Spark job I run that involves HBase access results in the errors below. My own jobs are in Scala, but the supplied Python examples fail the same way. The cluster is Cloudera, running CDH 5.4.4. The same jobs run fine on a different cluster with CDH 5.3.1.
Any help is greatly appreciated!
...
15/08/15 21:46:30 WARN TableInputFormatBase: initializeTable called multiple times. Overwriting connection and table reference; TableInputFormatBase will not close these old references when done.
...
15/08/15 21:46:32 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, some.server.name): java.io.IOException: Cannot create a record reader because of a previous error. Please look at the previous logs lines from the task's full log for more details.
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:163)
...
Caused by: java.lang.IllegalStateException: The input format instance has not been properly initialized. Ensure you call initializeTable either in your constructor or initialize method
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getTable(TableInputFormatBase.java:389)
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:158)
... 14 more
Run spark-shell with these parameters:
--driver-class-path .../cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar --driver-java-options "-Dspark.executor.extraClassPath=.../cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar"
Why it works is described here.