My problem is about connecting from Apache Spark to MongoDB using the official connector.
Stack versions are as follows:
Apache Spark 2.2.0 (HDP build: 2.2.0.2.6.3.0-235)
MongoDB 3.4.10 (2-node replica set with authentication)
I use these jars (see the illustrative sketch after this list):
mongo-spark-connector-assembly-2.2.0.jar, which I tried both downloading from the Maven repo and building myself with the proper MongoDB driver version
mongo-java-driver.jar, downloaded from the Maven repo
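(For context, a minimal sketch of one way such jars can be handed to a PySpark session via spark.jars; this is illustrative only and not my exact Zeppelin configuration.)
from pyspark.sql import SparkSession

# Illustrative only: attach the connector and driver jars when building a session.
# The paths match the ones listed above; Zeppelin usually takes them from the
# interpreter settings instead.
spark = SparkSession.builder \
    .appName("mongo-test") \
    .config("spark.jars",
            "/opt/jars/mongo-spark-connector-assembly-2.2.0.jar,"
            "/opt/jars/mongo-java-driver-3.6.1.jar") \
    .getOrCreate()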
The issue is about version correspondence, as mentioned here and here.
They all say that the method was renamed in Spark 2.2.0, so I need to use the connector built for 2.2.0 - and I do: here is the method in Spark connector 2.1.1, and here is the renamed one in 2.2.0.
But I am sure that I use the proper one. I did these steps:
git clone https://github.com/mongodb/mongo-spark.git
cd mongo-spark
git checkout tags/2.2.0
sbt check
sbt assembly
scp target/scala-2.11/mongo-spark-connector_2.11-2.2.0.jar user@remote-spark-server:/opt/jars
All tests were OK. After that I use pyspark and Zeppelin (so the deploy mode is client) to read some data from MongoDB:
df = sqlc.read.format("com.mongodb.spark.sql.DefaultSource") \
.option('spark.mongodb.input.uri', 'mongodb://user:password@172.22.100.231:27017,172.22.100.234:27017/dmp?authMechanism=SCRAM-SHA-1&authSource=admin&replicaSet=rs0&connectTimeoutMS=300&readPreference=nearest') \
.option('collection', 'verification') \
.option('sampleSize', '10') \
.load()
df.show()
And got this error:
Py4JJavaError: An error occurred while calling o86.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 0.0 (TID 3, worker01, executor 1): java.lang.NoSuchMethodError:org.apache.spark.sql.catalyst.analysis.TypeCoercion$.findTightestCommonTypeOfTwo()Lscala/Function2;
And I am sure this method is not in the jar: TypeCoercion$.findTightestCommonTypeOfTwo()
I went to the Spark UI and looked at the environment:
spark.repl.local.jars file:file:/opt/jars/mongo-spark-connector-assembly-2.2.0.jar,file:/opt/jars/mongo-java-driver-3.6.1.jar
And there are no other MongoDB-related jars anywhere.
Please help, what am I doing wrong? Thanks in advance.
It was the filecache... This advice helped: https://community.hortonworks.com/content/supportkb/150578/how-to-clear-local-file-cache-and-user-cache-for-y-1.html
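For anyone hitting the same thing, a quick sanity check (just a sketch, assuming an active SparkSession named spark, as in pyspark or Zeppelin) is to ask the driver JVM which jar it actually loaded the connector class from; note that this only inspects the driver, while in my case the stale copy lived in the executors' YARN filecache:
# Sketch: print the jar the driver's JVM loaded the connector class from.
# Depending on how the jar was added, this lookup may require it to be on the
# driver classpath (e.g. spark.driver.extraClassPath).
jvm = spark.sparkContext._jvm
cls = jvm.java.lang.Class.forName("com.mongodb.spark.sql.DefaultSource")
print(cls.getProtectionDomain().getCodeSource().getLocation())
# expected something like: file:/opt/jars/mongo-spark-connector-assembly-2.2.0.jar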
I would like to run spatial queries on large data sets, for which e.g. geopandas would be too slow.
I found inspiration here: https://anant-sharma.medium.com/apache-sedona-geospark-using-pyspark-e60485318fbe
In Spark Pool of Synapse Analytics I prepared (via Azure Portal):
Apache Spark Pool / Settings / Packages / Requirement files:
requirement.txt:
azure-storage-file-share
geopandas
apache-sedona
Apache Spark Pool / Settings / Packages / Workspace packages:
geotools-wrapper-geotools-24.1.jar
sedona-sql-3.0_2.12-1.2.0-incubating.jar
Apache Spark Pool / Settings / Packages / Spark configuration
config.txt:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator org.apache.sedona.core.serde.SedonaKryoRegistrator
In a PySpark notebook:
print(spark.version)
print(spark.conf.get("spark.kryo.registrator"))
print(spark.conf.get("spark.serializer"))
The output was:
3.1.2.5.0-58001107
org.apache.sedona.core.serde.SedonaKryoRegistrator
org.apache.spark.serializer.KryoSerializer
Then I tried:
from pyspark.sql import SparkSession
from sedona.register import SedonaRegistrator
from sedona.utils import SedonaKryoRegistrator, KryoSerializer
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Sedona App") \
    .config("spark.serializer", KryoSerializer.getName) \
    .config("spark.kryo.registrator", SedonaKryoRegistrator.getName) \
    .getOrCreate()
SedonaRegistrator.registerAll(spark)
But it failed:
Py4JJavaError: An error occurred while calling o636.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: org.apache.spark.SparkException: Failed to register classes with Kryo
A simple check that everything is correctly installed would probably be that this runs:
%%sql
SELECT ST_Point(0,0);
Please help me get the spatial functions registered in PySpark running in a Synapse notebook!
As per the repro from my end, I'm able to successfully run the above commands without any issue.
I just installed the requirement.txt file (which contains apache-sedona) and downloaded the two jar files below:
sedona-python-adapter-3.0_2.12-1.0.0-incubating.jar
geotools-wrapper-geotools-24.0.jar
Note: config.txt file is not required.
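With that in place, a minimal smoke test (a sketch, run against the pool's pre-created spark session) confirms the spatial functions are registered:
from sedona.register import SedonaRegistrator

# Register Sedona's SQL types and functions on the existing session,
# then run a trivial spatial query.
SedonaRegistrator.registerAll(spark)
spark.sql("SELECT ST_Point(0.0, 0.0) AS pt").show()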
I want to share an interesting error I've run into recently:
Exception in thread "main" java.lang.NoSuchFieldError: TOKEN_KIND
at org.apache.hadoop.crypto.key.kms.KMSClientProvider$KMSTokenRenewer.handleKind(KMSClientProvider.java:166)
at org.apache.hadoop.security.token.Token.getRenewer(Token.java:351)
at org.apache.hadoop.security.token.Token.renew(Token.java:377)
at org.apache.spark.deploy.security.HadoopFSCredentialProvider$$anonfun$getTokenRenewalInterval$1$$anonfun$5$$anonfun$apply$1.apply$mcJ$sp(HadoopFSDelegationTokeProvider.scala:119)
I was trying to spark2-submit a job to a remote driver host on a Cloudera cluster like this:
spark = SparkSession.builder \
    .master("yarn") \
    .config("spark.submit.deployMode", "cluster") \
    .config("spark.driver.host", "remote_driver_host") \
    .config("spark.yarn.keytab", "path_to_principal.keytab") \
    .config("spark.yarn.principal", "principal.name") \
    .config("spark.driver.bindAddress", "0.0.0.0") \
    .getOrCreate()
The Apache Spark and Hadoop versions on the Cloudera cluster are 2.3.0 and 2.6.0, respectively.
So the cause of the issue was quite trivial: a version mismatch between the local Spark binaries and the remote Spark driver.
Locally I had Spark 2.4.5 installed while Cloudera was running 2.3.0; after aligning the versions to 2.3.0, the issue was resolved and the Spark job completed successfully.
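As a quick check before submitting (a sketch; compare the output with the Spark version that Cloudera Manager reports for the cluster):
import pyspark
from pyspark.sql import SparkSession

# Version of the locally installed PySpark/Spark binaries used by the client.
print(pyspark.__version__)

# Version reported by a session built from those local binaries; in my case
# this had to match the 2.3.0 that the Cloudera cluster runs.
spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.version)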
I posted this question some time ago, but it turned out that I was using my local resources instead of the remote ones.
I have a remote machine configured with Spark 2.1.1, Cassandra 3.0.9 and Scala 2.11.6.
On that machine, Cassandra listens on localhost:9042 and the Spark master is bound to 127.0.0.1:7077.
I'm able to connect to Cassandra remotely but unable to do the same thing with Spark.
When connecting to the remote spark master, I get the following error:
ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
Here are my settings in code:
val configuration = new SparkConf(true)
.setAppName("myApp")
.setMaster("spark://xx.xxx.xxx.xxx:7077")
.set("spark.cassandra.connection.host", "xx.xxx.xxx.xxx")
.set("spark.cassandra.connection.port", 9042)
.set("spark.cassandra.input.consistency.level","ONE")
.set("spark.driver.allowMultipleContexts", "true")
val sparkSession = SparkSession
.builder()
.appName("myAppEx")
.config(configuration)
.enableHiveSupport()
.getOrCreate()
I don't understand why Cassandra works just fine and Spark does not.
What's causing this? How can I solve it?
I'm answering this question in order to help other people who are struggling with this problem.
It turned out that it was caused by a mismatch between IntelliJ IDEA's Scala version and the server's one.
The server had Scala ~2.11.6 while the IDE was using Scala ~2.11.8.
In order to make sure the very same version is used, it was necessary to change the IDE's Scala version with the following steps:
File > Project Structure > + > Scala SDK > find and select the server's Scala version > download it if you don't already have it installed > OK > Apply > OK
Could this be a typo? The error message reports port 8990, while in your connection config you have port 8090 for Spark.
I'm trying to join an RDD with a Cassandra table using the Cassandra Spark connector:
samplerdd.joinWithCassandraTable(keyspace, CassandraParams.table)
  .on(SomeColumns(t.date as a.date,
                  t.key as a.key))
It works in standalone mode but when I execute in cluster mode I get this error:
Job aborted due to stage failure: Task 6 in stage 0.0 failed 4 times, most recent failure: Lost task 6.3 in stage 0.0 (TID 20, 10.10.10.51): java.io.InvalidClassException: com.datastax.spark.connector.rdd.CassandraJoinRDD; local class incompatible: stream classdesc serialVersionUID = 6155891939893411978, local class serialVersionUID = 1245204129865863681
I have already checked the jars on the master and the slaves, and they seem to be the same versions.
I'm using Spark 2.0.0, Cassandra 3.7, Cassandra Spark Connector 2.0.0-M2,
Cassandra Driver Core 3.1.0 and Scala 2.11.8.
What could be happening?
Finally solved. Updating the cassandra-driver-core dependency to 3.0.0 works. – Manuel Valero
I'm trying to get a basic Spark example running using the MongoDB Hadoop connector. I'm using Hadoop version 2.6.0 and version 1.3.1 of mongo-hadoop. I'm not sure where exactly to place the jars for this Hadoop version. Here are the locations I've tried:
$HADOOP_HOME/libexec/share/hadoop/mapreduce
$HADOOP_HOME/libexec/share/hadoop/mapreduce/lib
$HADOOP_HOME/libexec/share/hadoop/hdfs
$HADOOP_HOME/libexec/share/hadoop/hdfs/lib
Here is the snippet of code I'm using to load a collection into Hadoop:
Configuration bsonConfig = new Configuration();
bsonConfig.set("mongo.job.input.format", "MongoInputFormat.class");
JavaPairRDD<Object,BSONObject> zipData = sc.newAPIHadoopFile("mongodb://127.0.0.1:27017/zipsdb.zips", MongoInputFormat.class, Object.class, BSONObject.class, bsonConfig);
I get the following error no matter where the jar is placed:
Exception in thread "main" java.io.IOException: No FileSystem for scheme: mongodb
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.addInputPath(FileInputFormat.java:505)
at org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:774)
at org.apache.spark.api.java.JavaSparkContext.newAPIHadoopFile(JavaSparkContext.scala:471)
I don't see any other errors in the Hadoop logs. I suspect I'm missing something in my configuration, or that Hadoop 2.6.0 is not compatible with this connector. Any help is much appreciated.