MongoDB Hadoop error : no FileSystem for scheme:mongodb

I'm trying to get a basic Spark example running using the MongoDB Hadoop connector. I'm using Hadoop 2.6.0 and mongo-hadoop 1.3.1. I'm not sure where exactly to place the jars for this Hadoop version. Here are the locations I've tried:
Here is the snippet of code I'm using to load a collection into Hadoop:
Configuration bsonConfig = new Configuration();
bsonConfig.set("mongo.job.input.format", "MongoInputFormat.class");
JavaPairRDD<Object,BSONObject> zipData = sc.newAPIHadoopFile("mongodb://", MongoInputFormat.class, Object.class, BSONObject.class, bsonConfig);
I get the following error no matter where the jar is placed:
Exception in thread "main" No FileSystem for scheme: mongodb
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(
at org.apache.hadoop.fs.FileSystem.createFileSystem(
at org.apache.hadoop.fs.FileSystem.access$200(
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(
at org.apache.hadoop.fs.FileSystem$Cache.get(
at org.apache.hadoop.fs.FileSystem.get(
at org.apache.hadoop.fs.Path.getFileSystem(
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.addInputPath(
at org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:774)
I don't see any other errors in the Hadoop logs. I suspect I'm missing something in my configuration, or that Hadoop 2.6.0 is not compatible with this connector. Any help is much appreciated.
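Going by the stack trace, newAPIHadoopFile hands the "mongodb://" string to FileInputFormat.addInputPath, which treats it as a file path and looks for a Hadoop FileSystem registered for the mongodb scheme. One commonly suggested alternative is to put the URI in the mongo.input.uri configuration key and call newAPIHadoopRDD, which takes no path at all. Below is a minimal PySpark sketch of that idea, not a verified fix for this setup. The function name, the example URI, and the key/value class names are assumptions for illustration; the Java API has an analogous newAPIHadoopRDD(conf, MongoInputFormat.class, Object.class, BSONObject.class) call.

```python
def load_mongo_collection(sc, input_uri):
    """Read a MongoDB collection as an RDD through the mongo-hadoop input format.

    `sc` is an existing SparkContext; `input_uri` is a hypothetical URI such as
    "mongodb://localhost:27017/mydb.mycollection".
    """
    # The collection is named in the job configuration, not as a file path,
    # so no Hadoop FileSystem lookup for the "mongodb" scheme takes place.
    conf = {"mongo.input.uri": input_uri}
    return sc.newAPIHadoopRDD(
        inputFormatClass="com.mongodb.hadoop.MongoInputFormat",
        # Assumed key/value classes; other Writable types may work as well.
        keyClass="org.apache.hadoop.io.Text",
        valueClass="org.apache.hadoop.io.MapWritable",
        conf=conf,
    )
```

The connector jar and the MongoDB Java driver still have to be on the driver and executor classpath for the input format class to be found.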


"Data source org.apache.phoenix.spark does not support streamed writing" in Structured Streaming

**I'm trying to connect to Phoenix from Spark Structured Streaming, and I get the following exception when I try to load the HBase table data via the Phoenix driver. Please help.**
spark.version: 2.4.0
scala.version: 2.12
phoenix.version: 4.11.0-HBase-1.1
hbase.version: 1.4.4
confluent.version: 5.3.0
val tableDF = sqlContext.phoenixTableAsDataFrame("DATA_TABLE", Array("ID","DEPARTMENT"), conf = configuration)
Exception in thread "main" java.lang.UnsupportedOperationException: Data source org.apache.phoenix.spark does not support streamed writing
at org.apache.spark.sql.execution.datasources.DataSource.createSink(DataSource.scala:298)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:322)
at com.spark.streaming.process.StreamProcess.processDataPackets(StreamProcess.scala:81)
at com.spark.streaming.main.$anonfun$start$1(IAlertJob.scala:55)
at com.spark.streaming.main.$anonfun$start$1$adapted(IAlertJob.scala:27)
at com.spark.streaming.main.SparkStreamingApplication.withSparkStreamingContext(SparkStreamingApplication.scala:38)
at com.spark.streaming.main.SparkStreamingApplication.withSparkStreamingContext$(SparkStreamingApplication.scala:23)
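The trace fails in DataStreamWriter.start, where Spark asks the data source for a streaming sink and the Phoenix connector does not provide one. A workaround often used for batch-only sources on Spark 2.4 is foreachBatch, which hands each micro-batch to an ordinary batch writer. Here is a minimal PySpark sketch of that pattern, assuming the connector's documented batch write options; the function name, the table name OUTPUT_TABLE and the zkUrl value are placeholders, and I have not checked it against this particular Phoenix/Spark/Scala combination.

```python
def write_batch_to_phoenix(batch_df, batch_id,
                           table="OUTPUT_TABLE", zk_url="localhost:2181"):
    """Write one micro-batch to Phoenix with the batch (non-streaming) writer.

    `table` and `zk_url` are hypothetical defaults; replace them with real values.
    """
    (batch_df.write
        .format("org.apache.phoenix.spark")
        .mode("overwrite")  # save mode used in the connector's batch examples
        .option("table", table)
        .option("zkUrl", zk_url)
        .save())

# Usage sketch (stream_df is an existing streaming DataFrame):
# query = stream_df.writeStream.foreachBatch(write_batch_to_phoenix).start()
```

With this shape the streaming query never asks Phoenix for a sink, so createSink is not called for it.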

Hadoop, Spark: java.lang.NoSuchFieldError: TOKEN_KIND

I want to share an interesting error I ran into recently:
Exception in thread "main" java.lang.NoSuchFieldError: TOKEN_KIND
at org.apache.hadoop.crypto.key.kms.KMSClientProvider$KMSTokenRenewer.handleKind(
I was trying to spark2-submit a job to a remote driver host on a Cloudera cluster like this:
spark = SparkSession.builder \
.config("", "remote_driver_host") \
.config("spark.yarn.keytab", "path_to_principal.keytab") \
.config("spark.yarn.principal", "") \
.config("spark.driver.bindAddress", "") \
The Apache Spark and Hadoop versions on the Cloudera cluster are 2.3.0 and 2.6.0 respectively.
The cause turned out to be quite trivial: a version mismatch between the local Spark binaries and the remote Spark driver.
Locally I had Spark 2.4.5 installed while Cloudera was running 2.3.0. After aligning the local version to 2.3.0, the issue was resolved and the Spark job completed successfully.
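A small check like the one below can catch this kind of mismatch before submitting. It is only a sketch: the function name is made up, and it assumes that comparing the first three version fields is enough, ignoring vendor suffixes such as ".cloudera2".

```python
def same_spark_version(local_version, cluster_version):
    """Return True when both version strings share major.minor.patch."""
    def core(version):
        # Keep the first three dot-separated fields, dropping vendor suffixes.
        return tuple(version.split(".")[:3])
    return core(local_version) == core(cluster_version)
```

Locally, pyspark.__version__ gives the client version, and running spark2-submit --version on the cluster prints the version installed there.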

Spark 2.2.0 unable to connect to Phoenix 4.11.0 version in loading the table to DF

I'm using the tech stack below and trying to connect to Phoenix tables from PySpark. I downloaded the following jars from the url and tried executing the code below. The logs show that the connection to HBase is established, but the console hangs without doing anything. Please let me know if anybody has encountered and fixed a similar issue.
Tech Stack all running in same host:
Apache Spark 2.2.0 Version
Hbase 1.2 Version
Phoenix 4.11.0 Version
Copied the hbase-site.xml in the folder path /spark/conf/hbase-site.xml.
Command executed ->
usr/local/spark> spark-submit --jars /usr/local/spark/jars/phoenix-spark-4.11.0-HBase-1.2.jar --jars /usr/local/spark/jars/phoenix-client.jar
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
conf = SparkConf().setAppName("pysparkPhoenixLoad").setMaster("local")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
df = sqlContext.read.format("org.apache.phoenix.spark").option("table",
"schema.table1").option("zkUrl", "localhost:2181").load()
Error log: the HBase connection is established, but the console hangs and a timeout error is thrown:
18/07/30 12:28:15 WARN HBaseConfiguration: Config option "" is deprecated. Instead, use "hbase.client.scanner.timeout.period"
18/07/30 12:28:54 INFO RpcRetryingCaller: Call exception, tries=10, retries=35, started=38367 ms ago, cancelled=false, msg=row 'SYSTEM:CATALOG,,' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=master01,16020,1532591192223, seqNum=0
Try adding an entry for the hostname master01 (which appears in the error message) to your /etc/hosts if you are running your whole stack locally.
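To confirm whether the hostname from the log resolves on the machine running Spark, a quick check along these lines may help. It is a minimal sketch; the function name and the injectable resolver argument are my own additions for illustration.

```python
import socket

def can_resolve(hostname, resolver=socket.gethostbyname):
    """Return True if `hostname` resolves to an address on this machine."""
    try:
        resolver(hostname)
        return True
    except OSError:
        # socket.gaierror (a subclass of OSError) is raised for unknown hosts.
        return False
```

If can_resolve("master01") returns False, the HBase client cannot reach the region server by the name it gets back from ZooKeeper, which would match the retries seen in the log.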

Unable to connect to remote Spark Master - ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message

I posted this question some time ago, but it turned out that I was using my local resources instead of the remote ones.
I have a remote machine configured with Spark 2.1.1, Cassandra 3.0.9 and Scala 2.11.6.
Cassandra is configured at localhost:9032 and the Spark master at localhost:7077.
Spark master is set to and its port to 7077.
I'm able to connect to Cassandra remotely but unable to do the same with Spark.
When connecting to the remote spark master, I get the following error:
ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
Here are my settings via code
val configuration = new SparkConf(true)
.set("", "")
.set("spark.cassandra.connection.port", "9042")
.set("spark.driver.allowMultipleContexts", "true")
val sparkSession = SparkSession
I don't understand why Cassandra works just fine and Spark does not.
What's causing this? How can I solve it?
I'm answering this question to help other people who are struggling with this problem.
It turned out that it was caused by a mismatch between IntelliJ IDEA's Scala version and the server's.
The server had Scala 2.11.6 while the IDE was using Scala 2.11.8.
To make sure both use the very same version, I changed the IDE's Scala version with the following steps:
File > Project Structure > + > Scala SDK > Find and select the server's Scala version > Download it if you don't already have it installed > Ok > Apply > Ok
Could this be a typo? The error message reports 8990, but in your connection config you have port 8090 for Spark.

How to use s3 with Apache spark 2.2 in the Spark shell

I'm trying to load data from an Amazon AWS S3 bucket, while in the Spark shell.
I have consulted the following resources:
Parsing files from Amazon S3 with Apache Spark
How to access s3a:// files from Apache Spark?
Hortonworks Spark 1.6 and S3
Custom s3 endpoints
I have downloaded and unzipped Apache Spark 2.2.0. In conf/spark-defaults I have the following (note I replaced access-key and secret-key):
I have downloaded hadoop-aws-2.8.1.jar and aws-java-sdk-1.11.179.jar from mvnrepository, and placed them in the jars/ directory. I then start the Spark shell:
bin/spark-shell --jars jars/hadoop-aws-2.8.1.jar,jars/aws-java-sdk-1.11.179.jar
In the shell, here is how I try to load data from the S3 bucket:
val p = spark.read.textFile("s3a://sparkcookbook/person")
And here is the error that results:
java.lang.NoClassDefFoundError: org/apache/hadoop/fs/GlobalStorageStatistics$StorageStatisticsProvider
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(
at org.apache.hadoop.conf.Configuration.getClassByName(
at org.apache.hadoop.conf.Configuration.getClass(
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(
at org.apache.hadoop.fs.FileSystem.createFileSystem(
at org.apache.hadoop.fs.FileSystem.access$200(
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(
at org.apache.hadoop.fs.FileSystem$Cache.get(
at org.apache.hadoop.fs.FileSystem.get(
at org.apache.hadoop.fs.Path.getFileSystem(
When I instead try to start the Spark shell as follows:
bin/spark-shell --packages org.apache.hadoop:hadoop-aws:2.8.1
Then I get two errors: one when the interpreter starts, and another when I try to load the data. Here is the first:
:: problems summary ::
unknown resolver null
unknown resolver null
unknown resolver null
unknown resolver null
unknown resolver null
unknown resolver null
And here is the second:
val p = spark.read.textFile("s3a://sparkcookbook/person")
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(
at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(
at org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(
at org.apache.hadoop.fs.FileSystem.createFileSystem(
at org.apache.hadoop.fs.FileSystem.access$200(
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(
at org.apache.hadoop.fs.FileSystem$Cache.get(
at org.apache.hadoop.fs.FileSystem.get(
at org.apache.hadoop.fs.Path.getFileSystem(
at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:301)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:506)
at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:542)
at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:515)
Could someone suggest how to get this working? Thanks.
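If you are using Apache Spark 2.2.0, then you should use hadoop-aws-2.7.3.jar and aws-java-sdk-1.7.4.jar. The hadoop-aws jar has to match the Hadoop version bundled with your Spark distribution (2.7.3 for the default Spark 2.2.0 download); the 2.8.1 jar expects Hadoop 2.8 classes such as GlobalStorageStatistics, which is why the NoClassDefFoundError appears.
$ spark-shell --jars jars/hadoop-aws-2.7.3.jar,jars/aws-java-sdk-1.7.4.jar
After that, you should be able to load data from the S3 bucket in the shell.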
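If you are not sure which Hadoop version your Spark download bundles, the hadoop-common jar in Spark's jars/ directory tells you. The sketch below derives the matching hadoop-aws Maven coordinate from a list of jar file names. The function name is made up for illustration, and it assumes plain three-part versions such as 2.7.3.

```python
import re

def hadoop_aws_coordinate(jar_names):
    """Derive the hadoop-aws Maven coordinate from the bundled hadoop-common jar.

    Returns None when no hadoop-common-<x.y.z>.jar is present in the list.
    """
    for name in jar_names:
        match = re.fullmatch(r"hadoop-common-(\d+\.\d+\.\d+)\.jar", name)
        if match:
            # hadoop-aws should carry the same version as hadoop-common.
            return "org.apache.hadoop:hadoop-aws:" + match.group(1)
    return None
```

For example, hadoop_aws_coordinate(os.listdir("jars")) run from the Spark home directory (after import os) gives a coordinate that can be passed to bin/spark-shell --packages, which should also pull in the AWS SDK version that hadoop-aws release depends on.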