MongoDB Hadoop error: No FileSystem for scheme: mongodb

I'm trying to get a basic Spark example running using the MongoDB Hadoop connector. I'm using Hadoop 2.6.0 and mongo-hadoop 1.3.1. I'm not sure where exactly to place the jars for this Hadoop version. Here are the locations I've tried:
$HADOOP_HOME/libexec/share/hadoop/mapreduce
$HADOOP_HOME/libexec/share/hadoop/mapreduce/lib
$HADOOP_HOME/libexec/share/hadoop/hdfs
$HADOOP_HOME/libexec/share/hadoop/hdfs/lib
Here is the snippet of code I'm using to load a collection into Hadoop:
Configuration bsonConfig = new Configuration();
bsonConfig.set("mongo.job.input.format", "MongoInputFormat.class");
JavaPairRDD<Object,BSONObject> zipData = sc.newAPIHadoopFile("mongodb://127.0.0.1:27017/zipsdb.zips", MongoInputFormat.class, Object.class, BSONObject.class, bsonConfig);
I get the following error no matter where the jar is placed:
Exception in thread "main" java.io.IOException: No FileSystem for scheme: mongodb
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.addInputPath(FileInputFormat.java:505)
at org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:774)
at org.apache.spark.api.java.JavaSparkContext.newAPIHadoopFile(JavaSparkContext.scala:471)
I don't see any other errors in the Hadoop logs. I suspect I'm missing something in my configuration, or that Hadoop 2.6.0 is not compatible with this connector. Any help is much appreciated.
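One likely cause: newAPIHadoopFile resolves its path argument through the Hadoop FileSystem API, so the mongodb:// URI is handed to FileSystem.getFileSystemClass, which has no handler for that scheme. The mongo-hadoop connector is usually driven through newAPIHadoopRDD instead, with the collection URI supplied as mongo.input.uri in the Configuration. Below is a minimal Scala sketch of that usage (the Java equivalent is JavaSparkContext.newAPIHadoopRDD with the same arguments); the URI is taken from the question, and sc is assumed to be an existing SparkContext.
import com.mongodb.hadoop.MongoInputFormat
import org.apache.hadoop.conf.Configuration
import org.bson.BSONObject

// Point the connector at the collection via mongo.input.uri instead of
// passing a mongodb:// path to newAPIHadoopFile, which gets resolved as a
// Hadoop FileSystem URL.
val mongoConfig = new Configuration()
mongoConfig.set("mongo.input.uri", "mongodb://127.0.0.1:27017/zipsdb.zips")

// newAPIHadoopRDD reads through MongoInputFormat directly, so no FileSystem
// lookup for the mongodb scheme is attempted.
val zipData = sc.newAPIHadoopRDD(
  mongoConfig,
  classOf[MongoInputFormat],
  classOf[Object],
  classOf[BSONObject])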

Related

"Data source org.apache.phoenix.spark does not support streamed writing" in Structured Streaming

I'm trying to load HBase table data via the Phoenix driver using Spark Structured Streaming, and I'm getting the following exception. Please help with this.
Jars:
spark.version: 2.4.0
scala.version: 2.12
phoenix.version: 4.11.0-HBase-1.1
hbase.version: 1.4.4
confluent.version: 5.3.0
spark-sql-kafka-0-10_2.12
Code
val tableDF = sqlContext.phoenixTableAsDataFrame("DATA_TABLE", Array("ID","DEPARTMENT"), conf = configuration)
ERROR
Exception in thread "main" java.lang.UnsupportedOperationException: Data source org.apache.phoenix.spark does not support streamed writing
at org.apache.spark.sql.execution.datasources.DataSource.createSink(DataSource.scala:298)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:322)
at com.spark.streaming.process.StreamProcess.processDataPackets(StreamProcess.scala:81)
at com.spark.streaming.main.$anonfun$start$1(IAlertJob.scala:55)
at com.spark.streaming.main.$anonfun$start$1$adapted(IAlertJob.scala:27)
at com.spark.streaming.main.SparkStreamingApplication.withSparkStreamingContext(SparkStreamingApplication.scala:38)
at com.spark.streaming.main.SparkStreamingApplication.withSparkStreamingContext$(SparkStreamingApplication.scala:23)
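The stack trace shows DataStreamWriter.start failing because org.apache.phoenix.spark only implements batch writes, not a streaming Sink. A common workaround on Spark 2.4 is to write each micro-batch through the batch data source with foreachBatch. A minimal Scala sketch follows, assuming a streaming Dataset named streamingDF; the table name and zkUrl are placeholders.
import org.apache.spark.sql.{DataFrame, SaveMode}

// Write each micro-batch with the batch Phoenix data source, since
// org.apache.phoenix.spark has no streaming sink. Binding the function to a
// val avoids the foreachBatch overload ambiguity seen with Scala 2.12 on
// Spark 2.4.0.
val writeBatch: (DataFrame, Long) => Unit = (batchDF, batchId) => {
  batchDF.write
    .format("org.apache.phoenix.spark")
    .mode(SaveMode.Overwrite)               // phoenix-spark upserts; only Overwrite is supported
    .option("table", "DATA_TABLE")          // placeholder table name
    .option("zkUrl", "zookeeper-host:2181") // placeholder ZooKeeper quorum
    .save()
}

streamingDF.writeStream
  .foreachBatch(writeBatch)
  .start()
  .awaitTermination()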

Hadoop, Spark: java.lang.NoSuchFieldError: TOKEN_KIND

I want to share an interesting error I caught recently:
Exception in thread "main" java.lang.NoSuchFieldError: TOKEN_KIND
at org.apache.hadoop.crypto.key.kms.KMSClientProvider$KMSTokenRenewer.handleKind(KMSClientProvider.java:166)
at org.apache.hadoop.security.token.Token.getRenewer(Token.java:351)
at org.apache.hadoop.security.token.Token.renew(Token.java:377)
at org.apache.spark.deploy.security.HadoopFSCredentialProvider$$anonfun$getTokenRenewalInterval$1$$anonfun$5$$anonfun$apply$1.apply$mcJ$sp(HadoopFSDelegationTokeProvider.scala:119)
I was trying to spark2-submit a job to a remote driver host on Cloudera cluster like this:
spark = SparkSession.builder \
    .master("yarn") \
    .config("spark.submit.deployMode", "cluster") \
    .config("spark.driver.host", "remote_driver_host") \
    .config("spark.yarn.keytab", "path_to_principal.keytab") \
    .config("spark.yarn.principal", "principal.name") \
    .config("spark.driver.bindAddress", "0.0.0.0") \
    .getOrCreate()
The Apache Spark and Hadoop versions on the Cloudera cluster are 2.3.0 and 2.6.0, respectively.
The cause of the issue was quite trivial: a version mismatch between the local Spark binaries and the remote Spark driver.
Locally I had Spark 2.4.5 installed, while the Cloudera cluster ran 2.3.0. After aligning the local version to 2.3.0, the issue was resolved and the Spark job completed successfully.

Spark 2.2.0 unable to connect to Phoenix 4.11.0 version in loading the table to DF

I'm using the tech stack below and trying to connect to Phoenix tables using PySpark. I downloaded the following jars from the URL below and tried executing the code underneath. The logs show that the connection to HBase is established, but the console is stuck, doing nothing. Please let me know if anybody has encountered and fixed a similar issue.
https://mvnrepository.com/artifact/org.apache.phoenix/phoenix-spark/4.11.0-HBase-1.2
jars:
phoenix-spark-4.11.0-HBase-1.2.jar
phoenix-client.jar
Tech stack, all running on the same host:
Apache Spark 2.2.0
HBase 1.2
Phoenix 4.11.0
I copied hbase-site.xml to /spark/conf/hbase-site.xml.
Command executed:
usr/local/spark> spark-submit phoenix.py --jars /usr/local/spark/jars/phoenix-spark-4.11.0-HBase-1.2.jar --jars /usr/local/spark/jars/phoenix-client.jar
phoenix.py:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
conf = SparkConf().setAppName("pysparkPhoenixLoad").setMaster("local")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
df = sqlContext.read.format("org.apache.phoenix.spark") \
    .option("table", "schema.table1") \
    .option("zkUrl", "localhost:2181") \
    .load()
df.show()
Error log: the HBase connection is established, but the console hangs and a timeout error is thrown:
18/07/30 12:28:15 WARN HBaseConfiguration: Config option "hbase.regionserver.lease.period" is deprecated. Instead, use "hbase.client.scanner.timeout.period"
18/07/30 12:28:54 INFO RpcRetryingCaller: Call exception, tries=10, retries=35, started=38367 ms ago, cancelled=false, msg=row 'SYSTEM:CATALOG,,' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=master01,16020,1532591192223, seqNum=0
Take a look at these answers:
phoenix jdbc doesn't work, no exceptions and stuck
HBase Java client - unknown host: localhost.localdomain
Both of the issues happened in Java (with JDBC), but it looks like it's a similar issue here.
Try adding the ZooKeeper hostname (master01, as I see in the error message) to your /etc/hosts:
127.0.0.1 master01
if you are running all your stack locally.

Unable to connect to remote Spark Master - ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message

I posted this question some time ago, but it turned out that I was using my local resources instead of the remote ones.
I have a remote machine configured with Spark 2.1.1, Cassandra 3.0.9, and Scala 2.11.6.
Cassandra is configured at localhost:9032 and the Spark master at localhost:7077.
The Spark master is set to 127.0.0.1 and its port to 7077.
I'm able to connect to Cassandra remotely, but I'm unable to do the same with Spark.
When connecting to the remote spark master, I get the following error:
ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
Here are my settings via code
val configuration = new SparkConf(true)
.setAppName("myApp")
.setMaster("spark://xx.xxx.xxx.xxx:7077")
.set("spark.cassandra.connection.host", "xx.xxx.xxx.xxx")
.set("spark.cassandra.connection.port", 9042)
.set("spark.cassandra.input.consistency.level","ONE")
.set("spark.driver.allowMultipleContexts", "true")
val sparkSession = SparkSession
.builder()
.appName("myAppEx")
.config(configuration)
.enableHiveSupport()
.getOrCreate()
I don't understand why Cassandra works just fine and Spark does not.
What's causing this? How can I solve it?
I'm answering this question in order to help other people who are struggling with this problem.
It turned out that it was caused by a mismatch between IntelliJ IDEA's Scala version and the server's one.
The server had Scala 2.11.6 while the IDE was using Scala 2.11.8.
To make sure the very same version was used, it was necessary to change the IDE's Scala version with the following steps:
File > Project Structure > + > Scala SDK > find and select the server's Scala version > download it if it isn't already installed > OK > Apply > OK
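If the project is built with sbt, it can also help to pin the build itself to the server's versions so the IDE-imported SDK cannot drift again. A minimal build.sbt sketch, using the Scala and Spark versions mentioned above (the Cassandra connector version is an assumption):
// build.sbt -- pin the Scala version to the one running on the server
// (2.11.6 in this question) so the IDE imports the matching SDK.
scalaVersion := "2.11.6"

// Spark is provided by the cluster at runtime; the connector version here
// is only an example from the 2.0.x line that targets Spark 2.1.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "2.1.1" % "provided",
  "com.datastax.spark" %% "spark-cassandra-connector" % "2.0.5"
)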
Could this be a typo? The error message reports 8990, while in your connection config you have port 8090 for Spark.

How to use s3 with Apache spark 2.2 in the Spark shell

I'm trying to load data from an Amazon AWS S3 bucket, while in the Spark shell.
I have consulted the following resources:
Parsing files from Amazon S3 with Apache Spark
How to access s3a:// files from Apache Spark?
Hortonworks Spark 1.6 and S3
Cloudera
Custom s3 endpoints
I have downloaded and unzipped Apache Spark 2.2.0. In conf/spark-defaults.conf I have the following (note that I replaced access-key and secret-key):
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key=access-key
spark.hadoop.fs.s3a.secret.key=secret-key
I have downloaded hadoop-aws-2.8.1.jar and aws-java-sdk-1.11.179.jar from mvnrepository, and placed them in the jars/ directory. I then start the Spark shell:
bin/spark-shell --jars jars/hadoop-aws-2.8.1.jar,jars/aws-java-sdk-1.11.179.jar
In the shell, here is how I try to load data from the S3 bucket:
val p = spark.read.textFile("s3a://sparkcookbook/person")
And here is the error that results:
java.lang.NoClassDefFoundError: org/apache/hadoop/fs/GlobalStorageStatistics$StorageStatisticsProvider
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
When I instead try to start the Spark shell as follows:
bin/spark-shell --packages org.apache.hadoop:hadoop-aws:2.8.1
Then I get two errors: one when the interpreter starts, and another when I try to load the data. Here is the first:
:: problems summary ::
:::: ERRORS
unknown resolver null
unknown resolver null
unknown resolver null
unknown resolver null
unknown resolver null
unknown resolver null
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
And here is the second:
val p = spark.read.textFile("s3a://sparkcookbook/person")
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:195)
at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
at org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:139)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:301)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:506)
at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:542)
at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:515)
Could someone suggest how to get this working? Thanks.
If you are using Apache Spark 2.2.0, then you should use hadoop-aws-2.7.3.jar and aws-java-sdk-1.7.4.jar.
$ spark-shell --jars jars/hadoop-aws-2.7.3.jar,jars/aws-java-sdk-1.7.4.jar
After that, when you try to load data from the S3 bucket in the shell, you will be able to do so.
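If editing conf/spark-defaults.conf is inconvenient, the same s3a settings can also be applied from inside the shell on the running context. A minimal sketch, with access-key and secret-key as placeholders:
// Set the s3a credentials on the running Hadoop configuration; this mirrors
// the spark.hadoop.fs.s3a.* entries in spark-defaults.conf.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "access-key")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "secret-key")

// With the matching hadoop-aws-2.7.3.jar and aws-java-sdk-1.7.4.jar on the
// classpath, the s3a scheme should now resolve.
val p = spark.read.textFile("s3a://sparkcookbook/person")
p.show(5)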