java.lang.NoSuchMethodError: com.mongodb.internal.connection.Cluster.selectServer - mongodb

I am new to Apache Spark and I am using Scala and MongoDB to learn it.
https://docs.mongodb.com/spark-connector/current/scala-api/
I am trying to read an RDD from my MongoDB database; my notebook script is below:
import com.mongodb.spark.config._
import com.mongodb.spark._
val readConfig = ReadConfig(Map("uri" -> "mongodb+srv://root:root#mongodbcluster.td5gp.mongodb.net/test_database.test_collection?retryWrites=true&w=majority"))
val testRDD = MongoSpark.load(sc, readConfig)
print(testRDD.collect)
At the print(testRDD.collect) line, I got this error:
java.lang.NoSuchMethodError:
com.mongodb.internal.connection.Cluster.selectServer(Lcom/mongodb/selector/ServerSelector;)Lcom/mongodb/internal/connection/Server;
followed by more than 10 stack-trace lines starting with "at ...".
Used libraries:
org.mongodb.spark:mongo-spark-connector_2.12:3.0.1
org.mongodb.scala:mongo-scala-driver_2.12:4.2.3
Is this a problem with MongoDB's internal libraries, and how can I fix it?
Many thanks

I suspect that there is a conflict between mongo-spark-connector and mongo-scala-driver. The former uses MongoDB Java driver 4.0.5, but the latter is based on version 4.2.3. I would recommend trying with only mongo-spark-connector, as sketched below.
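For example, an sbt dependency block that keeps only the connector might look like the sketch below (the Spark coordinates and versions are assumptions, not taken from the question; only the connector line matches the question's library list):

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.0.1" % "provided",  // assumed Spark version
  "org.apache.spark" %% "spark-sql"  % "3.0.1" % "provided",
  "org.mongodb.spark" %% "mongo-spark-connector" % "3.0.1"
  // no explicit mongo-scala-driver entry: let the connector pull in the Java driver it was built against
)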

I was also facing the same problem. I solved it using the mongo-spark-connector_2.12:3.0.1 jar, and along with it I also added the Scalaj HTTP 2.4.2 jar. It's working fine now.

Related

load pmml (generated by sklearn) in spark to predict but get error

I am following the jpmml-evaluator-spark instructions to load a local PMML model.
My code is like below:
import java.io.File
import org.jpmml.evaluator.spark._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._
// load pmml
val pmmlFile = new File(getClass.getClassLoader.getResource("random_forest.pmml").getFile)
// create evaluator
val evaluator = EvaluatorUtil.createEvaluator(pmmlFile)
I cannot show the error message directly, so I put it here
Guesses:
There are some reasons I think may cause this problem:
1. "jpmml-evaluator-spark" does not support PMML 4.3, even though the author said the new version 1.1.0 already supports PMML 4.3.
2. There is some problem with my "random_forest.pmml", because the file came from someone else.
Note:
Development environment:
Spark 2.1.1
Scala 2.11.8
I run locally on a Mac; the system version is OS X El Capitan 10.11.6.
You are using Apache Spark 2.0, 2.1 or 2.2, which has prepended a legacy version of the JPMML-Model library (1.2.15, to be precise) to your application classpath. This issue is documented in SPARK-15526.
Solution: fix your application classpath as described in the JPMML-Evaluator-Spark documentation (alternatively, consider switching to Apache Spark 2.3.0 or newer).
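As a quick diagnostic (my own sketch, not part of the JPMML documentation), you can print which jar the JPMML-Model classes are actually loaded from; if it points at a Spark jar instead of your own dependency, the legacy copy is still winning:

// org.dmg.pmml.PMML is the root class of the JPMML-Model library
println(classOf[org.dmg.pmml.PMML].getProtectionDomain.getCodeSource.getLocation)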
Another option for using PMML in Spark is PMML4S-Spark, which supports the latest PMML 4.4. For example:
import org.pmml4s.spark.ScoreModel
val model = ScoreModel.fromFile(pmmlFile)
val scoreDf = model.transform(df)
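A slightly fuller, self-contained sketch of the same idea (assumptions: a pmml4s-spark dependency on the classpath, fromFile accepting a local file path, and a hypothetical CSV input whose columns match the model's input fields):

import org.apache.spark.sql.SparkSession
import org.pmml4s.spark.ScoreModel

val spark = SparkSession.builder().appName("Pmml4sExample").master("local[*]").getOrCreate()

// Load the PMML model from a local file.
val model = ScoreModel.fromFile("random_forest.pmml")

// Score an input DataFrame; the result carries the model's output columns.
val inputDf = spark.read.option("header", "true").csv("input.csv")  // hypothetical input file
val scoredDf = model.transform(inputDf)
scoredDf.show()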

Install Local spark on OSX

I am trying to run my Scala job on my local machine (a MacBook Pro, OS X 10.13.3) and I am getting an error at runtime.
My versions:
scala: 2.11.12
spark: 2.3.0
hadoop: 3.0.0
I installed everything through brew.
The exception is:
Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 2
It happens at these lines:
val conf = new SparkConf()
.setAppName(getName)
.setMaster("local[2]")
val context = new SparkContext(conf)
The last line is where the exception is thrown.
My theory is that the Hadoop and Spark versions aren't working together, but I can't find online which Hadoop version should be used for Spark 2.3.0.
Thank you.
So I figured out my problem.
First, yes, I don't need Hadoop installed. Thanks for pointing that out.
And second, I had Java 10 installed instead of Java 8. Removing it solved the rest of the problems; a quick check is sketched below.
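(For anyone hitting the same StringIndexOutOfBoundsException: a quick sanity check, sketched here, is to print the JVM version Spark is actually running on. Spark 2.3 expects Java 8; the "begin 0, end 3, length 2" typically means something tried to take the first three characters of the Java version string, which is only "10" on Java 10.)

// Should print something like 1.8.0_xxx for a working Spark 2.3 setup.
println(System.getProperty("java.version"))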
Thank you everyone!

Netezza connection with Spark / Scala JDBC

I've set up Spark 2.2.0 on my Windows machine using Scala 2.11.8 on IntelliJ IDE. I'm trying to make Spark connect to Netezza using JDBC drivers.
I've read through this link and added the com.ibm.spark.netezza jars to my project through Maven. I attempt to run the Scala script below just to test the connection:
package jdbc

object SimpleScalaSpark {
  def main(args: Array[String]) {
    import org.apache.spark.sql.{SparkSession, SQLContext}
    import com.ibm.spark.netezza

    val spark = SparkSession.builder
      .master("local")
      .appName("SimpleScalaSpark")
      .getOrCreate()
    val sqlContext = SparkSession.builder()
      .appName("SimpleScalaSpark")
      .master("local")
      .getOrCreate()

    val nzoptions = Map("url" -> "jdbc:netezza://SERVER:5480/DATABASE",
      "user" -> "USER",
      "password" -> "PASSWORD",
      "dbtable" -> "ADMIN.TABLENAME")
    val df = sqlContext.read.format("com.ibm.spark.netezza").options(nzoptions).load()
  }
}
However I get the following error:
17/07/27 16:28:17 ERROR NetezzaJdbcUtils$: Couldn't find class org.netezza.Driver
java.lang.ClassNotFoundException: org.netezza.Driver
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:38)
at com.ibm.spark.netezza.NetezzaJdbcUtils$$anonfun$getConnector$1.apply(NetezzaJdbcUtils.scala:49)
at com.ibm.spark.netezza.NetezzaJdbcUtils$$anonfun$getConnector$1.apply(NetezzaJdbcUtils.scala:46)
at com.ibm.spark.netezza.DefaultSource.createRelation(DefaultSource.scala:50)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:306)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
at jdbc.SimpleScalaSpark$.main(SimpleScalaSpark.scala:20)
at jdbc.SimpleScalaSpark.main(SimpleScalaSpark.scala)
Exception in thread "main" java.sql.SQLException: No suitable driver found for jdbc:netezza://SERVER:5480/DATABASE
at java.sql.DriverManager.getConnection(DriverManager.java:689)
at java.sql.DriverManager.getConnection(DriverManager.java:208)
at com.ibm.spark.netezza.NetezzaJdbcUtils$$anonfun$getConnector$1.apply(NetezzaJdbcUtils.scala:54)
at com.ibm.spark.netezza.NetezzaJdbcUtils$$anonfun$getConnector$1.apply(NetezzaJdbcUtils.scala:46)
at com.ibm.spark.netezza.DefaultSource.createRelation(DefaultSource.scala:50)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:306)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
at jdbc.SimpleScalaSpark$.main(SimpleScalaSpark.scala:20)
at jdbc.SimpleScalaSpark.main(SimpleScalaSpark.scala)
I have two ideas:
1) I don't believe I actually installed any Netezza JDBC driver, though I thought the jars I brought into my project from the link above were sufficient. Am I just missing a driver, or am I missing something in my Scala script?
2) In the same link, the author mentions starting the Spark Netezza package:
For example, to use the Spark Netezza package with Spark’s interactive
shell, start it as shown below:
$SPARK_HOME/bin/spark-shell --packages com.ibm.SparkTC:spark-netezza_2.10:0.1.1 --driver-class-path ~/nzjdbc.jar
I don't believe I'm invoking any package apart from jdbc in my script. Do I have to add that to my script?
Thanks!
Your 1st idea is right, I think. You almost certainly need to install the Netezza JDBC driver if you have not done this already.
From the link you posted:
This package can be deployed as part of an application program or from
Spark tools such as spark-shell, spark-sql. To use the package in the
application, you have to specify it in your application’s build
dependency. When using from Spark tools, add the package using
--packages command line option. Netezza JDBC driver also should be
added to the application dependencies.
The Netezza driver is something you have to download yourself, and you need a support entitlement to get access to it (via IBM's Fix Central or Passport Advantage). It is included in either the Windows driver/client support package or the Linux driver package.
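Once you have it, the driver jar has to end up on your application classpath (in IntelliJ/Maven, add the downloaded nzjdbc.jar as a project dependency; --driver-class-path only helps for spark-shell/spark-submit). A minimal sanity check, sketched here (the class name org.netezza.Driver is taken from your stack trace):

// Throws ClassNotFoundException if the Netezza JDBC jar is not on the classpath yet.
Class.forName("org.netezza.Driver")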

SparkContext cannot be initialized in 'yarn-client' mode called from Scala-IDE

I have installed a Cloudera VM (single node) and inside this VM I have Spark running on top of YARN. I would like to use the Eclipse IDE (with the Scala plugin) for testing/learning with Spark.
If I instantiate the SparkContext as follows, everything works as I expected:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._
val sparkConf = new SparkConf().setAppName("TwitterPopularTags").setMaster("local[2]")
However, if I now want to connect to the local cluster by changing the master to 'yarn-client', then it does not work:
val master = "yarn-client"
val sparkConf = new SparkConf().setAppName("TwitterPopularTags").setMaster(master)
Specifically, I'm getting the following errors:
Error details displayed in the Eclipse console:
Error details from the NodeManager logs:
Here are the things I have tried so far:
1. Dependencies
I added all the dependencies through the Maven repository.
The Cloudera version is 5.5; the corresponding Hadoop version is 2.6.0 and the Spark version is 1.5.0.
2. Configurations
I added 3 path variables to the Eclipse classpath:
SPARK_CONF_DIR=/etc/spark/conf/
HADOOP_CONF_DIR=/usr/lib/hadoop/
YARN_CONF_DIR=/etc/hadoop/conf.cloudera.yarn/
Can anybody clarify what the problem is here and how to solve it?
I worked around it! I still don't understand what the exact problem is, but I created a folder with my username in HDFS, i.e. the /user/myusername directory, and it worked (a programmatic sketch is below). Anyway, I have now switched to the Hortonworks distribution and found it much smoother to get started with than the Cloudera distribution.
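For reference, the missing HDFS home directory can also be created programmatically (a sketch using the Hadoop FileSystem API; it assumes the cluster configuration is on the classpath, and /user/myusername is the path from the answer):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Create the user's home directory in HDFS so YARN can stage application files under it.
val fs = FileSystem.get(new Configuration())
fs.mkdirs(new Path("/user/myusername"))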

error: not found: value sc

I am new to Scala and am trying to read a file using the following code:
scala> val textFile = sc.textFile("README.md")
scala> textFile.count()
But I keep getting the following error
error: not found: value sc
I have tried everything, but nothing seems to work. I am using Scala version 2.10.4 and Spark 1.1.0 (I have even tried Spark 1.2.0, but it doesn't work either). I have sbt installed and compiled, yet I am not able to run sbt/sbt assembly. Is the error because of this?
You should run this code using ./spark-shell. It's a Scala REPL with a provided SparkContext (available as sc). You can find it in your Apache Spark distribution, in the folder spark-1.4.1/bin.
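If you are not inside spark-shell (for example when running from sbt or an IDE), you have to create the context yourself. A minimal sketch for the Spark 1.x versions mentioned in the question:

import org.apache.spark.{SparkConf, SparkContext}

// spark-shell builds this for you and names it sc; outside the shell you do it by hand.
val conf = new SparkConf().setAppName("ReadmeCount").setMaster("local[2]")
val sc = new SparkContext(conf)

val textFile = sc.textFile("README.md")
println(textFile.count())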