Cannot connect to remote MongoDB from EMR cluster with spark-shell - mongodb

I'm trying to connect to a remote Mongo database from a EMR cluster. The following code is executed with the command spark-shell --packages com.stratio.datasource:spark-mongodb_2.10:0.11.2:
import com.stratio.datasource.mongodb._
import com.stratio.datasource.mongodb.config._
import com.stratio.datasource.mongodb.config.MongodbConfig._
val builder = MongodbConfigBuilder(Map(Host -> List("[IP.OF.REMOTE.HOST]:3001"), Database -> "meteor", Collection ->"my_target_collection", ("user", "user_name"), ("database", "meteor"), ("password", "my_password")))
val readConfig = builder.build()
val mongoRDD = sqlContext.fromMongoDB(readConfig)
Spark-shell responds with the following error:
16/07/26 15:44:35 INFO SparkContext: Starting job: aggregate at MongodbSchema.scala:47
16/07/26 15:44:45 WARN DAGScheduler: Creating new stage failed due to exception - job: 1
com.mongodb.MongoTimeoutException: Timed out after 10000 ms while waiting to connect. Client view of cluster state is {type=Unknown, servers=[{address=[IP.OF.REMOTE.HOST]:3001, type=Unknown, state=Connecting, exception={java.lang.IllegalArgumentException: response too long: 1347703880}}]
at com.mongodb.BaseCluster.getDescription(BaseCluster.java:128)
at com.mongodb.DBTCPConnector.getClusterDescription(DBTCPConnector.java:394)
at com.mongodb.DBTCPConnector.getType(DBTCPConnector.java:571)
at com.mongodb.DBTCPConnector.getReplicaSetStatus(DBTCPConnector.java:362)
at com.mongodb.Mongo.getReplicaSetStatus(Mongo.java:446)
.
.
.
After reading for a while, a few responses here in SO and other forums state that the java.lang.IllegalArgumentException: response too long: 1347703880 error might be caused by a faulty Mongo driver. Based on that I started executing spark-shell with updated drivers like so:
spark-shell --packages com.stratio.datasource:spark-mongodb_2.10:0.11.2 --jars casbah-commons_2.10-3.1.1.jar,casbah-core_2.10-3.1.1.jar,casbah-query_2.10-3.1.1ja.jar,mongo-java-driver-2.13.0.jar
Of course before this I downloaded the jars and stored them in the same route as the spark-shell was executed. Nonetheless, with this approach spark-shell answers with the following cryptic error message:
Exception in thread "dag-scheduler-event-loop" java.lang.NoClassDefFoundError: com/mongodb/casbah/query/dsl/CurrentDateOp
at com.mongodb.casbah.MongoClient.apply(MongoClient.scala:218)
at com.stratio.datasource.mongodb.partitioner.MongodbPartitioner.isShardedCollection(MongodbPartitioner.scala:78)
It is worth mentioning that the target MongoDB is a Meteor Mongo database, that's why I'm trying to connect with [IP.OF.REMOTE.HOST]:3001 instead of using the port 27017.
What might be the issue? I've followed many tutorials but all of them seem to have the MongoDB in the same host, allowing them to declare localhost:27017 in the credentials. Is there something I'm missing?
Thanks for the help!

I ended up using MongoDB's official Java driver instead. This was my first experience with Spark and the Scala programming language, so I wasn't very familiar with the idea of using plain Java JARs yet.
The solution
I downloaded the necessary JARs and stored them in the same directory as the job file, which is a Scala file. So the directory looked something like:
/job_directory
|--job.scala
|--bson-3.0.1.jar
|--mongodb-driver-3.0.1.jar
|--mongodb-driver-core-3.0.1.jar
Then, I start spark-shell as follows to load the JARs and its classes into the shell environment:
spark-shell --jars "mongodb-driver-3.0.1.jar,mongodb-driver-core-3.0.1.jar,bson-3.0.1.jar"
Next, I execute the following to load the source code of the job into the spark-shell:
:load job.scala
Finally I execute the main object in my job like so:
MainObject.main(Array())
As of the code inside the MainObject, it is merely as the tutorial states:
val mongo = new MongoClient(IP_OF_REMOTE_MONGO , 27017)
val db = mongo.getDB(DB_NAME)
Hopefully this will help future readers and spark-shell/Scala beginners!

Related

Spark job dataframe write to Oracle using jdbc failing

When writing spark dataframe to Oracle database (Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit), the spark job is failing with the exception java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection. The scala code is
dataFrame.write.mode(SaveMode.Append).jdbc("jdbc:oracle:thin:#" + ipPort + ":" + sid, table, props)
Already tried setting below properties for jdbc connection but hasn't worked.
props.put("driver", "oracle.jdbc.OracleDriver")
props.setProperty("testOnBorrow","true")
props.setProperty("testOnReturn","false")
props.setProperty("testWhileIdle","false")
props.setProperty("validationQuery","SELECT 1 FROM DUAL")
props.setProperty("autoReconnect", "true")
Based on the earlier search results, it seems that the connection is opened initially but is being killed by the firewall after some idle time. The connection URL is verified and is working as the select queries work fine. Need help in getting this resolved.

Connecting to remote hive metastore from a Spark program in Intellij

First let's create a hive-enabled spark session:
val spark = SparkSession.builder.config(conf).enableHiveSupport.getOrCreate
Then let's try to connect to the remote dB:
spark.sql("use my_remote_db").show
17/12/10 10:27:02 WARN ObjectStore: Failed to get database my_remote_db, returning NoSuchObjectException
Exception in thread "main" org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'my_remote_db' not found;
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.org$apache$spark$sql$catalyst$catalog$SessionCatalog$$requireDbExists(SessionCatalog.scala:173)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.setCurrentDatabase(SessionCatalog.scala:268)
at org.apache.spark.sql.execution.command.SetDatabaseCommand.run(databases.scala:59)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:182)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623)
at com.mycompany.sbseg.graph.poc.features.LoadGraphData$.loadGraphData(LoadGraphData.scala:8)
Note: this same code works in the spark-shell.
Here are the additional settings made in Intellij to emulate the bash shell environment used to run spark-shell:
To ensure they were getting set properly they are printed out:
Seq("SPARK_HOME","HIVE_CONF_DIR","HIVE_HOME")
.foreach{ s=>println(s"$s:${System.getenv(s)}")}
These print out the same/correct results as on the command line.
/shared/spark
/Users/sboesch/yarnconf
/usr/local/Cellar/hive/2.1.1/libexec
So it is unclear what the differences might be between the Intellij and the bash environments - and why the code does not work properly in the former.

Pyspark connection to Postgres database in ipython notebook

I've read previous posts on this, but I still cannot pinpoint why I am unable to connect my ipython notebook to a Postgres db.
I am able to launch pyspark in an ipython notebook, SparkContext is loaded as 'sc'.
I have the following in my .bash_profile for finding the Postgres driver:
export SPARK_CLASSPATH=/path/to/downloaded/jar
Here's what I am doing in the ipython notebook to connect to the db (based on this post):
from pyspark.sql import DataFrameReader as dfr
sqlContext = SQLContext(sc)
table= 'some query'
url = 'postgresql://localhost:5432/dbname'
properties = {'user': 'username', 'password': 'password'}
df = dfr(sqlContext).jdbc(
url='jdbc:%s' % url, table=table, properties=properties
)
The error:
Py4JJavaError: An error occurred while calling o156.jdbc.
: java.SQL.SQLException: No suitable driver.
I understand it's an error with finding the driver I've downloaded, but I don't understand why I am getting this error when I've added the path to it in my .bash_profile.
I also tried to set driver via pyspark --jars, but I get a "no such file or directory" error.
This blogpost also shows how to connect to Postgres data sources, but the following also gives me a "no such directory" error:
./bin/spark-shell --packages org.postgresql:postgresql:42.1.4
Additional info:
spark version: 2.2.0
python version: 3.6
java: 1.8.0_25
postgres driver: 42.1.4
I am not sure why the above answer did not work for me but I thought I could also share what actually worked for me when running pyspark from a jupyter notebook (Spark 2.3.1 - Python 3.6.3):
from pyspark.sql import SparkSession
spark = SparkSession.builder.config('spark.driver.extraClassPath', '/path/to/postgresql.jar').getOrCreate()
url = 'jdbc:postgresql://host/dbname'
properties = {'user': 'username', 'password': 'pwd'}
df = spark.read.jdbc(url=url, table='tablename', properties=properties)
They've changed how this works several times in Apache Spark. Looking at my setup, this is what I have in my .bashrc (aka .bash_profile on Mac), so you could try it: export SPARK_CLASSPATH=$SPARK_CLASSPATH:/absolute/path/to/your/driver.jar Edit: I'm using Spark 1.6.1.
And, as always, make sure you use a new shell or source the script so you have the updated envvar (verify with echo $SPARK_CLASSPATH in your shell before you run ipython notebook).
I followed directions in this post. SparkContext is already set as sc for me, so all I had to do was remove the SPARK_CLASSPATH setting from my .bash_profile, and use the following in my ipython notebook:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-class-path /path/to/postgresql-42.1.4.jar --jars /path/to/postgresql-42.1.4.jar pyspark-shell'
I added a 'driver' settings to properties as well, and it worked. As stated elsewhere in this post, this is likely because SPARK_CLASSPATH is deprecated, and it is preferable to use --driver-class-path.

JanusGraph DynamoDB backend exceptions when committing data to the database

Hello: I am using JanusGraph with DynamoDB example from https://github.com/awslabs/dynamodb-janusgraph-storage-backend
Also, I am connecting to JanusGraph using Spark - Scala - Gremlin Scala framework. Everything thing works when I used Cassandra as the backend, but when I switch to using DynamoDB, I start getting backend exception errors.
My conf looks like this
val conf = new BaseConfiguration
conf.setProperty("gremlin.graph","org.janusgraph.core.JanusGraphFactory")
conf.setProperty("storage.write-time","1 ms")
conf.setProperty("storage.read-time","1 ms")
conf.setProperty("storage.backend","com.amazon.janusgraph.diskstorage.dynamodb.DynamoDBStoreManager")
conf.setProperty("storage.dynamodb.client.signing-region","us-east-1")
conf.setProperty("storage.dynamodb.client.endpoint","http://127.0.0.1:8000")
val graph = JanusGraphFactory.open(conf)
I can connect DynamoDB fine, but when I start to insert data, I run into backend exceptions.
Below is part of the error log
ERROR org.janusgraph.graphdb.database.StandardJanusGraph - Could not commit transaction [1] due to storage exception in system-commit
org.janusgraph.core.JanusGraphException: Could not execute operation due to backend exception
at org.janusgraph.diskstorage.util.BackendOperation.execute(BackendOperation.java:57)
at org.janusgraph.diskstorage.keycolumnvalue.cache.CacheTransaction.persist(CacheTransaction.java:95)
at org.janusgraph.diskstorage.keycolumnvalue.cache.CacheTransaction.flushInternal(CacheTransaction.java:143)
at org.janusgraph.diskstorage.keycolumnvalue.cache.CacheTransaction.commit(CacheTransaction.java:200)
at org.janusgraph.diskstorage.BackendTransaction.commit(BackendTransaction.java:150)
at org.janusgraph.graphdb.database.StandardJanusGraph.commit(StandardJanusGraph.java:703)
at org.janusgraph.graphdb.transaction.StandardJanusGraphTx.commit(StandardJanusGraphTx.java:1363)
at org.janusgraph.graphdb.tinkerpop.JanusGraphBlueprintsGraph$GraphTransaction.doCommit(JanusGraphBlueprintsGraph.java:272)
at org.apache.tinkerpop.gremlin.structure.util.AbstractTransaction.commit(AbstractTransaction.java:105)
at $line81.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1$$anonfun$apply$1.apply(:84)
at $line81.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1$$anonfun$apply$1.apply(:80)
Any idea what is going on here. I am pretty new to DynamoDB. This was working fine in Cassandra
Why do you know you are connected? I think you must provide credentials into your config. For example:
conf.setProperty("storage.dynamodb.client.credentials.class-name", "com.amazonaws.auth.BasicAWSCredentials")
conf.setProperty("storage.dynamodb.client.credentials.constructor-args", "ACCESS_KEY,SECRET_KEY")

MongoInternalException: DBPort.findOne failed while running on GAE localserver

I am trying to connect to remote MongoDB (mongolab) from my local GAE server (localhost/8888). I am using morphia and my mongodb driver version is 2.4. My code looks like this:
Mongo m = new Mongo("xyz.mongolab.com",);
Datastore datastore = new Morphia().createDatastore(m, "staging","uname","password".toCharArray());
This throws the following exception :
com.mongodb.MongoInternalException: DBPort.findOne failed
at com.mongodb.DBPort.findOne(DBPort.java:153)
at com.mongodb.DBPort.runCommand(DBPort.java:159)
at com.mongodb.DBTCPConnector.testMaster(DBTCPConnector.java:371)
at com.mongodb.Mongo.(Mongo.java:167)
Caused by: java.io.IOException: couldn't connect to [xyz.mongolab.com/:] bc:java.net.SocketException: Operation failure: setSocketOptions: Not yet implemented
at com.mongodb.DBPort._open(DBPort.java:205)
Does somebody know why this is happening ?
it was a problem with using the old mongodb driver.. works after i upgraded..