Connecting to a remote Hive metastore from a Spark program in IntelliJ (Scala)

First, let's create a Hive-enabled Spark session:
val spark = SparkSession.builder.config(conf).enableHiveSupport.getOrCreate
Then let's try to connect to the remote database:
spark.sql("use my_remote_db").show
17/12/10 10:27:02 WARN ObjectStore: Failed to get database my_remote_db, returning NoSuchObjectException
Exception in thread "main" org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'my_remote_db' not found;
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.org$apache$spark$sql$catalyst$catalog$SessionCatalog$$requireDbExists(SessionCatalog.scala:173)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.setCurrentDatabase(SessionCatalog.scala:268)
at org.apache.spark.sql.execution.command.SetDatabaseCommand.run(databases.scala:59)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:182)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623)
at com.mycompany.sbseg.graph.poc.features.LoadGraphData$.loadGraphData(LoadGraphData.scala:8)
Note: this same code works in the spark-shell.
Here are the additional settings made in IntelliJ to emulate the bash environment used to run spark-shell.
To ensure they are being set properly, they are printed out:
Seq("SPARK_HOME","HIVE_CONF_DIR","HIVE_HOME")
.foreach{ s=>println(s"$s:${System.getenv(s)}")}
These print out the same (correct) results as on the command line:
SPARK_HOME:/shared/spark
HIVE_CONF_DIR:/Users/sboesch/yarnconf
HIVE_HOME:/usr/local/Cellar/hive/2.1.1/libexec
So it is unclear what the differences might be between the IntelliJ and bash environments, and why the code does not work properly in the former.
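For what it's worth, one way to take the classpath question out of the equation is to point the session at the metastore explicitly instead of relying on the hive-site.xml under HIVE_CONF_DIR being picked up. This is only a sketch: thrift://metastore-host:9083 is a placeholder for the actual metastore URI, and conf is the same SparkConf used in the original snippet.
import org.apache.spark.sql.SparkSession
// Sketch only: point the session at the remote metastore explicitly.
// thrift://metastore-host:9083 is a placeholder, not the real URI.
val spark = SparkSession.builder
  .config(conf) // the same SparkConf as above
  .config("hive.metastore.uris", "thrift://metastore-host:9083")
  .enableHiveSupport()
  .getOrCreate()
spark.sql("use my_remote_db").show()
If this works inside IntelliJ while the original code does not, that would suggest hive-site.xml simply is not on the application classpath in the IDE run configuration.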

Related

Sqoop - Postgres: No Connection Parameters Specified

I am trying to connect to a Postgres DB using Sqoop (the end goal is to import tables directly into HDFS); however, I am facing the issue below.
sqoop list-tables --connect jdbc:postgresql://<server_name>:5432/aae_data --username my_username -P --verbose
Warning: /opt/cloudera/parcels/CDH-5.9.1-1.cdh5.9.1.p2260.2452/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
18/04/24 00:13:40 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.9.1
18/04/24 00:13:40 DEBUG tool.BaseSqoopTool: Enabled debug logging.
Enter password:
18/04/24 00:13:44 DEBUG sqoop.ConnFactory: Loaded manager factory: org.apache.sqoop.manager.oracle.OraOopManagerFactory
18/04/24 00:13:44 DEBUG sqoop.ConnFactory: Loaded manager factory: com.cloudera.sqoop.manager.DefaultManagerFactory
18/04/24 00:13:44 DEBUG sqoop.ConnFactory: Trying ManagerFactory: org.apache.sqoop.manager.oracle.OraOopManagerFactory
18/04/24 00:13:45 DEBUG oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop can be called by Sqoop!
18/04/24 00:13:45 DEBUG sqoop.ConnFactory: Trying ManagerFactory: com.cloudera.sqoop.manager.DefaultManagerFactory
18/04/24 00:13:45 DEBUG manager.DefaultManagerFactory: Trying with scheme: jdbc:postgresql:
18/04/24 00:13:45 INFO manager.SqlManager: Using default fetchSize of 1000
18/04/24 00:13:45 DEBUG sqoop.ConnFactory: Instantiated ConnManager org.apache.sqoop.manager.PostgresqlManager@56a6d5a6
18/04/24 00:13:45 DEBUG manager.SqlManager: No connection paramenters specified. Using regular API for making connection.
Does anyone know what might be the issue here?
Do I need to specify a connection manager? If yes, how do I pass the jar file?
Thank You.

Spark job DataFrame write to Oracle using JDBC failing

When writing a Spark DataFrame to an Oracle database (Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit), the Spark job fails with the exception java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection. The Scala code is:
dataFrame.write.mode(SaveMode.Append).jdbc("jdbc:oracle:thin:@" + ipPort + ":" + sid, table, props)
I have already tried setting the properties below for the JDBC connection, but it hasn't worked:
props.put("driver", "oracle.jdbc.OracleDriver")
props.setProperty("testOnBorrow","true")
props.setProperty("testOnReturn","false")
props.setProperty("testWhileIdle","false")
props.setProperty("validationQuery","SELECT 1 FROM DUAL")
props.setProperty("autoReconnect", "true")
Based on earlier search results, it seems that the connection is opened initially but is killed by the firewall after some idle time. The connection URL has been verified and works, since SELECT queries run fine. I need help getting this resolved.
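As a side note, testOnBorrow, testOnReturn, testWhileIdle, validationQuery, and autoReconnect look like connection-pool (DBCP) and MySQL-specific settings; Spark hands the properties straight to the Oracle thin driver, which as far as I can tell simply ignores them. A minimal sketch of the write, where db-host, 1521, ORCL, db_user, db_password, and target_table are placeholders and oracle.net.keepAlive is the thin driver's TCP keep-alive switch (worth verifying against your driver version), could look like this:
import java.util.Properties
import org.apache.spark.sql.SaveMode
val url = "jdbc:oracle:thin:@db-host:1521:ORCL" // placeholder host:port:SID
val props = new Properties()
props.put("driver", "oracle.jdbc.OracleDriver")
props.put("user", "db_user")         // placeholder
props.put("password", "db_password") // placeholder
// TCP keep-alive can help when a firewall drops idle connections;
// verify this property name against the driver version in use.
props.put("oracle.net.keepAlive", "true")
// dataFrame is the DataFrame from the question.
dataFrame.write.mode(SaveMode.Append).jdbc(url, "target_table", props)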

Pyspark connection to Postgres database in ipython notebook

I've read previous posts on this, but I still cannot pinpoint why I am unable to connect my ipython notebook to a Postgres db.
I am able to launch pyspark in an ipython notebook, and SparkContext is loaded as 'sc'.
I have the following in my .bash_profile for finding the Postgres driver:
export SPARK_CLASSPATH=/path/to/downloaded/jar
Here's what I am doing in the ipython notebook to connect to the db (based on this post):
from pyspark.sql import SQLContext
from pyspark.sql import DataFrameReader as dfr
sqlContext = SQLContext(sc)
table = 'some query'
url = 'postgresql://localhost:5432/dbname'
properties = {'user': 'username', 'password': 'password'}
df = dfr(sqlContext).jdbc(
    url='jdbc:%s' % url, table=table, properties=properties
)
The error:
Py4JJavaError: An error occurred while calling o156.jdbc.
: java.sql.SQLException: No suitable driver.
I understand it's an error with finding the driver I've downloaded, but I don't understand why I am getting this error when I've added the path to it in my .bash_profile.
I also tried to set the driver via pyspark --jars, but I get a "no such file or directory" error.
This blog post also shows how to connect to Postgres data sources, but the following also gives me a "no such directory" error:
./bin/spark-shell --packages org.postgresql:postgresql:42.1.4
Additional info:
spark version: 2.2.0
python version: 3.6
java: 1.8.0_25
postgres driver: 42.1.4
I am not sure why the above answer did not work for me, but I thought I could also share what actually worked for me when running pyspark from a Jupyter notebook (Spark 2.3.1, Python 3.6.3):
from pyspark.sql import SparkSession
spark = SparkSession.builder.config('spark.driver.extraClassPath', '/path/to/postgresql.jar').getOrCreate()
url = 'jdbc:postgresql://host/dbname'
properties = {'user': 'username', 'password': 'pwd'}
df = spark.read.jdbc(url=url, table='tablename', properties=properties)
They've changed how this works several times in Apache Spark. Looking at my setup, this is what I have in my .bashrc (a.k.a. .bash_profile on Mac), so you could try it:
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/absolute/path/to/your/driver.jar
Edit: I'm using Spark 1.6.1.
And, as always, make sure you use a new shell or source the script so you have the updated environment variable (verify with echo $SPARK_CLASSPATH in your shell before you run ipython notebook).
I followed the directions in this post. SparkContext is already set as sc for me, so all I had to do was remove the SPARK_CLASSPATH setting from my .bash_profile and use the following in my ipython notebook:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-class-path /path/to/postgresql-42.1.4.jar --jars /path/to/postgresql-42.1.4.jar pyspark-shell'
I added a 'driver' setting to properties as well, and it worked. As stated elsewhere in this post, this is likely because SPARK_CLASSPATH is deprecated and it is preferable to use --driver-class-path.

Cannot connect to remote MongoDB from EMR cluster with spark-shell

I'm trying to connect to a remote Mongo database from an EMR cluster. The following code is executed with the command spark-shell --packages com.stratio.datasource:spark-mongodb_2.10:0.11.2:
import com.stratio.datasource.mongodb._
import com.stratio.datasource.mongodb.config._
import com.stratio.datasource.mongodb.config.MongodbConfig._
val builder = MongodbConfigBuilder(Map(Host -> List("[IP.OF.REMOTE.HOST]:3001"), Database -> "meteor", Collection ->"my_target_collection", ("user", "user_name"), ("database", "meteor"), ("password", "my_password")))
val readConfig = builder.build()
val mongoRDD = sqlContext.fromMongoDB(readConfig)
Spark-shell responds with the following error:
16/07/26 15:44:35 INFO SparkContext: Starting job: aggregate at MongodbSchema.scala:47
16/07/26 15:44:45 WARN DAGScheduler: Creating new stage failed due to exception - job: 1
com.mongodb.MongoTimeoutException: Timed out after 10000 ms while waiting to connect. Client view of cluster state is {type=Unknown, servers=[{address=[IP.OF.REMOTE.HOST]:3001, type=Unknown, state=Connecting, exception={java.lang.IllegalArgumentException: response too long: 1347703880}}]
at com.mongodb.BaseCluster.getDescription(BaseCluster.java:128)
at com.mongodb.DBTCPConnector.getClusterDescription(DBTCPConnector.java:394)
at com.mongodb.DBTCPConnector.getType(DBTCPConnector.java:571)
at com.mongodb.DBTCPConnector.getReplicaSetStatus(DBTCPConnector.java:362)
at com.mongodb.Mongo.getReplicaSetStatus(Mongo.java:446)
.
.
.
After reading for a while, a few responses here on SO and in other forums state that the java.lang.IllegalArgumentException: response too long: 1347703880 error might be caused by a faulty Mongo driver. Based on that, I started executing spark-shell with updated drivers, like so:
spark-shell --packages com.stratio.datasource:spark-mongodb_2.10:0.11.2 --jars casbah-commons_2.10-3.1.1.jar,casbah-core_2.10-3.1.1.jar,casbah-query_2.10-3.1.1ja.jar,mongo-java-driver-2.13.0.jar
Of course, before this I downloaded the JARs and stored them in the same directory from which spark-shell was executed. Nonetheless, with this approach spark-shell responds with the following cryptic error message:
Exception in thread "dag-scheduler-event-loop" java.lang.NoClassDefFoundError: com/mongodb/casbah/query/dsl/CurrentDateOp
at com.mongodb.casbah.MongoClient.apply(MongoClient.scala:218)
at com.stratio.datasource.mongodb.partitioner.MongodbPartitioner.isShardedCollection(MongodbPartitioner.scala:78)
It is worth mentioning that the target MongoDB is a Meteor Mongo database, which is why I'm trying to connect with [IP.OF.REMOTE.HOST]:3001 instead of using port 27017.
What might be the issue? I've followed many tutorials, but all of them seem to have MongoDB on the same host, allowing them to declare localhost:27017 in the credentials. Is there something I'm missing?
Thanks for the help!
I ended up using MongoDB's official Java driver instead. This was my first experience with Spark and the Scala programming language, so I wasn't very familiar with the idea of using plain Java JARs yet.
The solution
I downloaded the necessary JARs and stored them in the same directory as the job file, which is a Scala file. So the directory looked something like:
/job_directory
|--job.scala
|--bson-3.0.1.jar
|--mongodb-driver-3.0.1.jar
|--mongodb-driver-core-3.0.1.jar
Then, I start spark-shell as follows to load the JARs and their classes into the shell environment:
spark-shell --jars "mongodb-driver-3.0.1.jar,mongodb-driver-core-3.0.1.jar,bson-3.0.1.jar"
Next, I execute the following to load the source code of the job into the spark-shell:
:load job.scala
Finally I execute the main object in my job like so:
MainObject.main(Array())
As for the code inside MainObject, it is merely what the tutorial states:
val mongo = new MongoClient(IP_OF_REMOTE_MONGO , 27017)
val db = mongo.getDB(DB_NAME)
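For completeness, a minimal self-contained sketch of such an object is below; IP_OF_REMOTE_MONGO and DB_NAME are the same placeholders as in the snippet above, and the getCollectionNames() call is only an illustrative connectivity check, not part of the original job.
import com.mongodb.MongoClient
object MainObject {
  // Placeholders, as in the snippet above.
  val IP_OF_REMOTE_MONGO = "IP.OF.REMOTE.HOST"
  val DB_NAME = "meteor"
  def main(args: Array[String]): Unit = {
    val mongo = new MongoClient(IP_OF_REMOTE_MONGO, 27017)
    val db = mongo.getDB(DB_NAME)
    // List the collections to confirm the connection works.
    println(db.getCollectionNames())
    mongo.close()
  }
}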
Hopefully this will help future readers and spark-shell/Scala beginners!

grails db-migration dbm-update command failure

I am using Grails 2.4.2 and PostgreSQL with the Grails db-migration plugin 1.4.0. Whenever I issue the grails dbm-update command, I get the following exception, which results in the table creation failing:
Starting dbm-update for database user @ jdbc:postgresql://localhost:5432/mydb
liquibase.exception.LockException: liquibase.exception.DatabaseException: Empty result set, expected one row
at liquibase.lockservice.LockService.acquireLock(LockService.java:121)
at liquibase.lockservice.LockService.waitForLock(LockService.java:61)
at liquibase.Liquibase.update(Liquibase.java:102)
at DbmUpdate$_run_closure1_closure2.doCall(DbmUpdate:26)
at DbmUpdate$_run_closure2_closure11.doCall(DbmUpdate:59)
at grails.plugin.databasemigration.MigrationUtils.executeInSession(MigrationUtils.groovy:133)
at DbmUpdate$_run_closure2.doCall(DbmUpdate:51)
at DbmUpdate$_run_closure1.doCall(DbmUpdate:25)
Caused by: liquibase.exception.DatabaseException: Empty result set, expected one row
at liquibase.util.JdbcUtils.requiredSingleResult(JdbcUtils.java:124)
at liquibase.executor.jvm.JdbcExecutor.queryForObject(JdbcExecutor.java:159)
at liquibase.executor.jvm.JdbcExecutor.queryForObject(JdbcExecutor.java:167)
at liquibase.executor.jvm.JdbcExecutor.queryForObject(JdbcExecutor.java:163)
at liquibase.lockservice.LockService.acquireLock(LockService.java:96)
... 7 more
Any help will be appreciated.