Unable to connect to remote Apache Spark - Scala

I'm new to Apache Spark and I'm running into issues while trying to connect from my local machine to a remote server that runs a working Spark instance.
I successfully managed to connect via an SSH tunnel to that server using JSch, but I get the following error:
Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.$scope()Lscala/xml/TopScope$;
    at org.apache.spark.ui.jobs.AllJobsPage.<init>(AllJobsPage.scala:39)
    at org.apache.spark.ui.jobs.JobsTab.<init>(JobsTab.scala:38)
    at org.apache.spark.ui.SparkUI.initialize(SparkUI.scala:65)
    at org.apache.spark.ui.SparkUI.<init>(SparkUI.scala:82)
    at org.apache.spark.ui.SparkUI$.create(SparkUI.scala:220)
    at org.apache.spark.ui.SparkUI$.createLiveUI(SparkUI.scala:162)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:452)
    at server.Server$.main(Server.scala:45)
    at server.Server.main(Server.scala)
The error occurs as soon as I try to connect to Spark. This is my Scala code:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("Test").setMaster("spark://xx.xxx.xxx.x:7077")
val sc = new SparkContext(conf)  // Server.scala line 45: the exception is thrown here
val count = sc.parallelize(Array(1, 2, 3, 4, 5)).count()
println(count)
Line 45, referenced as (Server.scala:45) in the stack trace, is the one with new SparkContext(conf).
Both the local and the remote machine use Scala ~2.11.6. In my local pom.xml I declared scala 2.11.6, plus spark-core_2.10 and spark-sql_2.10, both ~2.1.1. On the server I installed Spark ~2.1.1. On the server I also set up the master as the local machine by editing conf/spark-env.sh.
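For reference, here is the dependency set described above, expressed as a minimal sbt sketch rather than the pom.xml actually used (version numbers as stated in the question); note that the _2.10 suffix marks Spark artifacts built against Scala 2.10, while the corresponding Scala 2.11 builds carry a _2.11 suffix:

// build.sbt sketch of the dependencies described in the question (illustrative only)
scalaVersion := "2.11.6"

libraryDependencies ++= Seq(
  // _2.10 artifacts are compiled for Scala 2.10; the Scala 2.11 builds use a _2.11 suffix
  "org.apache.spark" % "spark-core_2.10" % "2.1.1",
  "org.apache.spark" % "spark-sql_2.10"  % "2.1.1"
)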
Of course, I tested the server's Spark installation itself and it works just fine.
What am I doing wrong?

From the docs of setMaster:
The master URL to connect to, such as "local" to run locally with one thread, "local[4]" to
run locally with 4 cores, or "spark://master:7077" to run on a Spark standalone cluster.
If you are running it from the Spark cluster itself (as I understand you are), you should use local[n].
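For example, a minimal sketch using the local[4] form quoted above, reusing the question's app name:

import org.apache.spark.{SparkConf, SparkContext}

// Run Spark locally with 4 worker threads instead of pointing at the remote master
val conf = new SparkConf().setAppName("Test").setMaster("local[4]")
val sc = new SparkContext(conf)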

Related

Jupyter notebook connecting to Apache Spark 3.0

I'm trying to connect my Scala kernel in a notebook environment to an existing Apache Spark 3.0 cluster.
I've tried the following methods of integrating Scala into a notebook environment:
Jupyter Scala (Almond)
Spylon Kernel
Apache Zeppelin
Polynote
In each of these Scala environments I've tried to connect to an existing cluster using the following script:
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .master("spark://<ipaddress>:7077")
  .getOrCreate()
However, when I go to the web UI at localhost:8080, I don't see anything running on the cluster.
I am able to connect to the cluster using PySpark, but I need help connecting Scala to the cluster.

Using Postgresql JDBC source with Apache Spark on EMR

I have an existing EMR cluster running and wish to create a DataFrame from a PostgreSQL database source.
To do this, it seems you need to modify spark-defaults.conf with an updated spark.driver.extraClassPath pointing to the relevant PostgreSQL JAR that has already been downloaded onto the master and slave nodes, or you can add these as arguments to a spark-submit job.
Since I want to use an existing Jupyter notebook to wrangle the data, and am not really looking to relaunch the cluster, what is the most efficient way to resolve this?
I tried the following:
Created a new directory (/usr/lib/postgresql/) on the master and slaves and copied the PostgreSQL JAR (postgresql-9.41207.jre6.jar) into it.
Edited spark-defaults.conf to include the wildcard location:
spark.driver.extraClassPath :/usr/lib/postgresql/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/$
Tried to create a DataFrame in a Jupyter cell using the following code:
SQL_CONN = "jdbc:postgresql://some_postgresql_db:5432/dbname?user=user&password=password"
spark.read.jdbc(SQL_CONN, table="someTable", properties={"driver":'com.postgresql.jdbc.Driver'})
I get the following Java error:
Py4JJavaError: An error occurred while calling o396.jdbc.
: java.lang.ClassNotFoundException: com.postgresql.jdbc.Driver
Help appreciated.
I don't think you need to copy the Postgres JAR to the slaves, as the driver program and cluster manager take care of everything. I've created a DataFrame from a Postgres external source in the following way:
Download postgres driver jar:
cd $HOME && wget https://jdbc.postgresql.org/download/postgresql-42.2.5.jar
Create dataframe:
attribute = {'url': 'jdbc:postgresql://{host}:{port}/{db}?user={user}&password={password}' \
    .format(host=<host>, port=<port>, db=<db>, user=<user>, password=<password>),
    'database': <db>,
    'dbtable': <select * from table>}
df = spark.read.format('jdbc').options(**attribute).load()
Submit the Spark job:
Add the downloaded JAR to the driver class path when submitting the Spark job.
--properties spark.driver.extraClassPath=$HOME/postgresql-42.2.5.jar,spark.jars.packages=org.postgresql:postgresql:42.2.5
Check the GitHub repo of the driver. The class name seems to be org.postgresql.Driver. Try using that.

How to access remote HDFS cluster from my PC

I'm trying to access a remote Cloudera HDFS cluster from my local PC (Windows 7). As cricket_007 suggested in my last question, I did the following things:
(1) I created the following Spark session:
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("API")
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
  .master("local")
  .enableHiveSupport()
  .getOrCreate()
(2) I copied the following files from the cluster:
core-site.xml
hdfs-site.xml
hive-site.xml
mapred-site.xml
yarn-site.xml
and pointed the HADOOP_CONF_DIR environment variable to the directory that contains them.
(3) I downloaded Spark and configured the SPARK_HOME and SPARK_CONF_DIR environment variables.
(4) I downloaded winutils and added it to the PATH variable. I changed the permissions of /tmp/hive to 777.
When the master is set to local I see only the default database, which means Spark doesn't pick up the XML files. When it is set to yarn the screen is stuck: it looks like my PC is working, but it takes too much time and never finishes. When I use local and also add the line .config("hive.metastore.uris", "thrift://MyMaster:9083"), everything works well.
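For reference, a minimal sketch of the variant described as working, i.e. the builder from step (1) with a local master plus the explicit Hive metastore URI (MyMaster as in the question):

import org.apache.spark.sql.SparkSession

// Local master with the metastore address given explicitly,
// matching the configuration the question reports as working
val spark = SparkSession
  .builder()
  .appName("API")
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
  .config("hive.metastore.uris", "thrift://MyMaster:9083")
  .master("local")
  .enableHiveSupport()
  .getOrCreate()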
Why might this be happening? Why do I see only the default database locally? Why can't I connect when the master is set to yarn, and why does it get stuck? And why does adding the config line solve my problem only locally?

Connect to SQL Data Warehouse from HDInsight OnDemand

I'm trying to read/write data to an Azure SQL Data Warehouse from an on-demand Spark HDInsight cluster.
I can do this from a normal HDInsight Spark cluster by using a script action to install the JDBC driver, but I don't think it's possible to run script actions on the on-demand clusters.
I've tried
Copying the files from %user%\.m2\repository\com\microsoft\sqlserver\mssql-jdbc\6.2.2.jre8 up to blob storage, in a folder called jars next to where the built Spark code is.
Including the driver dependency in the built JAR file.
Both of these led to a java.lang.NoClassDefFoundError.
I'm not too familiar with Scala/Maven/the JVM, so I'm not sure what else to try or include in this SO question.
The Scala code I'm trying to run is:
import org.apache.spark.sql.SparkSession

val sqlContext = SparkSession.builder().appName("GenerateEventsSql").getOrCreate()
val jdbcSqlConnStr = "jdbc:sqlserver://someserver.database.windows.net:1433;databaseName=myDW;user=admin;password=XXXX;"
val tableName = "dbo.SomeTable"
val allTableData = sqlContext.read.format("jdbc")
  .options(Map(
    "driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "url" -> jdbcSqlConnStr,
    "dbtable" -> tableName))
  .load()
JARs in a Blob storage folder are not accessible to the class path of an HDInsight Spark job. You need to copy the JAR files to the local host, for example /tmp/jars/xyz.jar, and reference them in the spark-submit command.
For example:
nohup spark-submit --jars /tmp/jars/xyz.jar --class <MainClass> <your-app>.jar &

pyorient doesn't connect to OrientDB on port 2424 from cross-domain

I'm connecting to my OrientDB from one AWS instance to another:
import pyorient

client = pyorient.OrientDB("129.123.12.123", 2424)
client.db_open(
    "MyDB",
    "root",
    "secret",
    db_type=pyorient.DB_TYPE_GRAPH
)
The db_open call just hangs, without connecting or raising an error. I suspect it's because I'm connecting from another IP. Is there a way around this? I have one server that hosts all my code and Docker containers, but my OrientDB nodes, running in a distributed cluster, have different IPs.
This seems to be a bug in pyorient 1.5.4. The OrientSerialization.CSV serializer gets stuck in an infinite loop when connecting to OrientDB in distributed mode.
There is a development branch on pyorient that implements the missing binary serialiser.
Install it with:
pip install https://github.com/mogui/pyorient/tarball/develop#egg=pyorient
Connect using:
client = pyorient.OrientDB("129.123.12.123", 2424, serialization_type=pyorient.OrientSerialization.Binary)
This works but is obviously not stable yet.