Apache Spark : JDBC connection not working - postgresql

I have asked this question previously also but did not got any answer (Not able to connect to postgres using jdbc in pyspark shell).
I have successfully installed Spark 1.3.0 on my local windows and ran sample programs to test using pyspark shell.
Now, I want to run Correlations from Mllib on the data that is stored in Postgresql, but I am not able to connect to postgresql.
I have successfully added the required jar (tested this jar) in the classpath by running
pyspark --jars "C:\path\to\jar\postgresql-9.2-1002.jdbc3.jar"
I can see that jar is successfully added in environment UI.
When I run the following in pyspark shell-
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.load(source="jdbc",url="jdbc:postgresql://[host]/[dbname]", dbtable="[schema.table]")
I get this ERROR -
>>> df = sqlContext.load(source="jdbc",url="jdbc:postgresql://[host]/[dbname]", dbtable="[schema.table]")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\ACERNEW3\Desktop\Spark\spark-1.3.0-bin-hadoop2.4\python\pyspark\sql\context.py", line 482, in load
df = self._ssql_ctx.load(source, joptions)
File "C:\Users\ACERNEW3\Desktop\Spark\spark-1.3.0-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line 538, in __call__
File "C:\Users\ACERNEW3\Desktop\Spark\spark-1.3.0-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o20.load.
: java.sql.SQLException: No suitable driver found for jdbc:postgresql://[host]/[dbname]
at java.sql.DriverManager.getConnection(DriverManager.java:602)
at java.sql.DriverManager.getConnection(DriverManager.java:207)
at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:94)
at org.apache.spark.sql.jdbc.JDBCRelation.<init> (JDBCRelation.scala:125)
at org.apache.spark.sql.jdbc.DefaultSource.createRelation(JDBCRelation.scala:114)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:290)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:679)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:667)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:619)

I had this exact problem with mysql/mariadb, and got BIG clue from this question
So your pyspark command should be:
pyspark --conf spark.executor.extraClassPath=<jdbc.jar> --driver-class-path <jdbc.jar> --jars <jdbc.jar> --master <master-URL>
Also watch for errors when pyspark start like "Warning: Local jar ... does not exist, skipping." and "ERROR SparkContext: Jar not found at ...", these probably mean you spelled the path wrong.

A slightly more elegant solution:
val props = new Properties
props.put("driver", "org.postgresql.Driver")
sqlContext.read.jdbc("jdbc:postgresql://[host]/[dbname]", props)

As jake256 suggested
"driver", "org.postgresql.Driver"
key-value pair was missing. In my case, I launched pyspark as :
pyspark --jars /path/to/postgresql-9.4.1210.jar
with following instructions :
from pyspark.sql import DataFrameReader
url = 'postgresql://192.168.2.4:5432/postgres'
properties = {'user': 'myUser', 'password': 'myPasswd', 'driver': 'org.postgresql.Driver'}
df = DataFrameReader(sqlContext).jdbc(
url='jdbc:%s' % url, table='weather', properties=properties
)
df.show()
+-------------+-------+-------+-----------+----------+
| city|temp_lo|temp_hi| prcp| date|
+-------------+-------+-------+-----------+----------+
|San Francisco| 46| 50| 0.25|1994-11-27|
|San Francisco| 43| 57| 0.0|1994-11-29|
| Hayward| 54| 37|0.239999995|1994-11-29|
+-------------+-------+-------+-----------+----------+
Tested on :
Ubuntu 16.04
PostgreSQL server version 9.5.
Postgresql driver used is postgresql-9.4.1210.jar
and Spark version is spark-2.0.0-bin-hadoop2.6
but I am also confident that it should also work on
spark-2.0.0-bin-hadoop2.7.
Java JDK 1.8 64bits
other JDBC Drivers can be found on :
https://www.petefreitag.com/articles/jdbc_urls/
tutorial I followed is on :
https://developer.ibm.com/clouddataservices/2015/08/19/speed-your-sql-queries-with-spark-sql/
similar solution was suggested also on :
pyspark mysql jdbc load An error occurred while calling o23.load No suitable driver

This error seems to get thrown when you use the wrong version of JDBC driver. Check https://jdbc.postgresql.org/download.html to make sure that you have the right one.
Note in particular:
JDK 1.1 - JDBC 1. Note that with the 8.0 release JDBC 1 support has
been removed, so look to update your JDK when you update your server.
JDK 1.2, 1.3 - JDBC 2. JDK 1.3 + J2EE - JDBC 2 EE. This contains
additional support for javax.sql classes.
JDK 1.4, 1.5 - JDBC 3. This contains support for SSL and javax.sql, but does not require J2EE as it has been added to the J2SE release. JDK 1.6 - JDBC4. Support for JDBC4 methods is not complete, but the majority of methods are implemented.
JDK 1.7, 1.8 - JDBC41. Support for JDBC4 methods is not
complete, but the majority of methods are implemented.

see this post please, just place your script after all the options. see this

That’s pretty straightforward. To connect to external database to retrieve data into Spark dataframes additional jar file is required. E.g. with MySQL the JDBC driver is required. Download the driver package and extract mysql-connector-java-x.yy.zz-bin.jar in a path that’s accessible from every node in the cluster. Preferably this is a path on shared file system. E.g. with Pouta Virtual Cluster such path would be under /shared_data, here I use /shared_data/thirdparty_jars/.
With direct Spark job submissions from terminal one can specify –driver-class-path argument pointing to extra jars that should be provided to workers with the job. However this does not work with this approach, so we must configure these paths for front end and worker nodes in the spark-defaults.conf file, usually in /opt/spark/conf directory.
spark.driver.extraClassPath /"your-path"/mysql-connector-java-5.1.35-bin.jar
spark.executor.extraClassPath /"your-path"/mysql-connector-java-5.1.35-bin.jar

Related

Could not instantiate EventHubSourceProvider for Azure Databricks

Using the steps documented in structured streaming pyspark, I'm unable to create a dataframe in pyspark from the Azure Event Hub I have set up in order to read the stream data.
Error message is:
java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.eventhubs.EventHubsSourceProvider could not be instantiated
I have installed the Maven libraries (com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.12 is unavailable) but none appear to work:
com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.15
com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.6
As well as ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString) but the error message returned is:
java.lang.NoSuchMethodError: org.apache.spark.internal.Logging.$init$(Lorg/apache/spark/internal/Logging;)V
The connection string is correct as it is also used in a console application that writes to the Azure Event Hub and that works.
Can someone point me in the right direction, please. Code in use is as follows:
from pyspark.sql.functions import *
from pyspark.sql.types import *
# Event Hub Namespace Name
NAMESPACE_NAME = "*myEventHub*"
KEY_NAME = "*MyPolicyName*"
KEY_VALUE = "*MySharedAccessKey*"
# The connection string to your Event Hubs Namespace
connectionString = "Endpoint=sb://{0}.servicebus.windows.net/;SharedAccessKeyName={1};SharedAccessKey={2};EntityPath=ingestion".format(NAMESPACE_NAME, KEY_NAME, KEY_VALUE)
ehConf = {}
ehConf['eventhubs.connectionString'] = connectionString
# For 2.3.15 version and above, the configuration dictionary requires that connection string be encrypted.
# ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
df = spark \
.readStream \
.format("eventhubs") \
.options(**ehConf) \
.load()
To resolve the issue, I did the following:
Uninstall azure event hub library versions
Install com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.15 library version from Maven Central
Restart cluster
Validate by re-running code provided in the question
I received this same error when installing libraries with the version number com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.* on a Spark cluster running Spark 3.0 with Scala 2.12
For anyone else finding this via google - check if you have the correct Scala library version. In my case, my cluster is Spark v3 with Scala 2.12
Changing the "2.11" in the library version from the tutorial I was using to "2.12", so it matches my cluster runtime version, fixed the issue.
I had to take this a step further. in the format method I had to add in this:
.format("org.apache.spark.sql.eventhubs.EventHubsSourceProvider") directly.
check the cluster scala version and the library version
Unisntall the older libraries and install :
com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.17
in the shared workspace(right click and install library) and also in the cluster

Spark ElasticSearch Configuration - Reading Elastic Search from Spark

I am trying to read data from ElasticSearch via Spark Scala. I see lot of post addressing this question, i have tried all the options they have mentioned in various posts but seems nothing is working for me
JAR Used - elasticsearch-hadoop-5.6.8.jar (Used elasticsearch-spark-5.6.8.jar too without any success)
Elastic Search Version - 5.6.8
Spark - 2.3.0
Scala - 2.11
Code:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.elasticsearch.spark._
val spark = SparkSession.builder.appName("elasticSpark").master("local[*]").getOrCreate()
val reader = spark.read.format("org.elasticsearch.spark.sql").option("es.index.auto.create", "true").option("spark.serializer", "org.apache.spark.serializer.KryoSerializer").option("es.port", "9200").option("es.nodes", "xxxxxxxxx").option("es.nodes.wan.only", "true").option("es.net.http.auth.user","xxxxxx").option("es.net.http.auth.pass", "xxxxxxxx")
val read = reader.load("index/type")
Error:
ERROR rest.NetworkClient: Node [xxxxxxxxx:9200] failed (The server xxxxxxxxxxxxx failed to respond); no other nodes left - aborting...
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:294)
at org.elasticsearch.spark.sql.SchemaUtils$.discoverMappingAndGeoFields(SchemaUtils.scala:98)
at org.elasticsearch.spark.sql.SchemaUtils$.discoverMapping(SchemaUtils.scala:91)
at org.elasticsearch.spark.sql.ElasticsearchRelation.lazySchema$lzycompute(DefaultSource.scala:129)
at org.elasticsearch.spark.sql.ElasticsearchRelation.lazySchema(DefaultSource.scala:129)
at org.elasticsearch.spark.sql.ElasticsearchRelation$$anonfun$schema$1.apply(DefaultSource.scala:133)
at org.elasticsearch.spark.sql.ElasticsearchRelation$$anonfun$schema$1.apply(DefaultSource.scala:133)
at scala.Option.getOrElse(Option.scala:121)
at org.elasticsearch.spark.sql.ElasticsearchRelation.schema(DefaultSource.scala:133)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:432)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
... 53 elided
Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[xxxxxxxxxxx:9200]]
at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:149)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:461)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:425)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:429)
at org.elasticsearch.hadoop.rest.RestClient.get(RestClient.java:155)
at org.elasticsearch.hadoop.rest.RestClient.remoteEsVersion(RestClient.java:655)
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:287)
... 65 more
Apart from this I have also tried below properties without any success:
option("es.net.ssl.cert.allow.self.signed", "true")
option("es.net.ssl.truststore.location", "<path for elasticsearch cert file>")
option("es.net.ssl.truststore.pass", "xxxxxx")
Please note elasticsearch node is within Unix edge node and is http://xxxxxx:9200 (mentioning it just in case if that makes any difference with the code)
What I am missing here? Any other properties? Please Help
Use below Jar which support spark 2+ version instead of Elastic-Hadoop or Elastic-Spark jar.
https://mvnrepository.com/artifact/org.elasticsearch/elasticsearch-spark-20_2.11/5.6.8

How to tell PySpark where the pymongo-spark package is located? [duplicate]

I'm having difficulty getting these components to knit together properly. I have Spark installed and working successfully, I can run jobs locally, standalone, and also via YARN. I have followed the steps advised (to the best of my knowledge) here and here
I'm working on Ubuntu and the various component versions I have are
Spark spark-1.5.1-bin-hadoop2.6
Hadoop hadoop-2.6.1
Mongo 2.6.10
Mongo-Hadoop connector cloned from https://github.com/mongodb/mongo-hadoop.git
Python 2.7.10
I had some difficulty following the various steps such as which jars to add to which path, so what I have added are
in /usr/local/share/hadoop-2.6.1/share/hadoop/mapreduce I have added mongo-hadoop-core-1.5.0-SNAPSHOT.jar
the following environment variables
export HADOOP_HOME="/usr/local/share/hadoop-2.6.1"
export PATH=$PATH:$HADOOP_HOME/bin
export SPARK_HOME="/usr/local/share/spark-1.5.1-bin-hadoop2.6"
export PYTHONPATH="/usr/local/share/mongo-hadoop/spark/src/main/python"
export PATH=$PATH:$SPARK_HOME/bin
My Python program is basic
from pyspark import SparkContext, SparkConf
import pymongo_spark
pymongo_spark.activate()
def main():
conf = SparkConf().setAppName("pyspark test")
sc = SparkContext(conf=conf)
rdd = sc.mongoRDD(
'mongodb://username:password#localhost:27017/mydb.mycollection')
if __name__ == '__main__':
main()
I am running it using the command
$SPARK_HOME/bin/spark-submit --driver-class-path /usr/local/share/mongo-hadoop/spark/build/libs/ --master local[4] ~/sparkPythonExample/SparkPythonExample.py
and I am getting the following output as a result
Traceback (most recent call last):
File "/home/me/sparkPythonExample/SparkPythonExample.py", line 24, in <module>
main()
File "/home/me/sparkPythonExample/SparkPythonExample.py", line 17, in main
rdd = sc.mongoRDD('mongodb://username:password#localhost:27017/mydb.mycollection')
File "/usr/local/share/mongo-hadoop/spark/src/main/python/pymongo_spark.py", line 161, in mongoRDD
return self.mongoPairRDD(connection_string, config).values()
File "/usr/local/share/mongo-hadoop/spark/src/main/python/pymongo_spark.py", line 143, in mongoPairRDD
_ensure_pickles(self)
File "/usr/local/share/mongo-hadoop/spark/src/main/python/pymongo_spark.py", line 80, in _ensure_pickles
orig_tb)
py4j.protocol.Py4JError
According to here
This exception is raised when an exception occurs in the Java client
code. For example, if you try to pop an element from an empty stack.
The instance of the Java exception thrown is stored in the
java_exception member.
Looking at the source code for pymongo_spark.py and the line throwing the error, it says
"Error while communicating with the JVM. Is the MongoDB Spark jar on
Spark's CLASSPATH? : "
So in response, I have tried to be sure the right jars are being passed, but I might be doing this all wrong, see below
$SPARK_HOME/bin/spark-submit --jars /usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar,/usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-java-driver-3.0.4.jar --driver-class-path /usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-java-driver-3.0.4.jar,/usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar --master local[4] ~/sparkPythonExample/SparkPythonExample.py
I have imported pymongo to the same python program to verify that I can at least access MongoDB using that, and I can.
I know there are quite a few moving parts here so if I can provide any more useful information please let me know.
Updates:
2016-07-04
Since the last update MongoDB Spark Connector matured quite a lot. It provides up-to-date binaries and data source based API but it is using SparkConf configuration so it is subjectively less flexible than the Stratio/Spark-MongoDB.
2016-03-30
Since the original answer I found two different ways to connect to MongoDB from Spark:
mongodb/mongo-spark
Stratio/Spark-MongoDB
While the former one seems to be relatively immature the latter one looks like a much better choice than a Mongo-Hadoop connector and provides a Spark SQL API.
# Adjust Scala and package version according to your setup
# although officially 0.11 supports only Spark 1.5
# I haven't encountered any issues on 1.6.1
bin/pyspark --packages com.stratio.datasource:spark-mongodb_2.11:0.11.0
df = (sqlContext.read
.format("com.stratio.datasource.mongodb")
.options(host="mongo:27017", database="foo", collection="bar")
.load())
df.show()
## +---+----+--------------------+
## | x| y| _id|
## +---+----+--------------------+
## |1.0|-1.0|56fbe6f6e4120712c...|
## |0.0| 4.0|56fbe701e4120712c...|
## +---+----+--------------------+
It seems to be much more stable than mongo-hadoop-spark, supports predicate pushdown without static configuration and simply works.
The original answer:
Indeed, there are quite a few moving parts here. I tried to make it a little bit more manageable by building a simple Docker image which roughly matches described configuration (I've omitted Hadoop libraries for brevity though). You can find complete source on GitHub (DOI 10.5281/zenodo.47882) and build it from scratch:
git clone https://github.com/zero323/docker-mongo-spark.git
cd docker-mongo-spark
docker build -t zero323/mongo-spark .
or download an image I've pushed to Docker Hub so you can simply docker pull zero323/mongo-spark):
Start images:
docker run -d --name mongo mongo:2.6
docker run -i -t --link mongo:mongo zero323/mongo-spark /bin/bash
Start PySpark shell passing --jars and --driver-class-path:
pyspark --jars ${JARS} --driver-class-path ${SPARK_DRIVER_EXTRA_CLASSPATH}
And finally see how it works:
import pymongo
import pymongo_spark
mongo_url = 'mongodb://mongo:27017/'
client = pymongo.MongoClient(mongo_url)
client.foo.bar.insert_many([
{"x": 1.0, "y": -1.0}, {"x": 0.0, "y": 4.0}])
client.close()
pymongo_spark.activate()
rdd = (sc.mongoRDD('{0}foo.bar'.format(mongo_url))
.map(lambda doc: (doc.get('x'), doc.get('y'))))
rdd.collect()
## [(1.0, -1.0), (0.0, 4.0)]
Please note that mongo-hadoop seems to close the connection after the first action. So calling for example rdd.count() after the collect will throw an exception.
Based on different problems I've encountered creating this image I tend to believe that passing mongo-hadoop-1.5.0-SNAPSHOT.jar and mongo-hadoop-spark-1.5.0-SNAPSHOT.jar to both --jars and --driver-class-path is the only hard requirement.
Notes:
This image is loosely based on jaceklaskowski/docker-spark
so please be sure to send some good karma to #jacek-laskowski if it helps.
If don't require a development version including new API then using --packages is most likely a better option.
Can you try using --package option instead of --jars ... in your spark-submit command:
spark-submit --packages org.mongodb.mongo-hadoop:mongo-hadoop-core:1.3.1,org.mongodb:mongo-java-driver:3.1.0 [REST OF YOUR OPTIONS]
Some of these jar files are not Uber jars and need more dependencies to be downloaded before that can get to work.
I was having this same problem yesterday. Was able to fix it by placing mongo-java-driver.jar in $HADOOP_HOME/lib and mongo-hadoop-core.jar and mongo-hadoop-spark.jar in $HADOOP_HOME/spark/classpath/emr (Or any other folder that is in the $SPARK_CLASSPATH).
Let me know if that helps.
Good Luck!
#see https://github.com/mongodb/mongo-hadoop/wiki/Spark-Usage
from pyspark import SparkContext, SparkConf
import pymongo_spark
# Important: activate pymongo_spark.
pymongo_spark.activate()
def main():
conf = SparkConf().setAppName("pyspark test")
sc = SparkContext(conf=conf)
# Create an RDD backed by the MongoDB collection.
# This RDD *does not* contain key/value pairs, just documents.
# If you want key/value pairs, use the mongoPairRDD method instead.
rdd = sc.mongoRDD('mongodb://localhost:27017/db.collection')
# Save this RDD back to MongoDB as a different collection.
rdd.saveToMongoDB('mongodb://localhost:27017/db.other.collection')
# You can also read and write BSON:
bson_rdd = sc.BSONFileRDD('/path/to/file.bson')
bson_rdd.saveToBSON('/path/to/bson/output')
if __name__ == '__main__':
main()

Pyspark not running the sqlContext in Pycharm

I hope someone could help with this problem I am having. I have previously setup a VM in windows using CENTOS, with hadoop and spark (all in singlenode) and it was working perfectly.
I am now running a multinode setup with another computer, both running CENTOS standalone. I have installed hadoop successfully and is running on both machines. Then I've installed spark with the following setup:
Version : Spark 2.2.1-bin-hadoop2.7, with the .bashrc file as follows:
export SPARK_HOME=/opt/spark/spark-2.2.1-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
export PATH="/home/hadoop/anaconda2/bin:$PATH"
I am using anaconda (python 2.7 version) to install the pyspark packages. I then have the $SPARK_HOME/conf files setup as follows:
the slaves file as:
datanode1
(the hostname of the node which i use to conduct the processing on)
and the spark-env.sh file:
export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk
export HADOOP_CONF_DIR=/opt/hadoop/hadoop-2.8.3/etc/hadoop
export SPARK_WORKER_CORES=6
The idea is that I then connect the spark to PyCharm IDE to do my work on. In Pycharm I have setup the environment variables (under run -> edit configurations) as
PYTHON PATH /opt/spark/spark-2.2.1-bin-hadoop2.7/python/lib
SPARK_HOME /opt/spark/spark-2.2.1-bin-hadoop2.7
I have also setup my python interpreter to point to the anaconda python directory.
With all this setup I get multiple errors as output when I call either a spark SQLContext or SparkSession.Builder, for example:
conf = SparkConf().setMaster("local[*]")
sc = SparkContext(conf=conf)
sql_sc = SQLContext(sc)
or
spark = SparkSession.builder.master("local").appName("PythonTutPrac").config("spark.executor.memory","2gb").getOrCreate()
The ERROR:
File "/home/hadoop/Desktop/PythonPrac/CollaborativeFiltering.py", line
72, in
.config("spark.executor.memory", "2gb") \ File "/opt/spark/spark-2.2.1-bin-hadoop2.7/python/pyspark/sql/session.py",
line 183, in getOrCreate
session._jsparkSession.sessionState().conf().setConfString(key, value) File
"/home/hadoop/anaconda2/lib/python2.7/site-packages/py4j/java_gateway.py",
line 1160, in call
answer, self.gateway_client, self.target_id, self.name) File "/opt/spark/spark-2.2.1-bin-hadoop2.7/python/pyspark/sql/utils.py",
line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)1, stackTrace) pyspark.sql.utils.IllegalArgumentException: u"Error while
instantiating 'org.apache.spark.sql.internal.SessionStateBuilder':"
Unhandled exception in thread started by > Process finished with exit code 1
Error image
I do not know why this error message is showing, when I was running this in my VM single node, it was working fine. I then decided in my multinode setup to remove the datanode1 and just run it again as a singlenode setup with my main computer (hostname - master), but still getting the same errors.
I hope someone could help, as I have followed other guides to setup pycharm with pyspark, but could not figure out what is going wrong. Thanks!

Error running hadoop application in Eclipse on Windows

I'm trying to set up an Eclipse environment for developing and debugging hadoop. I'm following Tom White's Definitive Hadoop 3rd ed. What I would like to do is get the MaxTemperature app working locally on my Windows within Eclipse before moving it to my Hortonworks sandbox VM. The comment on page 158 about using the local job runner seems to be what I want. I don't want to set up a full hadoop implementation on Windows. I'm hoping with the right config params I can convince it to run as a java application inside Eclipse.
Windows: 7
Eclipse: Luna
Hadoop: 2.4.0
JDK: 7
When I set the Run configuration for MaxTemperatureDriver (Source code on page 157) to
inputfile outputdir foo (deliberate bogus 3rd parameter)
I get the usage message so I know I'm running my program with those params.
If I remove the bogus third param I get
Exception in thread "main" java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:120)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1255)
at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1251)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapreduce.Job.connect(Job.java:1250)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1279)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)
at mark.MaxTemperatureDriver.run(MaxTemperatureDriver.java:52)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at mark.MaxTemperatureDriver.main(MaxTemperatureDriver.java:56)
I've tried inserting -conf but it seems to be ignored. There is no error message if I specify a nonexistent path.
I've tried inserting -fs file:/// -jt local, but it makes no difference
I've tried inserting -D mapreduce.framework.name=local
I've tried specifying the input and output with the file: format
Note. I'm not asking about how to configure eclipse to connect to a remote Hadoop installation. I want the application to run within eclipse.
Is this possible? Any ideas?
Additional info:
I turned on debugging. I saw:
582 [main] DEBUG org.apache.hadoop.mapreduce.Cluster - Trying ClientProtocolProvider : org.apache.hadoop.mapred.YarnClientProtocolProvider
583 [main] DEBUG org.apache.hadoop.mapreduce.Cluster - Cannot pick org.apache.hadoop.mapred.YarnClientProtocolProvider as the ClientProtocolProvider - returned null protocol
I'm wondering not why YarnClientProtocolProvider failed, but why it didn't try LocalClientProtocolProvider.
New info:
It seems that this is an issue with Hadoop 2.4.0. I recreated my environment with Hadoop 1.2.1, followed the instructions in
http://gerrymcnicol.com/index.php/2014/01/02/hadoop-and-cassandra-part-4-writing-your-first-mapreduce-job/
added the Windows hack from
http://bigdatanerd.wordpress.com/2013/11/14/mapreduce-running-mapreduce-in-windows-file-system-debug-mapreduce-in-eclipse
and it all started working.
Following blog will be useful.
Running mapreduce in Windows filesystem