I'm having difficulty getting these components to knit together properly. I have Spark installed and working successfully; I can run jobs locally, standalone, and also via YARN. I have followed the steps advised (to the best of my knowledge) here and here.
I'm working on Ubuntu and the various component versions I have are:
Spark spark-1.5.1-bin-hadoop2.6
Hadoop hadoop-2.6.1
Mongo 2.6.10
Mongo-Hadoop connector cloned from https://github.com/mongodb/mongo-hadoop.git
Python 2.7.10
I had some difficulty following the various steps such as which jars to add to which path, so what I have added are
in /usr/local/share/hadoop-2.6.1/share/hadoop/mapreduce I have added mongo-hadoop-core-1.5.0-SNAPSHOT.jar
the following environment variables
export HADOOP_HOME="/usr/local/share/hadoop-2.6.1"
export PATH=$PATH:$HADOOP_HOME/bin
export SPARK_HOME="/usr/local/share/spark-1.5.1-bin-hadoop2.6"
export PYTHONPATH="/usr/local/share/mongo-hadoop/spark/src/main/python"
export PATH=$PATH:$SPARK_HOME/bin
My Python program is basic
from pyspark import SparkContext, SparkConf
import pymongo_spark
pymongo_spark.activate()
def main():
    conf = SparkConf().setAppName("pyspark test")
    sc = SparkContext(conf=conf)
    rdd = sc.mongoRDD(
        'mongodb://username:password@localhost:27017/mydb.mycollection')
if __name__ == '__main__':
    main()
I am running it using the command
$SPARK_HOME/bin/spark-submit --driver-class-path /usr/local/share/mongo-hadoop/spark/build/libs/ --master local[4] ~/sparkPythonExample/SparkPythonExample.py
and I am getting the following output as a result
Traceback (most recent call last):
  File "/home/me/sparkPythonExample/SparkPythonExample.py", line 24, in <module>
    main()
  File "/home/me/sparkPythonExample/SparkPythonExample.py", line 17, in main
    rdd = sc.mongoRDD('mongodb://username:password@localhost:27017/mydb.mycollection')
  File "/usr/local/share/mongo-hadoop/spark/src/main/python/pymongo_spark.py", line 161, in mongoRDD
    return self.mongoPairRDD(connection_string, config).values()
  File "/usr/local/share/mongo-hadoop/spark/src/main/python/pymongo_spark.py", line 143, in mongoPairRDD
    _ensure_pickles(self)
  File "/usr/local/share/mongo-hadoop/spark/src/main/python/pymongo_spark.py", line 80, in _ensure_pickles
    orig_tb)
py4j.protocol.Py4JError
According to here
This exception is raised when an exception occurs in the Java client
code. For example, if you try to pop an element from an empty stack.
The instance of the Java exception thrown is stored in the
java_exception member.
Looking at the source code for pymongo_spark.py and the line throwing the error, it says
"Error while communicating with the JVM. Is the MongoDB Spark jar on
Spark's CLASSPATH? : "
So in response, I have tried to be sure the right jars are being passed, but I might be doing this all wrong, see below
$SPARK_HOME/bin/spark-submit --jars /usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar,/usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-java-driver-3.0.4.jar --driver-class-path /usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-java-driver-3.0.4.jar,/usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar --master local[4] ~/sparkPythonExample/SparkPythonExample.py
I have imported pymongo to the same python program to verify that I can at least access MongoDB using that, and I can.
I know there are quite a few moving parts here so if I can provide any more useful information please let me know.
Updates:
2016-07-04
Since the last update, the MongoDB Spark Connector has matured quite a lot. It provides up-to-date binaries and a data source based API, but it relies on SparkConf configuration, so it is subjectively less flexible than Stratio/Spark-MongoDB.
2016-03-30
Since the original answer I found two different ways to connect to MongoDB from Spark:
mongodb/mongo-spark
Stratio/Spark-MongoDB
While the former seems to be relatively immature, the latter looks like a much better choice than the Mongo-Hadoop connector and provides a Spark SQL API.
# Adjust Scala and package version according to your setup
# although officially 0.11 supports only Spark 1.5
# I haven't encountered any issues on 1.6.1
bin/pyspark --packages com.stratio.datasource:spark-mongodb_2.11:0.11.0
df = (sqlContext.read
    .format("com.stratio.datasource.mongodb")
    .options(host="mongo:27017", database="foo", collection="bar")
    .load())
df.show()
## +---+----+--------------------+
## | x| y| _id|
## +---+----+--------------------+
## |1.0|-1.0|56fbe6f6e4120712c...|
## |0.0| 4.0|56fbe701e4120712c...|
## +---+----+--------------------+
It seems to be much more stable than mongo-hadoop-spark, supports predicate pushdown without static configuration and simply works.
The original answer:
Indeed, there are quite a few moving parts here. I tried to make it a little bit more manageable by building a simple Docker image which roughly matches described configuration (I've omitted Hadoop libraries for brevity though). You can find complete source on GitHub (DOI 10.5281/zenodo.47882) and build it from scratch:
git clone https://github.com/zero323/docker-mongo-spark.git
cd docker-mongo-spark
docker build -t zero323/mongo-spark .
or download an image I've pushed to Docker Hub (so you can simply docker pull zero323/mongo-spark).
Start images:
docker run -d --name mongo mongo:2.6
docker run -i -t --link mongo:mongo zero323/mongo-spark /bin/bash
Start PySpark shell passing --jars and --driver-class-path:
pyspark --jars ${JARS} --driver-class-path ${SPARK_DRIVER_EXTRA_CLASSPATH}
And finally see how it works:
import pymongo
import pymongo_spark
mongo_url = 'mongodb://mongo:27017/'
client = pymongo.MongoClient(mongo_url)
client.foo.bar.insert_many([
    {"x": 1.0, "y": -1.0}, {"x": 0.0, "y": 4.0}])
client.close()
pymongo_spark.activate()
rdd = (sc.mongoRDD('{0}foo.bar'.format(mongo_url))
    .map(lambda doc: (doc.get('x'), doc.get('y'))))
rdd.collect()
## [(1.0, -1.0), (0.0, 4.0)]
Please note that mongo-hadoop seems to close the connection after the first action. So calling for example rdd.count() after the collect will throw an exception.
Based on different problems I've encountered creating this image I tend to believe that passing mongo-hadoop-1.5.0-SNAPSHOT.jar and mongo-hadoop-spark-1.5.0-SNAPSHOT.jar to both --jars and --driver-class-path is the only hard requirement.
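A minimal sketch of assembling such a command line, assuming hypothetical jar locations under /path/to (the function name spark_submit_args is also made up for illustration). It is only meant to show that --jars takes a comma-separated list, while --driver-class-path is an ordinary JVM classpath and therefore uses the platform path separator (":" on Linux):

```python
import os


def spark_submit_args(jars, script, master="local[4]"):
    """Build a spark-submit argument list that passes the same jars twice."""
    return [
        "spark-submit",
        "--jars", ",".join(jars),                      # comma-separated list
        "--driver-class-path", os.pathsep.join(jars),  # JVM classpath separator
        "--master", master,
        script,
    ]


# Hypothetical jar locations; substitute the paths produced by your own build.
jars = [
    "/path/to/mongo-hadoop-1.5.0-SNAPSHOT.jar",
    "/path/to/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar",
]
args = spark_submit_args(jars, "SparkPythonExample.py")
print(" ".join(args))
```

The resulting list can be handed to subprocess.call, or simply printed and pasted into a shell.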
Notes:
This image is loosely based on jaceklaskowski/docker-spark, so please be sure to send some good karma to @jacek-laskowski if it helps.
If you don't require a development version including the new API, then using --packages is most likely a better option.
Can you try using the --packages option instead of --jars ... in your spark-submit command:
spark-submit --packages org.mongodb.mongo-hadoop:mongo-hadoop-core:1.3.1,org.mongodb:mongo-java-driver:3.1.0 [REST OF YOUR OPTIONS]
Some of these jar files are not uber jars and need more dependencies to be downloaded before they can work.
I was having this same problem yesterday. I was able to fix it by placing mongo-java-driver.jar in $HADOOP_HOME/lib and mongo-hadoop-core.jar and mongo-hadoop-spark.jar in $HADOOP_HOME/spark/classpath/emr (or any other folder that is in the $SPARK_CLASSPATH).
Let me know if that helps.
Good Luck!
#see https://github.com/mongodb/mongo-hadoop/wiki/Spark-Usage
from pyspark import SparkContext, SparkConf
import pymongo_spark
# Important: activate pymongo_spark.
pymongo_spark.activate()
def main():
    conf = SparkConf().setAppName("pyspark test")
    sc = SparkContext(conf=conf)
    # Create an RDD backed by the MongoDB collection.
    # This RDD *does not* contain key/value pairs, just documents.
    # If you want key/value pairs, use the mongoPairRDD method instead.
    rdd = sc.mongoRDD('mongodb://localhost:27017/db.collection')
    # Save this RDD back to MongoDB as a different collection.
    rdd.saveToMongoDB('mongodb://localhost:27017/db.other.collection')
    # You can also read and write BSON:
    bson_rdd = sc.BSONFileRDD('/path/to/file.bson')
    bson_rdd.saveToBSON('/path/to/bson/output')
if __name__ == '__main__':
    main()
Related
For staging and production, my code will be running on PySpark. However, in my local development environment, I will not be running my code on PySpark.
This presents a problem from the standpoint of logging. Because one uses the Java library Log4J via Py4J when using PySpark, one will not be using Log4J for the local development.
Thankfully, the API for Log4J and the core Python logging module are the same: once you get a logger object, with either module you simply debug() or info() etc.
Thus, I wish to detect whether my code is being imported/run in a PySpark or a non-PySpark environment, similar to:
class App:
    def our_logger(self):
        if self.running_under_spark():
            sc = SparkContext(conf=conf)
            log4jLogger = sc._jvm.org.apache.log4j
            log = log4jLogger.LogManager.getLogger(__name__)
            log.warn("Hello World!")
            return log
        else:
            from loguru import logger
            return logger
How might I implement running_under_spark()?
Simply trying to import pyspark and seeing if it works is not a fail-proof way of doing this because I have pyspark in my dev environment to kill warnings about non-imported modules in the code from my IDE.
Maybe you can set an environment variable in your Spark environment (in $SPARK_HOME/conf/spark-env.sh) that you check for at runtime:
export SPARKY=spark
Then you check if SPARKY exists to determine if you're in your spark environment.
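from os import environ
class App:
    def our_logger(self):
        if environ.get('SPARKY') is not None:
            # Import here so the module still loads where PySpark is missing.
            from pyspark import SparkContext
            # conf is assumed to be a SparkConf defined elsewhere in the app.
            sc = SparkContext(conf=conf)
            log4jLogger = sc._jvm.org.apache.log4j
            log = log4jLogger.LogManager.getLogger(__name__)
            log.warn("Hello World!")
            return log
        else:
            from loguru import logger
            return logger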
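For reference, here is a self-contained sketch of the same idea with the check pulled out into the running_under_spark() function the question asks for. The marker name SPARKY comes from the export above. Using the standard logging module as the fallback (instead of loguru) and calling SparkContext.getOrCreate() are my own assumptions, made so that the snippet runs without extra packages:

```python
import logging
import os


def running_under_spark(marker="SPARKY"):
    """Return True when the marker variable exported in spark-env.sh is set."""
    return os.environ.get(marker) is not None


def our_logger(name):
    """Return a Log4J logger under Spark, or a standard Python logger otherwise."""
    if running_under_spark():
        # Imported lazily so this module still loads where PySpark is absent.
        from pyspark import SparkContext
        sc = SparkContext.getOrCreate()
        return sc._jvm.org.apache.log4j.LogManager.getLogger(name)
    # Assumption: stdlib logging stands in for loguru in local development.
    return logging.getLogger(name)
```

Both branches return an object with debug(), info(), warn() and error() methods, so calling code does not need to know which one it received.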
Using the steps documented in structured streaming pyspark, I'm unable to create a dataframe in pyspark from the Azure Event Hub I have set up in order to read the stream data.
Error message is:
java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.eventhubs.EventHubsSourceProvider could not be instantiated
I have installed the Maven libraries (com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.12 is unavailable) but none appear to work:
com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.15
com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.6
I have also tried ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString), but the error message returned is:
java.lang.NoSuchMethodError: org.apache.spark.internal.Logging.$init$(Lorg/apache/spark/internal/Logging;)V
The connection string is correct as it is also used in a console application that writes to the Azure Event Hub and that works.
Can someone point me in the right direction, please? The code in use is as follows:
from pyspark.sql.functions import *
from pyspark.sql.types import *
# Event Hub Namespace Name
NAMESPACE_NAME = "*myEventHub*"
KEY_NAME = "*MyPolicyName*"
KEY_VALUE = "*MySharedAccessKey*"
# The connection string to your Event Hubs Namespace
connectionString = "Endpoint=sb://{0}.servicebus.windows.net/;SharedAccessKeyName={1};SharedAccessKey={2};EntityPath=ingestion".format(NAMESPACE_NAME, KEY_NAME, KEY_VALUE)
ehConf = {}
ehConf['eventhubs.connectionString'] = connectionString
# For 2.3.15 version and above, the configuration dictionary requires that connection string be encrypted.
# ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
df = spark \
    .readStream \
    .format("eventhubs") \
    .options(**ehConf) \
    .load()
To resolve the issue, I did the following:
Uninstall azure event hub library versions
Install com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.15 library version from Maven Central
Restart cluster
Validate by re-running code provided in the question
I received this same error when installing libraries with the version number com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.* on a Spark cluster running Spark 3.0 with Scala 2.12
For anyone else finding this via google - check if you have the correct Scala library version. In my case, my cluster is Spark v3 with Scala 2.12
Changing the "2.11" in the library version from the tutorial I was using to "2.12", so it matches my cluster runtime version, fixed the issue.
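A small sketch of that check, for anyone who wants to verify a coordinate before installing it. The helper names are hypothetical, and the cluster's Scala version is assumed to be known already (for example from the cluster's runtime description):

```python
def scala_suffix(coordinate):
    """Return the Scala binary version encoded in a Maven coordinate, or None."""
    group, artifact, version = coordinate.split(":")
    name, sep, suffix = artifact.rpartition("_")
    return suffix if sep else None


def matches_cluster(coordinate, cluster_scala):
    """True when the artifact's Scala suffix equals the cluster's Scala version."""
    return scala_suffix(coordinate) == cluster_scala


# Coordinates taken from this thread; "2.12" is the cluster described above.
print(matches_cluster("com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.15", "2.12"))  # False
print(matches_cluster("com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.17", "2.12"))  # True
```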
I had to take this a step further: in the format method I had to pass the provider class directly:
.format("org.apache.spark.sql.eventhubs.EventHubsSourceProvider")
Check the cluster's Scala version and the library version.
Uninstall the older libraries and install:
com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.17
in the shared workspace (right-click and install library) and also on the cluster.
I am new to Apache Beam, and I come from the Spark world, where the API is so rich.
How can I get the schema of a Parquet file using Apache Beam without loading the data into memory? The file can sometimes be huge, and I am only interested in knowing the columns and, optionally, their types.
The language is Python.
The storage system is Google Cloud Storage, and the Apache Beam job must be run in Dataflow.
FYI, I have tried the following, as suggested on Stack Overflow:
from pyarrow.parquet import ParquetFile
ParquetFile(source).metadata
First, it didn't work when I gave it a gs://.. path; it failed with this error: error: No such file or directory
Then I tried it on a local file on my machine, and slightly changed the code to:
from pyarrow.parquet import ParquetFile
ParquetFile(source).metadata.schema
And so I could have the columns :
<pyarrow._parquet.ParquetSchema object at 0x10927cfd0>
name: BYTE_ARRAY
age: INT64
hobbies: BYTE_ARRAY String
But this solution, it seems to me, requires copying the file locally first (to the Dataflow server?), and it doesn't use Apache Beam.
Any (better) solution?
Thank you!
I'm happy I could come up with a handmade solution after reading the source code of apache_beam.io.parquetio:
import pyarrow.parquet as pq
from apache_beam.io.parquetio import _ParquetSource
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '<json_key_path>'
ps = _ParquetSource("", None, None, None)  # file_pattern, min_bundle_size, validate, columns
with ps.open_file("<GCS_path_of_parquet_file>") as f:
    pf = pq.ParquetFile(f)
    print(pf.metadata.schema)
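As background on why reading only the metadata is cheap: a Parquet file stores its schema in a footer, and the file ends with a 4-byte little-endian footer length followed by the magic bytes PAR1. The sketch below only locates that footer in a seekable file object. It does not decode the Thrift-encoded metadata (pyarrow does that part), and the function name footer_range is made up for illustration:

```python
import io
import struct

PARQUET_MAGIC = b"PAR1"


def footer_range(f):
    """Return (offset, length) of the footer metadata in a seekable Parquet file."""
    end_of_footer = f.seek(-8, io.SEEK_END)  # last 8 bytes: length + magic
    tail = f.read(8)
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError("not a Parquet file: trailing magic bytes missing")
    (length,) = struct.unpack("<I", tail[:4])
    return end_of_footer - length, length


# Synthetic bytes laid out like a Parquet file; the footer body is a dummy.
fake = io.BytesIO(b"PAR1" + b"x" * 40 + b"META!" + struct.pack("<I", 5) + b"PAR1")
print(footer_range(fake))  # (44, 5)
```

So a reader needs the last few bytes plus the footer itself, not the column data, which is consistent with the snippet above touching only pf.metadata.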
I hope someone can help with this problem I am having. I previously set up a VM on Windows running CentOS, with Hadoop and Spark (all on a single node), and it was working perfectly.
I am now running a multi-node setup with another computer, both running CentOS standalone. I have installed Hadoop successfully and it is running on both machines. Then I installed Spark with the following setup:
Version : Spark 2.2.1-bin-hadoop2.7, with the .bashrc file as follows:
export SPARK_HOME=/opt/spark/spark-2.2.1-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
export PATH="/home/hadoop/anaconda2/bin:$PATH"
I am using anaconda (python 2.7 version) to install the pyspark packages. I then have the $SPARK_HOME/conf files setup as follows:
the slaves file as:
datanode1
(the hostname of the node which i use to conduct the processing on)
and the spark-env.sh file:
export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk
export HADOOP_CONF_DIR=/opt/hadoop/hadoop-2.8.3/etc/hadoop
export SPARK_WORKER_CORES=6
The idea is that I then connect Spark to the PyCharm IDE to do my work. In PyCharm I have set up the environment variables (under Run -> Edit Configurations) as
PYTHON PATH /opt/spark/spark-2.2.1-bin-hadoop2.7/python/lib
SPARK_HOME /opt/spark/spark-2.2.1-bin-hadoop2.7
I have also setup my python interpreter to point to the anaconda python directory.
With all this setup I get multiple errors as output when I call either a spark SQLContext or SparkSession.Builder, for example:
conf = SparkConf().setMaster("local[*]")
sc = SparkContext(conf=conf)
sql_sc = SQLContext(sc)
or
spark = SparkSession.builder.master("local").appName("PythonTutPrac").config("spark.executor.memory","2gb").getOrCreate()
The ERROR:
  File "/home/hadoop/Desktop/PythonPrac/CollaborativeFiltering.py", line 72, in <module>
    .config("spark.executor.memory", "2gb") \
  File "/opt/spark/spark-2.2.1-bin-hadoop2.7/python/pyspark/sql/session.py", line 183, in getOrCreate
    session._jsparkSession.sessionState().conf().setConfString(key, value)
  File "/home/hadoop/anaconda2/lib/python2.7/site-packages/py4j/java_gateway.py", line 1160, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/spark/spark-2.2.1-bin-hadoop2.7/python/pyspark/sql/utils.py", line 79, in deco
    raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.internal.SessionStateBuilder':"
Unhandled exception in thread started by >
Process finished with exit code 1
I do not know why this error message is showing, when I was running this in my VM single node, it was working fine. I then decided in my multinode setup to remove the datanode1 and just run it again as a singlenode setup with my main computer (hostname - master), but still getting the same errors.
I hope someone could help, as I have followed other guides to setup pycharm with pyspark, but could not figure out what is going wrong. Thanks!
I have asked this question before but did not get any answer (Not able to connect to postgres using jdbc in pyspark shell).
I have successfully installed Spark 1.3.0 on my local Windows machine and ran sample programs to test it using the pyspark shell.
Now, I want to run Correlations from MLlib on data that is stored in PostgreSQL, but I am not able to connect to PostgreSQL.
I have successfully added the required jar (tested this jar) in the classpath by running
pyspark --jars "C:\path\to\jar\postgresql-9.2-1002.jdbc3.jar"
I can see that jar is successfully added in environment UI.
When I run the following in pyspark shell-
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.load(source="jdbc",url="jdbc:postgresql://[host]/[dbname]", dbtable="[schema.table]")
I get this ERROR -
>>> df = sqlContext.load(source="jdbc",url="jdbc:postgresql://[host]/[dbname]", dbtable="[schema.table]")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\ACERNEW3\Desktop\Spark\spark-1.3.0-bin-hadoop2.4\python\pyspark\sql\context.py", line 482, in load
df = self._ssql_ctx.load(source, joptions)
File "C:\Users\ACERNEW3\Desktop\Spark\spark-1.3.0-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line 538, in __call__
File "C:\Users\ACERNEW3\Desktop\Spark\spark-1.3.0-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o20.load.
: java.sql.SQLException: No suitable driver found for jdbc:postgresql://[host]/[dbname]
at java.sql.DriverManager.getConnection(DriverManager.java:602)
at java.sql.DriverManager.getConnection(DriverManager.java:207)
at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:94)
at org.apache.spark.sql.jdbc.JDBCRelation.<init> (JDBCRelation.scala:125)
at org.apache.spark.sql.jdbc.DefaultSource.createRelation(JDBCRelation.scala:114)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:290)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:679)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:667)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:619)
I had this exact problem with MySQL/MariaDB, and got a BIG clue from this question.
So your pyspark command should be:
pyspark --conf spark.executor.extraClassPath=<jdbc.jar> --driver-class-path <jdbc.jar> --jars <jdbc.jar> --master <master-URL>
Also watch for errors when pyspark starts, such as "Warning: Local jar ... does not exist, skipping." and "ERROR SparkContext: Jar not found at ..."; these probably mean you spelled the path wrong.
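To catch a misspelled path before launching, a short pre-flight check along these lines may help. This is only a sketch: the helper names are hypothetical, and the argument list simply mirrors the command shown above for a single driver jar.

```python
import os


def missing_jars(jar_paths):
    """Return the entries of jar_paths that are not existing files."""
    return [p for p in jar_paths if not os.path.isfile(p)]


def pyspark_args(jar, master):
    """Assemble the pyspark arguments from the command above for one driver jar."""
    return [
        "pyspark",
        "--conf", "spark.executor.extraClassPath=" + jar,
        "--driver-class-path", jar,
        "--jars", jar,
        "--master", master,
    ]


# Hypothetical location; replace it with the real path of your JDBC driver.
jar = "/path/to/jdbc-driver.jar"
for path in missing_jars([jar]):
    print("jar not found:", path)
print(" ".join(pyspark_args(jar, "local[*]")))
```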
A slightly more elegant solution:
import java.util.Properties
val props = new Properties
props.put("driver", "org.postgresql.Driver")
sqlContext.read.jdbc("jdbc:postgresql://[host]/[dbname]", "[schema.table]", props)
As jake256 suggested, the "driver", "org.postgresql.Driver" key-value pair was missing. In my case, I launched pyspark as:
pyspark --jars /path/to/postgresql-9.4.1210.jar
with the following instructions:
from pyspark.sql import DataFrameReader
url = 'postgresql://192.168.2.4:5432/postgres'
properties = {'user': 'myUser', 'password': 'myPasswd', 'driver': 'org.postgresql.Driver'}
df = DataFrameReader(sqlContext).jdbc(
    url='jdbc:%s' % url, table='weather', properties=properties
)
df.show()
+-------------+-------+-------+-----------+----------+
| city|temp_lo|temp_hi| prcp| date|
+-------------+-------+-------+-----------+----------+
|San Francisco| 46| 50| 0.25|1994-11-27|
|San Francisco| 43| 57| 0.0|1994-11-29|
| Hayward| 54| 37|0.239999995|1994-11-29|
+-------------+-------+-------+-----------+----------+
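If several tables are read this way, the URL and properties can be built in one place so the driver entry is never forgotten. A minimal sketch; the helper name postgres_jdbc_options is hypothetical and the values are the ones from the example above:

```python
def postgres_jdbc_options(host, port, database, user, password):
    """Return (url, properties) for a PostgreSQL JDBC read, driver class included."""
    url = "jdbc:postgresql://{0}:{1}/{2}".format(host, port, database)
    properties = {
        "user": user,
        "password": password,
        # Naming the driver class explicitly is what avoids "No suitable driver".
        "driver": "org.postgresql.Driver",
    }
    return url, properties


url, properties = postgres_jdbc_options(
    "192.168.2.4", 5432, "postgres", "myUser", "myPasswd")
print(url)
# In the pyspark shell the pair would then be used as:
#   df = sqlContext.read.jdbc(url=url, table="weather", properties=properties)
```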
Tested on :
Ubuntu 16.04
PostgreSQL server version 9.5.
Postgresql driver used is postgresql-9.4.1210.jar
and Spark version is spark-2.0.0-bin-hadoop2.6
but I am also confident that it should also work on
spark-2.0.0-bin-hadoop2.7.
Java JDK 1.8 64bits
other JDBC Drivers can be found on :
https://www.petefreitag.com/articles/jdbc_urls/
tutorial I followed is on :
https://developer.ibm.com/clouddataservices/2015/08/19/speed-your-sql-queries-with-spark-sql/
similar solution was suggested also on :
pyspark mysql jdbc load An error occurred while calling o23.load No suitable driver
This error seems to get thrown when you use the wrong version of JDBC driver. Check https://jdbc.postgresql.org/download.html to make sure that you have the right one.
Note in particular:
JDK 1.1 - JDBC 1. Note that with the 8.0 release JDBC 1 support has
been removed, so look to update your JDK when you update your server.
JDK 1.2, 1.3 - JDBC 2. JDK 1.3 + J2EE - JDBC 2 EE. This contains
additional support for javax.sql classes.
JDK 1.4, 1.5 - JDBC 3. This contains support for SSL and javax.sql, but does not require J2EE as it has been added to the J2SE release.
JDK 1.6 - JDBC4. Support for JDBC4 methods is not complete, but the majority of methods are implemented.
JDK 1.7, 1.8 - JDBC41. Support for JDBC4 methods is not
complete, but the majority of methods are implemented.
Please see this post: just place your script after all the options.
That's pretty straightforward. To connect to an external database and retrieve data into Spark dataframes, an additional jar file is required. For example, with MySQL the JDBC driver is required. Download the driver package and extract mysql-connector-java-x.yy.zz-bin.jar to a path that's accessible from every node in the cluster, preferably a path on a shared file system. For example, with Pouta Virtual Cluster such a path would be under /shared_data; here I use /shared_data/thirdparty_jars/.
With direct Spark job submissions from the terminal, one can specify the --driver-class-path argument pointing to extra jars that should be provided to workers with the job. However, this does not work with this approach, so we must configure these paths for the front end and worker nodes in the spark-defaults.conf file, usually in the /opt/spark/conf directory.
spark.driver.extraClassPath /"your-path"/mysql-connector-java-5.1.35-bin.jar
spark.executor.extraClassPath /"your-path"/mysql-connector-java-5.1.35-bin.jar
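When more than one jar is needed, both entries take an ordinary classpath string. The following sketch generates the two lines. The function name is hypothetical, the second jar is a made-up example, and the ":" separator assumes the nodes run Linux:

```python
def extra_classpath_lines(jar_paths, sep=":"):
    """Return the two spark-defaults.conf lines for the given jars."""
    classpath = sep.join(jar_paths)
    return [
        "spark.driver.extraClassPath " + classpath,
        "spark.executor.extraClassPath " + classpath,
    ]


# The first path follows the example above; the second jar is hypothetical.
for line in extra_classpath_lines([
        "/shared_data/thirdparty_jars/mysql-connector-java-5.1.35-bin.jar",
        "/shared_data/thirdparty_jars/another-driver.jar"]):
    print(line)
```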