I'm working with Apache Spark Streaming (NO SPARK SQL) and I would like to store the results of my script in MongoDB.
How can I do that? I found some tutorials, but they only work with Apache Spark SQL.
Thanks
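One common pattern, sketched below under assumptions (a socket text source, a local MongoDB, and placeholder database/collection names), is to write each micro-batch out with foreachRDD and pymongo instead of going through Spark SQL:

from pymongo import MongoClient
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamToMongo")
ssc = StreamingContext(sc, 5)  # 5-second batches

lines = ssc.socketTextStream("localhost", 9999)  # placeholder source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))

def save_partition(records):
    # One MongoDB connection per partition, not per record.
    client = MongoClient("mongodb://localhost:27017")  # placeholder URI
    docs = [{"word": w, "count": c} for w, c in records]
    if docs:
        client["mydb"]["word_counts"].insert_many(docs)  # placeholder db/collection
    client.close()

counts.foreachRDD(lambda rdd: rdd.foreachPartition(save_partition))

ssc.start()
ssc.awaitTermination()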
I have MongoDB and Spark running on Zeppelin, both sharing the same HDFS. MongoDB produces .wt (WiredTiger) database files stored in that same HDFS.
I want to load the database collection produced by that MongoDB instance from HDFS into a Spark DataFrame.
Is it possible to load the database directly from HDFS into Spark as a DataFrame, or do I need to use a MongoDB Spark connector?
I would not recommend reading or modifying the internal WiredTiger storage engine's *.wt files. First, these internal files are subject to change without notice (they are not a public-facing API), and any unintended modification may leave the database in an invalid or corrupt state.
You can use the MongoDB Spark Connector to load data from MongoDB into Spark. The connector is designed, developed and optimised for reading and writing data between MongoDB and Apache Spark. For example, by accessing data through the database, the client can use the database's indexes to optimise read operations.
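A minimal read sketch, assuming the Mongo Spark Connector 2.x on Spark 2.x (the format name and option keys differ between connector versions, and the URI, database and collection below are placeholders):

df = (spark.read
      .format("mongo")
      .option("uri", "mongodb://mongo-host:27017/mydb.mycoll")  # placeholder URI
      .load())
df.printSchema()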
See also:
GitHub demo: Docker for MongoDB, Apache Spark and Apache Zeppelin
GitHub demo: Docker for MongoDB and Apache Spark
Is it possible to read HBase tables directly as PySpark DataFrames without using Hive, Phoenix, or the Spark-HBase connector provided by Hortonworks?
I'm comparatively new to HBase and couldn't find a direct Python example for converting HBase tables into PySpark DataFrames. Most of the examples I saw were in Scala or Java.
You can connect to HBase via Phoenix. A sample snippet:

df = sqlContext.read.format("jdbc").options(
    driver="org.apache.phoenix.jdbc.PhoenixDriver",
    url="jdbc:phoenix:url:port:/hbase-unsecure",  # replace url:port with your ZooKeeper quorum host and port
    dbtable="table_name").load()

You may need the Spark-Phoenix connector jars phoenix-spark-4.7.0-HBase-1.1.jar and phoenix-4.7.0-HBase-1.1-client.jar on the classpath. Thanks
I am looking for a way to make use of Spark Streaming in NiFi. I have seen a couple of posts where a Site-to-Site TCP connection is used for the Spark Streaming application, but I think it would be better if I could launch Spark Streaming from a custom NiFi processor.
PublishKafka will publish messages to Kafka, and then the NiFi Spark Streaming processor will read from the Kafka topic.
I can launch a Spark Streaming application from a custom NiFi processor using the Spark launcher API, but the biggest challenge is that it creates a Spark streaming context for each flow file, which can be a costly operation.
Does anyone suggest storing the Spark streaming context in a controller service, or is there a better approach for running a Spark Streaming application with NiFi?
You can make use of ExecuteSparkInteractive to run the Spark code you are trying to include in your Spark Streaming application.
You need a few things set up for the Spark code to run from within NiFi:
Set up a Livy server
Add the NiFi controller services that start Spark Livy sessions:
LivySessionController
StandardSSLContextService (may be required)
Once you enable the LivySessionController within NiFi, it will start Spark sessions, and you can check on the Spark UI whether those Livy sessions are up and running.
Now that the Livy Spark sessions are running, whenever a flow file moves through the NiFi flow, the Spark code is run within ExecuteSparkInteractive.
This is similar to a Spark Streaming application running outside NiFi. For me this approach works very well and is easier to maintain than a separate Spark Streaming application.
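As an illustration, here is the kind of snippet that might go in the ExecuteSparkInteractive Code property. It runs inside the Livy session started by the LivySessionController; the paths and column name are placeholders, and on older Livy/Spark versions the session exposes sqlContext instead of spark:

# Runs inside the Livy pyspark session; `spark` is provided by the session.
df = spark.read.json("/data/incoming/events")          # placeholder input path
counts = df.groupBy("status").count()                  # placeholder column
counts.write.mode("overwrite").parquet("/data/output/event_counts")  # placeholder output path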
Hope this helps!
I can launch a Spark Streaming application from a custom NiFi processor using the Spark launcher API, but the biggest challenge is that it creates a Spark streaming context for each flow file, which can be a costly operation.
You'd be launching a standalone application in each case, which is not what you want. If you are going to integrate with Spark Streaming or Flink, you should be using something like Kafka to pub-sub between them.
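For instance, a minimal Spark Streaming consumer for a topic that NiFi's PublishKafka writes to might look like the sketch below, assuming the spark-streaming-kafka-0-8 direct-stream integration (the broker address and topic name are placeholders):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="NifiKafkaConsumer")
ssc = StreamingContext(sc, 10)  # 10-second batches

# Direct stream against the topic PublishKafka writes to.
stream = KafkaUtils.createDirectStream(
    ssc, ["nifi-topic"], {"metadata.broker.list": "kafka-broker:9092"})

stream.map(lambda kv: kv[1]).pprint()  # values only

ssc.start()
ssc.awaitTermination()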
I am trying to connect Apache Spark to MongoDB using Mesos. Here is my architecture:
MongoDB: MongoDB Cluster of 2 shards, 1 config server and 1 query server.
Mesos: 1 Mesos Master, 4 Mesos slaves
Now I have installed Spark on just one node. There is not much information available on this out there, so I just wanted to pose a few questions:
As far as I understand, I can connect Spark to MongoDB via Mesos; in other words, I end up using MongoDB as the storage layer. Do I really need Hadoop? Is it mandatory to pull all the data into Hadoop just for Spark to read it?
Here is the reason I am asking: the Spark install expects the HADOOP_HOME variable to be set, which seems like very tight coupling. Most of the posts on the net talk about the MongoDB-Hadoop connector, which doesn't make sense if it forces me to move everything to Hadoop.
Does anyone have an answer?
Regards
Mario
Spark itself takes a dependency on the Hadoop libraries, and data in HDFS can be used as a data source.
However, if you use the MongoDB Spark Connector, you can use MongoDB as a data source for Spark without going via Hadoop at all.
The Spark-Mongo connector is a good idea. Moreover, if you are executing Spark in a Hadoop cluster you need to set HADOOP_HOME.
Check your requirements and test it (tutorial). The tutorial's prerequisites are:
Basic working knowledge of MongoDB and Apache Spark. Refer to the MongoDB documentation and Spark documentation.
Running MongoDB instance (version 2.6 or later).
Spark 1.6.x.
Scala 2.10.x if using the mongo-spark-connector_2.10 package
Scala 2.11.x if using the mongo-spark-connector_2.11 package
The new MongoDB Connector for Apache Spark provides higher performance, greater ease of use, and access to more advanced Spark functionality than the MongoDB Connector for Hadoop.
Then you need to configure Spark to run with Mesos:
Connecting Spark to Mesos
To use Mesos from Spark, you need a Spark binary package available in a place accessible by Mesos, and a Spark driver program configured to connect to Mesos.
Alternatively, you can also install Spark in the same location in all the Mesos slaves, and configure spark.mesos.executor.home (defaults to SPARK_HOME) to point to that location.
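Putting the pieces together, a configuration sketch for running against Mesos and reading straight from MongoDB; the master URL, executor home, and MongoDB URI are placeholders, and the mongo-spark-connector package still has to be on the classpath (e.g. via --packages):

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = (SparkConf()
        .setAppName("mongo-on-mesos")
        .setMaster("mesos://mesos-master:5050")           # placeholder Mesos master URL
        .set("spark.mesos.executor.home", "/opt/spark")   # where Spark lives on each slave
        .set("spark.mongodb.input.uri",
             "mongodb://query-server:27017/mydb.mycoll")) # placeholder MongoDB URI

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Read the collection straight from MongoDB -- no HDFS involved.
df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()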
Is it possible to connect to MemSQL from PySpark?
I heard that MemSQL recently built the Streamliner infrastructure on top of PySpark to allow for custom Python transformations.
But does this mean I can run PySpark or submit a Python Spark job that connects to MemSQL?
Yes to both questions.
Streamliner is the best approach if your aim is to get data into MemSQL or to perform transformations during ingest. How to use Python with Streamliner: http://docs.memsql.com/latest/spark/memsql-spark-interface-python/
You can also query MemSQL from a Spark application. Details on that here: http://docs.memsql.com/latest/spark/spark-sql-pushdown/
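If you only need to read query results into a DataFrame and don't want the connector described in the link above, a plain JDBC fallback also works because MemSQL speaks the MySQL wire protocol. The host, database, table, and credentials below are placeholders, and a MySQL JDBC driver must be on the Spark classpath:

df = sqlContext.read.format("jdbc").options(
    url="jdbc:mysql://memsql-master:3306/mydb",  # placeholder host/database
    driver="com.mysql.jdbc.Driver",
    dbtable="events",                            # placeholder table
    user="root",
    password="").load()
df.show()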
You can also run a Spark shell. See http://docs.memsql.com/latest/ops/cli/SPARK-SHELL/ & http://docs.memsql.com/latest/spark/admin/#launching-the-spark-shell