Spark to MongoDB via Mesos

I am trying to connect Apache Spark to MongoDB using Mesos. Here is my architecture:
MongoDB: MongoDB cluster of 2 shards, 1 config server and 1 query server.
Mesos: 1 Mesos master, 4 Mesos slaves
Now I have installed Spark on just 1 node. There is not much information available on this out there. I just wanted to pose a few questions:
As far as I understand, I can connect Spark to MongoDB via Mesos; in other words, I end up using MongoDB as the storage layer. Do I really need Hadoop? Is it mandatory to pull all the data into Hadoop just for Spark to read it?
Here is the reason I am asking this. The Spark installation expects the HADOOP_HOME variable to be set. This seems like very tight coupling! Most of the posts on the net speak about the MongoDB-Hadoop connector. It doesn't make sense if I'm forced to move everything into Hadoop.
Does anyone have an answer?
Regards
Mario

Spark itself takes a dependency on Hadoop, and data in HDFS can be used as a data source.
However, if you use the MongoDB Spark Connector, you can use MongoDB as a data source for Spark without going via Hadoop at all.
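A minimal PySpark sketch (not from the original answer): the connection string, database and collection are placeholders, and the mongo-spark-connector_2.10/2.11 package must be on the classpath (e.g. via spark-submit --packages).

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = (SparkConf()
        .setAppName("mongo-read-example")
        # placeholder connection string; point it at your query server (mongos)
        .set("spark.mongodb.input.uri", "mongodb://mongos-host:27017/mydb.mycollection"))
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)   # with Spark 2.x you would build a SparkSession instead

# Load the collection into a DataFrame through the connector; no Hadoop/HDFS involved
df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()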

The Spark-MongoDB connector is a good idea; moreover, if you are executing Spark in a Hadoop cluster you need to set HADOOP_HOME.
Check the requirements and test it (tutorial):
Basic working knowledge of MongoDB and Apache Spark. Refer to the MongoDB documentation and Spark documentation.
Running MongoDB instance (version 2.6 or later).
Spark 1.6.x.
Scala 2.10.x if using the mongo-spark-connector_2.10 package
Scala 2.11.x if using the mongo-spark-connector_2.11 package
The new MongoDB Connector for Apache Spark provides higher performance, greater ease of use, and access to more advanced Spark functionality than the MongoDB Connector for Hadoop. The MongoDB documentation compares the capabilities of both connectors.
Then you need to configure Spark with Mesos:
Connecting Spark to Mesos
To use Mesos from Spark, you need a Spark binary package available in a place accessible by Mesos, and a Spark driver program configured to connect to Mesos.
Alternatively, you can also install Spark in the same location in all the Mesos slaves, and configure spark.mesos.executor.home (defaults to SPARK_HOME) to point to that location.
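A minimal PySpark sketch of the driver-side configuration (illustrative values, not from the original answer): the Mesos master URL and the /opt/spark path are placeholders.

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("spark-on-mesos")
        # placeholder Mesos master URL; use your master's host:port
        # (or a mesos://zk://... URL when using ZooKeeper for Mesos HA)
        .setMaster("mesos://mesos-master-host:5050")
        # executors use the Spark install present on each Mesos slave;
        # alternatively, set spark.executor.uri to a Spark binary package reachable by Mesos
        .set("spark.mesos.executor.home", "/opt/spark"))
sc = SparkContext(conf=conf)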

Related

two kafka versions running on same cluster

I am trying to configure two Kafka servers on a cluster of 3 nodes, while there is already one Kafka broker (version 0.8) running with the application, and there is a dependency on that Kafka 0.8 broker that cannot be disturbed/upgraded.
Now, for a POC, I need to configure 1.0.0 since my new code is compatible with this version and above...
My task is to push data from Oracle to Hive tables. For this I am using JDBC to fetch data from Oracle and the Hive JDBC driver to push data to Hive tables. It should be a fast and easy way...
I need the following help:
Can I use spark-submit to run this data push to Hive?
Can I simply copy kafka_2.12-1.0.0 onto one of the nodes of my Linux cluster and run my code on it? I think I need to configure my zookeeper.properties and server.properties with ports not in use and start the new ZooKeeper and Kafka services separately. Please note I cannot disturb the existing ZooKeeper and Kafka that are already running.
Kindly help me achieve this.
I'm not sure running two very memory intensive applications (Kafka and/or Kafka Connect) on the same machines is considered very safe. Especially if you do not want to disturb existing applications. Realistically, a rolling restart w/ upgrade will be best for performance and feature reasons. And, no, two Kafka versions should not be part of the same cluster, unless you are in the middle of a rolling upgrade scenario.
If at all possible, please use new hardware... I assume Kafka 0.8 is even running on machines that could be old, and out of warranty? Then there's no significant reason that I know of not to use a newer version of Kafka. But yes, extract it on any machine you'd like; perhaps use something like Ansible, or whichever config management tool you prefer, to do it for you.
You can actually share the same ZooKeeper cluster; just make sure the two Kafka clusters do not use the same ZooKeeper paths (chroots). For example,
Cluster 0.8
zookeeper.connect=zoo.example.com:2181/kafka08
Cluster 1.x
zookeeper.connect=zoo.example.com:2181/kafka10
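For example (illustrative values, not from the original answer), the new 1.0 broker's server.properties could be pointed at the second chroot and given a listener port the existing 0.8 broker is not already using (assuming it keeps the default 9092):
broker.id=1
listeners=PLAINTEXT://:9093
log.dirs=/var/lib/kafka10-logs
zookeeper.connect=zoo.example.com:2181/kafka10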
Also, it's not clear where Spark fits into this architecture. Please don't use a JDBC sink for Hive. Use the proper HDFS Kafka Connect sink, which has direct Hive support via the metastore. And while the JDBC source might work for Oracle, chances are you might already be able to afford a license for GoldenGate.
I am able to run two Kafka versions, 0.8 and 1.0, on the same server with their respective ZooKeepers.
Steps followed:
1. Copy the version package folder to the server at the desired location.
2. Change the configuration settings in zookeeper.properties and server.properties (here you need to set ports which are not in use on that particular server).
3. Start the services and push data to Kafka topics.
Note: this setup is only for a POC and not an ideal production environment. As answered above, we should upgrade to the next version rather than run what is practiced above.

Is it possible to load a database directly from HDFS into spark as a DataFrame?

I have my MongoDB and Spark running on Zeppelin, both sharing the same HDFS. MongoDB produces .wt database files stored in that same HDFS.
I want to load the database collection produced by that MongoDB from the HDFS into a Spark DataFrame.
Is it possible to load the database directly from HDFS into spark as a DataFrame? Or do I need to use a MongoDB Spark connector?
I would not recommend reading or modifying the WiredTiger storage engine's internal *.wt files. Firstly, these internal files are subject to change without notice (they are not a public-facing API); also, any unintended modification to these files may leave the database in an invalid/corrupt state.
You can utilise the MongoDB Spark Connector to load data from MongoDB into Spark. This connector is designed, developed and optimised for reading and writing data between MongoDB and Apache Spark. For example, by accessing data via the database, the client can utilise the database's indexes to optimise read operations.
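A rough PySpark sketch of reading one collection through the connector instead of from the .wt files; the connection string, database and collection names are placeholders, and the mongo-spark-connector package must be on the classpath.

# "spark" is the SparkSession available in the Zeppelin/pyspark environment
df = (spark.read
      .format("com.mongodb.spark.sql.DefaultSource")
      # placeholder connection string: the connector talks to mongod, not to files on HDFS
      .option("uri", "mongodb://mongo-host:27017/")
      .option("database", "mydb")
      .option("collection", "mycollection")
      .load())
df.printSchema()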
See also:
GitHub demo: Docker for MongoDB, Apache Spark and Apache Zeppelin
GitHub demo: Docker for MongoDB and Apache Spark

MongoDB with the Spark connector

If I have a replica set with MongoDB, then the primary server receives all the write/read operations and applies the writes.
The secondary servers read the operations from the oplog and replicate them.
Now I would like to analyze the data in the MongoDB replica set with the spark-mongodb-connector. I can install a Spark cluster on all three nodes and run analytics on it in memory.
I understand that a Spark cluster has a master node to which I have to submit the Spark job for analytics, or Spark Streaming. Both are installed on an application server in Tomcat.
Now I need to choose a master node so I can submit the job from my Tomcat app server to the Spark cluster.
Should the primary server be the Spark master node, so that the driver of an application can connect to it to submit jobs?
What would be the Spark master in a sharded cluster?
It doesn't really matter which node is the Spark Master in your cluster.
The Spark master will be responsible for assigning the tasks to the Spark executors, it will not receive all read/write requests.
Each executor will then be responsible for fetching the data it needs to process.
Be careful about data partitioning in Spark: it might happen that MongoDB only provides a single partition to start with, so you might want to repartition first.
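A small illustrative example of checking and adjusting the partitioning after loading through the connector (the partition count is a placeholder):

df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
# Check how many partitions the connector produced for this collection
print(df.rdd.getNumPartitions())
# Repartition before heavy processing if that number turns out to be too small
df = df.repartition(16)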

Akka Persistence: migrating from jdbc (postgres) to cassandra

I have a running project using the akka-persistence-jdbc plugin and PostgreSQL as a backend.
Now I want to migrate to akka-persistence-cassandra.
But how can I convert the existing events (more than 4GB in Postgres) to Cassandra?
Should I write a manual migration program, reading from Postgres and writing in the proper format to Cassandra?
This is a classic migration problem. There are multiple solutions for this.
Spark SQL and the Spark Cassandra Connector: the Spark JDBC API (i.e. Spark DataFrames / Spark SQL) allows you to read from any JDBC source. You can read it in chunks by partitioning it, otherwise you will run out of memory; partitioning also makes the migration parallel. Then write the data into Cassandra with the Spark Cassandra Connector. This is by far the simplest and most efficient way, and the one I have used in my own tasks (see the sketch at the end of this answer).
Java agents: a Java agent can be written with plain JDBC or other libraries and then write to Cassandra with the DataStax driver. A Spark program runs on multiple machines in a multi-threaded way and recovers automatically if something goes wrong; if you write an agent like this manually, it only runs on a single machine and multi-threading also needs to be coded.
Kafka connectors: Kafka is a message broker, and it can be used indirectly for the migration. Kafka Connect has connectors which can read from and write to different databases. You can use the JDBC source connector to read from Postgres and the Cassandra sink connector to write to Cassandra. It's not that easy to set up, but it has the advantage of involving no coding.
ETL systems: some ETL systems have support for Cassandra, but I haven't personally tried any.
I saw some advantages in using the Spark Cassandra Connector and Spark SQL for the migration; some of them are:
Code was concise: it was hardly 40 lines.
Multi-machine (and multi-threaded on each machine).
Job progress and statistics in the Spark master UI.
Fault tolerance: if a Spark node goes down or a thread/worker fails there, the job is automatically started on another node, which is good for very long-running jobs.
If you don't know Spark, then writing an agent is okay for 4GB of data.
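A rough PySpark sketch of the first approach (Spark JDBC source plus the Spark Cassandra Connector), as referenced above. Connection details, the partition column and bounds, and the target keyspace/table are placeholders; the rows must also be mapped to whatever schema your akka-persistence-cassandra version expects, which is omitted here. The PostgreSQL JDBC driver and the spark-cassandra-connector package need to be on the classpath.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("jdbc-to-cassandra-migration")
         # placeholder Cassandra contact point
         .config("spark.cassandra.connection.host", "cassandra-host")
         .getOrCreate())

# Read the journal table in parallel chunks so no single executor holds all 4GB
journal = (spark.read.format("jdbc")
           .option("url", "jdbc:postgresql://pg-host:5432/akka")
           .option("dbtable", "journal")
           .option("user", "akka")
           .option("password", "secret")
           .option("partitionColumn", "ordering")  # a numeric column to split on
           .option("lowerBound", "1")
           .option("upperBound", "10000000")
           .option("numPartitions", "16")
           .load())

# ... map "journal" onto the layout expected by akka-persistence-cassandra here ...

# Write the result through the Spark Cassandra Connector
(journal.write
 .format("org.apache.spark.sql.cassandra")
 .options(keyspace="akka", table="messages")
 .mode("append")
 .save())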

connect to memsql from pyspark shell

Is it possible to connect to MemSQL from pyspark?
I heard that MemSQL recently built the Streamliner infrastructure on top of pyspark to allow for custom Python transformations.
But does this mean I can run pyspark or submit a Python Spark job that connects to MemSQL?
Yes to both questions.
Streamliner is the best approach if your aim is to get data into MemSQL or perform transformation during ingest. How to use Python with Streamliner: http://docs.memsql.com/latest/spark/memsql-spark-interface-python/
You can also query MemSQL from a Spark application. Details on that here: http://docs.memsql.com/latest/spark/spark-sql-pushdown/
You can also run a Spark shell. See http://docs.memsql.com/latest/ops/cli/SPARK-SHELL/ & http://docs.memsql.com/latest/spark/admin/#launching-the-spark-shell
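Outside of Streamliner, a quick way to test connectivity from the pyspark shell is Spark's generic JDBC data source, since MemSQL speaks the MySQL wire protocol. This is only a sketch: the host, credentials and table are placeholders, a MySQL JDBC driver jar must be on the classpath, and the MemSQL Spark connector documented above provides its own data source as well.

# In a Spark 1.6-era pyspark shell, sqlContext is already defined
df = (sqlContext.read.format("jdbc")
      .option("url", "jdbc:mysql://memsql-master:3306/mydb")  # placeholder MemSQL master aggregator
      .option("dbtable", "mytable")
      .option("user", "root")
      .option("password", "")
      .load())
df.show(5)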