Connect to MemSQL from the pyspark shell - pyspark

Is it possible to connect to MemSQL from pyspark?
I heard that MemSQL recently built the Streamliner infrastructure on top of pyspark to allow for custom Python transformations.
But does this mean I can run the pyspark shell, or submit a Python Spark job, that connects to MemSQL?

Yes to both questions.
Streamliner is the best approach if your aim is to get data into MemSQL or perform transformations during ingest. How to use Python with Streamliner: http://docs.memsql.com/latest/spark/memsql-spark-interface-python/
You can also query MemSQL from a Spark application. Details on that here: http://docs.memsql.com/latest/spark/spark-sql-pushdown/
You can also run a Spark shell. See http://docs.memsql.com/latest/ops/cli/SPARK-SHELL/ & http://docs.memsql.com/latest/spark/admin/#launching-the-spark-shell
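If you just want to query MemSQL from the pyspark shell without Streamliner, a minimal sketch is to go over plain JDBC, since MemSQL speaks the MySQL wire protocol. The host, database, table and credentials below are placeholders, and the MySQL JDBC driver jar has to be on the Spark classpath.
# Minimal sketch: read a MemSQL table into a DataFrame over JDBC.
# MemSQL is MySQL wire-compatible, so the stock MySQL JDBC driver works.
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # `sc` is already defined in the pyspark shell

df = (sqlContext.read
      .format("jdbc")
      .option("driver", "com.mysql.jdbc.Driver")
      .option("url", "jdbc:mysql://memsql-master:3306/mydb")   # placeholder host/db
      .option("dbtable", "my_table")                           # placeholder table
      .option("user", "root")
      .option("password", "")
      .load())

df.show()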

Related

How to get Nats messages using pySpark (no Scala)

I found two libraries for working with NATS, for Java and Scala (https://github.com/Logimethods/nats-connector-spark-scala, https://github.com/Logimethods/nats-connector-spark). Writing a separate connector in Scala and sending its output to pySpark feels wrong. Is there any other way to connect pySpark to NATS?
Spark-submit version 2.3.0.2.6.5.1100-53
Thanks in advance.
In general, there is no clean way :( The only option I found is to use a Python NATS client that sends its output to a socket, and pyspark then processes the data received from that socket.
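For what it's worth, the pyspark half of that workaround is just a socket text stream. The host, port and batch interval below are placeholders, and the relay (any NATS Python client that re-emits each message as one line over TCP) is assumed to already be listening.
# Minimal sketch of the socket workaround described above. A separate, non-Spark
# Python process subscribes to NATS and serves each message as one line over TCP;
# pyspark then consumes that socket as a text stream.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="nats-via-socket")
ssc = StreamingContext(sc, 10)  # 10-second batches, arbitrary

# The relay must be listening on this host:port and pushing one message per line.
lines = ssc.socketTextStream("localhost", 9999)
lines.pprint()  # replace with the real processing

ssc.start()
ssc.awaitTermination()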

How to read HBase table as a pyspark dataframe?

Is it possible to read HBase tables directly as pyspark DataFrames without using Hive, Phoenix, or the Spark-HBase connector provided by Hortonworks?
I'm comparatively new to HBase and couldn't find a direct Python example for converting HBase tables into pyspark DataFrames. Most of the examples I saw were either in Scala or Java.
You can connect to HBase via Phoenix. Sample code:
df = (sqlContext.read
      .format('jdbc')
      .options(driver="org.apache.phoenix.jdbc.PhoenixDriver",
               url='jdbc:phoenix:url:port:/hbase-unsecure',
               dbtable='table_name')
      .load())
You may also need the Phoenix Spark connector jars: phoenix-spark-4.7.0-HBase-1.1.jar and phoenix-4.7.0-HBase-1.1-client.jar. Thanks

Store Spark DStream to mongodb

I'm working with Apache Spark Streaming (NO SPARK SQL) and I would like to store the results of my script in MongoDB.
How can I do that? I found some tutorials, but they only work with Apache Spark SQL.
Thanks

Akka Persistence: migrating from jdbc (postgres) to cassandra

I have a running project that uses the akka-persistence-jdbc plugin with PostgreSQL as the backend.
Now I want to migrate to akka-persistence-cassandra.
But how can I convert the existing events (more than 4 GB in Postgres) to Cassandra?
Should I write a manual migration program that reads from Postgres and writes in the right format to Cassandra?
This is a classic migration problem. There are multiple solutions for this.
Spark SQL and Spark Cassandra Connector: the Spark JDBC (DataFrame / Spark SQL) API lets you read from any JDBC source. Read the table in chunks by segmenting it, otherwise you will run out of memory; segmentation also makes the migration parallel. Then write the data into Cassandra with the Spark Cassandra Connector. This is by far the simplest and most efficient way I have used for such tasks (a sketch is at the end of this answer).
Java agents: an agent can be written with plain JDBC or another library and then write to Cassandra with the DataStax driver. A Spark program runs across multiple machines and threads and recovers automatically if something goes wrong, but if you write an agent like this by hand it only runs on a single machine and the multithreading also has to be coded yourself.
Kafka connectors: Kafka is a message broker and can be used indirectly for the migration. Kafka Connect has connectors that read from and write to different databases; you can use the JDBC source connector to read from Postgres and the Cassandra sink connector to write to Cassandra. It is not that easy to set up, but it has the advantage that no coding is involved.
ETL systems: some ETL systems have support for Cassandra, but I haven't personally tried any of them.
I saw several advantages in using Spark SQL and the Spark Cassandra Connector for the migration, among them:
The code was concise - hardly 40 lines.
Multi-machine (and multi-threaded on each machine).
Job progress and statistics in the Spark master UI.
Fault tolerance - if a Spark node goes down or a thread/worker fails, the job is automatically restarted on another node, which is good for very long-running jobs.
If you don't know Spark, writing an agent is fine for 4 GB of data.
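For reference, here is a minimal pyspark sketch of option 1. The hosts, credentials, keyspace/table names and the "ordering" partition column are assumptions about your schema, you would still have to reshape the rows into whatever layout akka-persistence-cassandra expects, and the Postgres JDBC driver plus the Spark Cassandra Connector have to be on the classpath.
# Minimal sketch: read the Postgres journal in parallel chunks and push it to Cassandra.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = (SparkConf()
        .setAppName("pg-to-cassandra")
        .set("spark.cassandra.connection.host", "cassandra-host"))  # placeholder
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Segment the read on a numeric column so each of the 32 partitions pulls only a
# slice of the table; this is what keeps the job from running out of memory and
# what makes the migration parallel.
journal = (sqlContext.read.format("jdbc")
           .option("url", "jdbc:postgresql://pg-host:5432/akka")   # placeholders
           .option("dbtable", "journal")
           .option("user", "akka")
           .option("password", "secret")
           .option("partitionColumn", "ordering")
           .option("lowerBound", "1")
           .option("upperBound", "100000000")
           .option("numPartitions", "32")
           .load())

# Each partition is written straight to Cassandra by the Spark Cassandra Connector.
(journal.write
 .format("org.apache.spark.sql.cassandra")
 .options(keyspace="akka", table="messages")
 .mode("append")
 .save())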

Spark to MongoDB via Mesos

I am trying to connect Apache Spark to MongoDB using Mesos. Here is my architecture:
MongoDB: a MongoDB cluster with 2 shards, 1 config server, and 1 query server.
Mesos: 1 Mesos master, 4 Mesos slaves.
Right now I have installed Spark on just 1 node. There is not much information available on this out there, so I just wanted to pose a few questions:
As far as I understand, I can connect Spark to MongoDB via Mesos; in other words, I end up using MongoDB as the storage layer. Do I really need Hadoop? Is it mandatory to pull all the data into Hadoop just for Spark to read it?
Here is the reason I am asking: the Spark install expects the HADOOP_HOME variable to be set. This seems like very tight coupling! Most of the posts on the net talk about the MongoDB-Hadoop connector, which doesn't make sense if it forces me to move everything into Hadoop.
Does anyone have an answer?
Regards
Mario
Spark itself takes a dependency on Hadoop and data in HDFS can be used as a datasource.
However, if you use the Mongo Spark Connector you can use MongoDB as a datasource for Spark without going via Hadoop at all.
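As a rough sketch of what "without going via Hadoop" looks like from pyspark: the URI, database and collection below are placeholders, and the mongo-spark-connector package matching your Spark/Scala version has to be on the classpath.
# Minimal sketch: read a MongoDB collection directly with the Mongo Spark Connector.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = (SparkConf()
        .setAppName("mongo-direct")
        .set("spark.mongodb.input.uri",
             "mongodb://query-server:27017/mydb.mycoll"))  # placeholder URI
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()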
The Spark-Mongo connector is a good idea; moreover, if you are executing Spark in a Hadoop cluster you need to set HADOOP_HOME.
Check your requirements and test it (from the tutorial):
Basic working knowledge of MongoDB and Apache Spark. Refer to the MongoDB documentation and Spark documentation.
Running MongoDB instance (version 2.6 or later).
Spark 1.6.x.
Scala 2.10.x if using the mongo-spark-connector_2.10 package
Scala 2.11.x if using the mongo-spark-connector_2.11 package
The new MongoDB Connector for Apache Spark provides higher performance, greater ease of use, and access to more advanced Spark functionality than the MongoDB Connector for Hadoop.
Then you need to configure Spark to run on Mesos:
Connecting Spark to Mesos
To use Mesos from Spark, you need a Spark binary package available in a place accessible by Mesos, and a Spark driver program configured to connect to Mesos.
Alternatively, you can also install Spark in the same location in all the Mesos slaves, and configure spark.mesos.executor.home (defaults to SPARK_HOME) to point to that location.
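Putting the two together, here is a minimal sketch of pointing the same job at Mesos. The master URL and the Spark install path on the slaves are placeholders; set spark.mesos.executor.home only if Spark is installed locally on every slave rather than shipped as a binary package via spark.executor.uri.
# Minimal sketch: run the MongoDB read on a Mesos cluster instead of locally.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("mongo-on-mesos")
        .setMaster("mesos://mesos-master:5050")           # placeholder master URL
        .set("spark.mesos.executor.home", "/opt/spark")   # where Spark lives on the slaves
        .set("spark.mongodb.input.uri",
             "mongodb://query-server:27017/mydb.mycoll"))
sc = SparkContext(conf=conf)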