I am using the DataStax Cassandra driver 4.x in Scala. I build a minimal session as DataStax explains in its documentation:
ref. https://github.com/datastax/java-driver/tree/4.x/manual/core#cqlsession
val session: CqlSession = CqlSession.builder.build()
I have a few keyspaces and tables in my Cassandra instance (Apache Cassandra 3.11). However, when I try to get all keyspaces via the getMetadata method, it returns an empty list.
val keyspaces = session.getMetadata.getKeyspaces.values()
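For reference, here is a self-contained version of the two snippets above, as a minimal sketch for reproducing the issue (with no explicit contact point, the 4.x builder connects to 127.0.0.1:9042):

import com.datastax.oss.driver.api.core.CqlSession

object KeyspaceCheck extends App {
  // With no explicit contact point, the 4.x builder connects to 127.0.0.1:9042.
  val session: CqlSession = CqlSession.builder.build()

  // getKeyspaces returns a Map of CqlIdentifier -> KeyspaceMetadata;
  // printing the names makes it easy to see whether metadata was fetched.
  session.getMetadata.getKeyspaces.values().forEach(ks => println(ks.getName))

  session.close()
}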
Does anyone have the same problem, or know what is happening here?
Thanks so much!
I'm looking for a solution where I can save my Kafka streaming RDDs to Redis with a zscore, in append mode. Do we have any connector to do this? I tried the spark-redis connector by RedisLabs, but it is only compatible with Scala 2.10. There is one more GitHub project by Anchormen, https://github.com/Anchormen/spark-redis-connector, but it lacks documentation and its jar is not available on Maven.
collection.foreachRDD(rdd => {
  if (!rdd.partitions.isEmpty) {
    // save rdd to redis as a sorted set with zscore
  }
})
So, any suggestions please? Thanks.
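For what it's worth, here is a hedged sketch of the kind of body I have in mind for that foreachRDD block, assuming plain Jedis as the Redis client and a DStream of (member, score) pairs; the host, port and key name are placeholders:

import org.apache.spark.streaming.dstream.DStream
import redis.clients.jedis.Jedis

object KafkaToRedisZSet {
  // `collection` is assumed to be a DStream of (member, score) pairs,
  // e.g. produced from the Kafka stream by an earlier map step (not shown).
  def saveToRedis(collection: DStream[(String, Double)]): Unit = {
    collection.foreachRDD { rdd =>
      if (!rdd.partitions.isEmpty) {
        rdd.foreachPartition { records =>
          // One connection per partition; host, port and key are placeholders.
          val jedis = new Jedis("localhost", 6379)
          try {
            records.foreach { case (member, score) =>
              // ZADD inserts/updates each member with its score, so repeated
              // micro-batches effectively append to the sorted set.
              jedis.zadd("events-zset", score, member)
            }
          } finally {
            jedis.close()
          }
        }
      }
    }
  }
}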
I'm working on Apache Spark Streaming (NOT Spark SQL) and I would like to store the results of my script in MongoDB.
How can I do that? I found some tutorials, but they only work with Apache Spark SQL.
Thanks
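One possible approach, sketched under assumptions: write from inside foreachRDD/foreachPartition using the plain MongoDB Java driver, with no Spark SQL involved. The URI, database, collection and the (word, count) record shape below are placeholders:

import com.mongodb.client.MongoClients
import org.apache.spark.streaming.dstream.DStream
import org.bson.Document

object StreamToMongo {
  // `results` is assumed to be a DStream of (word, count) pairs; the URI,
  // database and collection names are placeholders.
  def save(results: DStream[(String, Long)]): Unit = {
    results.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // One client per partition, closed once the partition is written.
        val client = MongoClients.create("mongodb://localhost:27017")
        try {
          val coll = client.getDatabase("streaming").getCollection("results")
          records.foreach { case (word, count) =>
            coll.insertOne(new Document("word", word).append("count", count))
          }
        } finally {
          client.close()
        }
      }
    }
  }
}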
I have a running project using the akka-persistence-jdbc plugin with PostgreSQL as the backend.
Now I want to migrate to akka-persistence-cassandra.
But how can I convert the existing events (more than 4 GB in Postgres) to Cassandra?
Should I write a manual migration program, reading from Postgres and writing in the right format to Cassandra?
This is a classic migration problem. There are multiple solutions for this.
Spark SQL and Spark Cassandra Connector: The Spark JDBC API (via Spark DataFrames / Spark SQL) lets you read from any JDBC source. Read it in chunks by partitioning the query, otherwise you will run out of memory; the partitioning also makes the migration parallel. Then write the data to Cassandra with the Spark Cassandra Connector. This is by far the simplest and most efficient way I have used in my own tasks (a sketch appears at the end of this answer).
Java agents: A standalone agent can be written with plain JDBC or other libraries and then write to Cassandra with the DataStax driver. A Spark program runs across multiple machines in a multi-threaded way and recovers automatically if something goes wrong; an agent you write manually runs on a single machine, and the multi-threading also needs to be coded by hand.
Kafka connectors: Kafka is a message broker and can be used indirectly for migration. Kafka Connect has connectors that can read from and write to different databases: use the JDBC source connector to read from Postgres and the Cassandra sink connector to write to Cassandra. It is not that easy to set up, but it has the advantage that no coding is involved.
ETL systems: Some ETL systems have support for Cassandra, but I haven't personally tried any.
I saw some advantages in using Spark SQL and the Spark Cassandra Connector for the migration; some of them are:
The code was concise, hardly 40 lines.
Multi-machine (and multi-threaded on each machine).
Job progress and statistics in the Spark master UI.
Fault tolerance: if a Spark node goes down or a worker/thread fails, the job is automatically restarted on another node, which is good for very long-running jobs.
If you don't know Spark, writing an agent is fine for 4 GB of data.
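As a sketch of the first option (a Spark JDBC read in parallel chunks, then a bulk write with the Spark Cassandra Connector): the hosts, credentials, table names and bounds below are placeholders, and mapping the JDBC journal rows onto the exact table layout that akka-persistence-cassandra expects is a separate transformation step that is not shown here.

import org.apache.spark.sql.SparkSession

object PostgresToCassandra extends App {
  val spark = SparkSession.builder()
    .appName("journal-migration")
    // Cassandra contact point for the Spark Cassandra Connector (placeholder host).
    .config("spark.cassandra.connection.host", "cassandra-host")
    .getOrCreate()

  // Read the Postgres journal in parallel chunks: partitionColumn, lowerBound,
  // upperBound and numPartitions split the table so no single executor has to
  // load all 4 GB at once.
  val journal = spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://pg-host:5432/akka")
    .option("dbtable", "journal")
    .option("user", "akka")
    .option("password", "secret")
    .option("partitionColumn", "ordering")
    .option("lowerBound", "1")
    .option("upperBound", "10000000")
    .option("numPartitions", "32")
    .load()

  // Write with the Spark Cassandra Connector; the target table must already
  // exist with a schema matching the dataframe's columns.
  journal.write
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "akka", "table" -> "messages"))
    .mode("append")
    .save()

  spark.stop()
}

Partitioning on a monotonically increasing column such as the journal's ordering/sequence column is what gives both the chunking and the parallelism.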
I am trying to connect Apache Spark to MongoDB using Mesos. Here is my architecture:
MongoDB: MongoDB Cluster of 2 shards, 1 config server and 1 query server.
Mesos: 1 Mesos Master, 4 Mesos slaves
Now I have installed Spark on just one node. There is not much information available on this out there, so I just wanted to pose a few questions:
As I understand it, I can connect Spark to MongoDB via Mesos; in other words, I end up using MongoDB as the storage layer. Do I really need Hadoop? Is it mandatory to pull all the data into Hadoop just for Spark to read it?
Here is the reason I am asking. The Spark install expects the HADOOP_HOME variable to be set, which seems like very tight coupling. Most posts on the net talk about the MongoDB-Hadoop connector, which doesn't make sense if I'm forced to move everything into Hadoop.
Does anyone have an answer?
Regards
Mario
Spark itself takes a dependency on Hadoop and data in HDFS can be used as a datasource.
However, if you use the Mongo Spark Connector you can use MongoDB as a datasource for Spark without going via Hadoop at all.
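A minimal sketch of that, assuming Spark 2.x with the matching mongo-spark-connector (the question mentions Spark 1.6.x, where the connector API is driven through SQLContext instead); the URI, database and collection names are placeholders:

import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.SparkSession

object MongoWithoutHadoop extends App {
  // The URI points straight at the MongoDB query server; no HDFS or
  // MongoDB-Hadoop connector is involved.
  val spark = SparkSession.builder()
    .appName("mongo-direct")
    .config("spark.mongodb.input.uri", "mongodb://query-server:27017/mydb.mycollection")
    .getOrCreate()

  // Loads the collection directly from MongoDB as a DataFrame.
  val df = MongoSpark.load(spark)
  println(s"documents: ${df.count()}")

  spark.stop()
}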
The Spark-Mongo connector is a good idea; moreover, if you are executing Spark in a Hadoop cluster, you need to set HADOOP_HOME.
Check your requirements and test it (tutorial):
Basic working knowledge of MongoDB and Apache Spark. Refer to the MongoDB documentation and Spark documentation.
Running MongoDB instance (version 2.6 or later).
Spark 1.6.x.
Scala 2.10.x if using the mongo-spark-connector_2.10 package
Scala 2.11.x if using the mongo-spark-connector_2.11 package
The new MongoDB Connector for Apache Spark provides higher performance, greater ease of use, and access to more advanced Spark functionality than the MongoDB Connector for Hadoop.
Then you need to configure Spark with Mesos:
Connecting Spark to Mesos
To use Mesos from Spark, you need a Spark binary package available in a place accessible by Mesos, and a Spark driver program configured to connect to Mesos.
Alternatively, you can also install Spark in the same location in all the Mesos slaves, and configure spark.mesos.executor.home (defaults to SPARK_HOME) to point to that location.
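A minimal sketch of that configuration from the driver side, assuming a Mesos master at mesos-master:5050 and either a Spark package downloadable by the slaves or Spark installed at the same path on every slave (all locations below are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object SparkOnMesos extends App {
  val conf = new SparkConf()
    .setMaster("mesos://mesos-master:5050")
    .setAppName("mongo-on-mesos")
    // Option 1: a Spark binary package reachable by every Mesos slave.
    .set("spark.executor.uri", "http://fileserver/spark-1.6.3-bin-hadoop2.6.tgz")
    // Option 2 (instead of spark.executor.uri): Spark installed at the same
    // path on all slaves, pointed at via spark.mesos.executor.home.
    // .set("spark.mesos.executor.home", "/opt/spark")

  val sc = new SparkContext(conf)
  // ... build RDDs / DataFrames here, e.g. loading from MongoDB ...
  sc.stop()
}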
I want to access tables registered in Spark through a JDBC-like service, using the Thrift server provided by Spark.
I couldn't find any documentation for this on Google; can anyone please tell me how to use the Thrift server to access Spark tables?
Also, what will be the lifetime of these tables in memory? Will the tables reside in memory for as long as the Thrift server is running?
The Thrift server documentation is located in the Spark SQL reference page under Distributed SQL Engine (http://spark.apache.org/docs/latest/sql-programming-guide.html#distributed-sql-engine).
If you cache a query the cached result will stay in memory until the Thrift server is stopped.
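For example, once the Thrift server is running (by default on port 10000), the Spark tables can be reached over plain JDBC with the Hive JDBC driver; the host, credentials and table name below are placeholders:

import java.sql.DriverManager

object ThriftServerQuery extends App {
  // Requires the Hive JDBC driver (hive-jdbc) on the classpath.
  Class.forName("org.apache.hive.jdbc.HiveDriver")
  val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000", "user", "")

  val stmt = conn.createStatement()
  // CACHE TABLE keeps the table in the Thrift server's memory; it stays
  // cached until it is uncached or the server is stopped.
  stmt.execute("CACHE TABLE my_table")

  val rs = stmt.executeQuery("SELECT count(*) FROM my_table")
  while (rs.next()) println(rs.getLong(1))

  rs.close(); stmt.close(); conn.close()
}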