Integrating an Apache Spark cluster with multiple databases - MongoDB

My team uses different databases, say MongoDB and Cassandra.
I need to know whether it is possible to integrate a single Spark cluster with both MongoDB and Cassandra clusters.
Or, in other words, is it possible to create DataFrames from MongoDB and Cassandra in the same Spark application?

Spark only sees DataFrames and RDDs; it doesn't really matter which database you're using, as long as a connector exists. You can make as many external connections as needed within a single SparkContext.
Any data source that's read into those formats can be combined.
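For illustration, a minimal sketch of what this can look like, assuming the MongoDB Spark connector and the Spark Cassandra connector are both on the classpath (the URIs, keyspace, collection, and join key below are placeholders, and the exact option keys vary between connector versions):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("multi-db-example")
  // Placeholder connection settings for both clusters.
  .config("spark.mongodb.read.connection.uri", "mongodb://mongo-host:27017/mydb.users")
  .config("spark.cassandra.connection.host", "cassandra-host")
  .getOrCreate()
// DataFrame backed by a MongoDB collection
// (older connector versions use the "mongo" format and spark.mongodb.input.uri).
val mongoDf = spark.read.format("mongodb").load()
// DataFrame backed by a Cassandra table
val cassandraDf = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "mykeyspace", "table" -> "orders"))
  .load()
// Both are just DataFrames now, so they can be joined, unioned, etc.
val joined = mongoDf.join(cassandraDf, Seq("user_id"))
joined.show()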

Related

Reading Kinesis streams continuously and doing CRUD in Postgres or any DBMS tables

I am looking for a way to read Kinesis structured streaming in Databricks. I am able to use a Spark cluster to continuously read streaming data. However, I now need a way to take those timely records and do CRUD against Postgres or any other DBMS tables. Is there a way I can do this solely via Databricks?
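One common way to approach this (a sketch, not a definitive answer; the kinesis source below is the one shipped with Databricks runtimes, and the stream name, region, and connection details are placeholders) is to use foreachBatch and write each micro-batch to Postgres over JDBC:
import org.apache.spark.sql.{DataFrame, SparkSession}
val spark = SparkSession.builder().appName("kinesis-to-postgres").getOrCreate()
val kinesisDf = spark.readStream
  .format("kinesis")                       // Databricks-provided Kinesis source
  .option("streamName", "my-stream")
  .option("region", "us-east-1")
  .option("initialPosition", "latest")
  .load()
val query = kinesisDf.writeStream
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    // Each micro-batch is a normal DataFrame, so any batch sink works here.
    batchDf.write
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/mydb")  // placeholder
      .option("dbtable", "public.events")
      .option("user", "spark")
      .option("password", "secret")
      .mode("append")
      .save()
  }
  .option("checkpointLocation", "/tmp/checkpoints/kinesis-to-postgres")
  .start()
query.awaitTermination()
Note that a plain JDBC append only covers inserts; genuine updates or deletes would need hand-written SQL, for example via a JDBC connection opened inside the same foreachBatch callback.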

Why is Spark not utilizing the MongoDB replica set?

I have set up the same environment mentioned here, with the exception of keeping a two-node MongoDB replica set:
https://www.mongodb.com/blog/post/getting-started-with-mongodb-pyspark-and-jupyter-notebook
But when I do a simple count(*) on a collection of a million records, I don't see Spark utilizing both MongoDB nodes; I can see it going to the primary only.
I thought that Spark would utilize both nodes?
What could I have missed here?
Thanks
After reading the docs, I found out that DataFrame partitions can't be distributed between replica set members. However, each partition can be processed by a different Spark worker.
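If the goal is to let the secondary serve reads as well, one thing worth trying (a sketch only; option keys vary by connector version, and reads from secondaries can be slightly stale) is setting a read preference such as secondaryPreferred on the connection string:
import org.apache.spark.sql.SparkSession
// readPreference is a standard MongoDB connection-string option; placeholder hosts/collection.
val uri = "mongodb://mongo1:27017,mongo2:27017/mydb.events?readPreference=secondaryPreferred"
val spark = SparkSession.builder()
  .appName("mongo-read-preference")
  // Connector 10.x key; 2.x/3.x connectors use spark.mongodb.input.uri instead.
  .config("spark.mongodb.read.connection.uri", uri)
  .getOrCreate()
val df = spark.read.format("mongodb").load()   // "mongo" format on older connector versions
// Partitions are still processed in parallel by Spark workers, whichever member serves the read.
println(s"partitions = ${df.rdd.getNumPartitions}, count = ${df.count()}")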

Use cases for using multiple queries in Spark Structured Streaming

I have a requirement to stream from multiple Kafka topics [Avro based] and put them into Greenplum with a small modification to the payload.
The Kafka topics are defined as a list in a configuration file, and each Kafka topic will have one target table.
I am looking for a single Spark Structured Streaming application, where an update to the configuration file starts or stops listening to a topic.
I am looking for help as I am confused about using a single query vs. multiple queries:
val query1 = df.writeStream.start()
val query2 = df.writeStream.start()
spark.streams.awaitAnyTermination()
or
df.writeStream.start().awaitTermination()
Under which use cases should multiple queries be used over a single query?
Apparently, you can use a regex pattern for consuming data from different Kafka topics.
Let's say you have topic names like "topic-ingestion1" and "topic-ingestion2" - you can then use a regex pattern such as "topic-ingestion.*" to consume data from all matching topics, as in the sketch below.
Once a new topic gets created that matches your regex pattern, Spark will automatically start streaming data from the newly created topic.
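With Spark's Kafka source this is done through the subscribePattern option; a minimal sketch with a placeholder broker address:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("kafka-pattern-ingestion").getOrCreate()
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")      // placeholder
  .option("subscribePattern", "topic-ingestion.*")        // matches topic-ingestion1, topic-ingestion2, ...
  .option("startingOffsets", "latest")
  .load()
// The source exposes the originating topic as a column, so records can later be routed per topic.
val parsed = df.selectExpr("topic", "CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")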
Reference:
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#consumer-caching
You can use the "spark.kafka.consumer.cache.timeout" parameter to specify your consumer cache timeout. From the Spark documentation:
spark.kafka.consumer.cache.timeout - The minimum amount of time a consumer may sit idle in the pool before it is eligible for eviction by the evictor.
Let's say you have multiple sinks, where you are reading from Kafka and writing into two different locations like HDFS and HBase - then you can branch your application logic into two writeStreams.
If the sink (Greenplum) supports batch mode of operation, then you can look at the foreachBatch() function from Spark Structured Streaming. It allows you to reuse the same batchDF for both operations, as in the sketch below.
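A rough sketch of both ideas together, assuming Greenplum is reachable through a plain JDBC URL (broker, connection details, table names, and paths are placeholders):
import org.apache.spark.sql.{DataFrame, SparkSession}
val spark = SparkSession.builder().appName("kafka-to-greenplum-and-hdfs").getOrCreate()
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribePattern", "topic-ingestion.*")
  .load()
val query = df.writeStream
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    batchDf.persist()   // reuse the same micro-batch for both writes
    // Sink 1: Greenplum over JDBC (placeholder connection details)
    batchDf.selectExpr("CAST(value AS STRING) AS payload").write
      .format("jdbc")
      .option("url", "jdbc:postgresql://greenplum-host:5432/analytics")
      .option("dbtable", "public.ingested_events")
      .option("user", "spark")
      .option("password", "secret")
      .mode("append")
      .save()
    // Sink 2: an HDFS location, just to show a second sink sharing the same batch
    batchDf.selectExpr("topic", "CAST(value AS STRING) AS payload")
      .write.mode("append").parquet("hdfs:///data/archive/ingested_events")
    batchDf.unpersist()
  }
  .option("checkpointLocation", "/tmp/checkpoints/multi-sink")
  .start()
query.awaitTermination()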

What is stored in Kafka KSQL DB?

What kind of data is stored in ksqlDB? Only metadata about KStreams and KTables?
From what I understand, I can create KStreams and KTables and operations between them using either the Java API or the KSQL language. But what is the ksqlDB server actually used for, besides metadata about these objects when I create them using the KSQL client?
I also assume that ksqlDB actually runs the stream processors, so it is also an execution engine. Does it scale automatically? Can there be multiple instances of the ksqlDB server component that communicate with each other? Is it intended for scenarios that need massive scaling, or is it just some syntactic sugar suitable for people who don't like to write Java code?
EDITED
It is explained in this video: https://www.youtube.com/embed/f3wV8W_zjwE
It does not scale automatically, but you can manually deploy multiple instances of the ksqlDB server and have them join the same ksqlDB cluster, identified by ksql.service.id.
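For example (a sketch; the property name in current ksqlDB releases is ksql.service.id, and the hosts and ids below are placeholders), each server that should join the cluster points at the same Kafka cluster and shares the same service id:
# ksql-server-1.properties and ksql-server-2.properties (placeholder values)
bootstrap.servers=broker1:9092
# identical on every server that should join this cluster
ksql.service.id=my_ksql_cluster_
listeners=http://0.0.0.0:8088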
What kind of data is stored in ksqlDB?
Nothing is really stored "in" it. All the metadata of the query history and running processors is stored in Kafka topics and in on-disk KTables.
I also assume that ksqlDB actually runs the stream processors, so it is also an execution engine. Does it scale automatically? Can there be multiple instances of the ksqlDB server component that communicate with each other? Is it intended for scenarios that need massive scaling, or is it just some syntactic sugar suitable for people who don't like to write Java code?
Yes to all of the above.

Flume or Kafka's equivalent for MongoDB

In the Hadoop world, Flume or Kafka is used to stream or collect data and store it in Hadoop. I am just wondering whether MongoDB has some similar mechanisms or tools to achieve the same?
MongoDB is just the database layer, not a complete solution like the Hadoop ecosystem. I actually use Kafka along with Storm to store data in MongoDB in cases where there is a very large flow of incoming data that needs to be processed and stored.
Although Flume is frequently used and treated as a member of the Hadoop ecosystem, it's not impossible to use it with other sources/sinks, and MongoDB is no exception. In fact, Flume is flexible enough to be extended to create your own custom sources/sinks. See this project, for example; it is a custom Flume-to-Mongo sink.
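For a flavour of what the collection leg can look like without the Storm processing step, here is a minimal sketch of a plain Kafka consumer writing into MongoDB (topic, URI, and field names are placeholders):
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.StringDeserializer
import com.mongodb.client.MongoClients
import org.bson.Document
object KafkaToMongo {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")           // placeholder
    props.put("group.id", "mongo-writer")
    props.put("key.deserializer", classOf[StringDeserializer].getName)
    props.put("value.deserializer", classOf[StringDeserializer].getName)
    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("events"))  // placeholder topic
    val collection = MongoClients.create("mongodb://mongo-host:27017")
      .getDatabase("mydb")
      .getCollection("events")
    // Poll Kafka and insert each record's value as a document.
    while (true) {
      val records = consumer.poll(Duration.ofSeconds(1))
      records.forEach { record =>
        collection.insertOne(new Document("payload", record.value()))
      }
    }
  }
}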