I have set up the same environment described here, except that I am running a two-member MongoDB replica set:
https://www.mongodb.com/blog/post/getting-started-with-mongodb-pyspark-and-jupyter-notebook
But when I run a simple count(*) on a collection of a million records, I don't see Spark utilizing both MongoDB nodes; I can see the reads going to the primary only.
I thought that Spark would utilize both nodes?
What could I have missed here?
Thanks
After reading the docs, I found out that DataFrame partitions can't be distributed between replica set members. However, each partition can be processed by a different Spark worker.
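Related to that, if the goal is just to get reads off the primary, here is a minimal sketch of setting the connector's read preference. It assumes the MongoDB Spark Connector 2.x/3.x option names (spark.mongodb.input.readPreference.name); the 10.x connector uses spark.mongodb.read.* keys and format("mongodb") instead, and the URI, database, and collection below are placeholders:

import org.apache.spark.sql.SparkSession

// Point connector reads at secondaries where possible instead of the primary.
// Option names follow MongoDB Spark Connector 2.x/3.x (spark.mongodb.input.*).
val spark = SparkSession.builder()
  .appName("mongo-read-preference")
  .config("spark.mongodb.input.uri", "mongodb://host1:27017,host2:27017/mydb.mycollection")  // placeholder
  .config("spark.mongodb.input.readPreference.name", "secondaryPreferred")
  .getOrCreate()

val df = spark.read.format("mongo").load()
println(df.count())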
I'm trying to solve the problem of data denormalization before indexing into Elasticsearch. Right now, my Postgres 11 database is configured with the pgoutput plugin, and Debezium with the PostgreSQL connector streams the log changes to RabbitMQ; the changes are then aggregated by doing a reverse lookup on the database and fed into Elasticsearch.
Although this works okay, the lookup at the app layer to aggregate the data is expensive and takes a lot of execution time (the query is already refined, but it has about 10 joins, which makes it slow).
The other alternative I explored was to use Kafka Streams (KStreams) for data aggregation. My knowledge of Apache Kafka is minimal, hence this question: is it a requirement to have Apache Kafka as the broker in order to use the Java Kafka Streams API, or can it be used with any broker, such as RabbitMQ? I'm unsure about this because all the articles talk about Kafka topics and key-value pairs, which are specific to Apache Kafka.
If there is a better way to solve the data denormalization problem, I'm open to it too.
Thanks
Kafka Streams works only with Kafka. You're more than welcome to use Kafka Streams between Debezium and the process that consumes any topic (the Postgres connector that writes to RabbitMQ?).
You can use Spark, Flink, or Beam for stream processing on other message queues, but Debezium requires Kafka, so start with tools built around that.
Spark, for example, has an Elasticsearch writer library; not sure about the others.
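For what it's worth, here is a minimal sketch of such an aggregation in the Kafka Streams Scala DSL, assuming the Debezium change events land in Kafka topics; the topic names (orders, customers, orders-enriched), the string serdes, and the concatenation joiner are placeholders for illustration:

import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.serialization.Serdes._

// Join a change-event stream against a lookup table inside Kafka, so the
// expensive reverse lookup no longer happens at the app layer.
val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "denormalizer")       // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")  // placeholder

val builder = new StreamsBuilder()
val orders    = builder.stream[String, String]("orders")     // change events (placeholder topic)
val customers = builder.table[String, String]("customers")   // lookup data as a KTable (placeholder topic)

orders
  .join(customers)((order, customer) => s"$order|$customer")  // build whatever document your index needs
  .to("orders-enriched")                                      // consume this topic and index into Elasticsearch

val streams = new KafkaStreams(builder.build(), props)
streams.start()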
I'm trying to load data from a Kafka topic into Postgres using the JDBC sink connector. How do I know how many records have been loaded into Postgres so far? As of now I keep checking the number of records in the database with a SQL query. Is there any other way to find out?
Kafka Connect doesn't track this. I see nothing wrong with a SELECT COUNT(*) on the table; however, that doesn't exclude records written to the table by other processes.
It is not possible in Kafka itself, because once the records have been sunk into the target DB, Kafka has already done its job. But you can track the number of records you are writing from the collections of sink records the connector processes, and write the count to a local file or into a Kafka state store.
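If you do stick with polling the database, a minimal sketch of that check over JDBC (the connection details and table name are placeholders):

import java.sql.DriverManager

// Poll Postgres for the current row count of the sink table.
val conn = DriverManager.getConnection(
  "jdbc:postgresql://localhost:5432/mydb", "user", "password")  // placeholders
try {
  val rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM target_table")  // placeholder table
  rs.next()
  println(s"Rows loaded so far: ${rs.getLong(1)}")
} finally {
  conn.close()
}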
I have a requirement of streaming from multiple Kafka topics [Avro-based] and writing them to Greenplum with a small modification to the payload.
The Kafka topics are defined as a list in a configuration file, and each Kafka topic will have one target table.
I am looking for a single Spark Structured Streaming application in which an update to the configuration file starts listening to new topics or stops listening to a topic.
I am looking for help as I am confused about using a single query vs. multiple queries:
val query1 = df.writeStream.start()
val query2 = df.writeStream.start()
spark.streams.awaitAnyTermination()
or
df.writeStream.start().awaitTermination()
Under which use cases should multiple queries be used over a single query?
Apparently, you can use a regex pattern to consume data from multiple Kafka topics.
Let's say you have topic names like "topic-ingestion1" and "topic-ingestion2"; then you can create a regex pattern such as "topic-ingestion.*" to consume data from all topics matching it.
Once a new topic matching your regex pattern gets created, Spark will automatically start streaming data from the newly created topic.
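A minimal sketch of the pattern subscription, using the subscribePattern option from the Kafka integration guide; the broker address and pattern are placeholders:

import org.apache.spark.sql.SparkSession

// Subscribe to every topic that matches a regex instead of a fixed list.
val spark = SparkSession.builder().appName("multi-topic-ingestion").getOrCreate()

val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder
  .option("subscribePattern", "topic-ingestion.*")      // matches topic-ingestion1, topic-ingestion2, ...
  .option("startingOffsets", "latest")
  .load()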
Reference:
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#consumer-caching
You can use the parameter "spark.kafka.consumer.cache.timeout" to specify your consumer cache timeout.
From the Spark documentation: spark.kafka.consumer.cache.timeout - the minimum amount of time a consumer may sit idle in the pool before it is eligible for eviction by the evictor.
Let's say you have multiple sinks: you read from Kafka and write to two different locations, such as HDFS and HBase. Then you can branch your application logic into two writeStreams.
If the sink (Greenplum) supports batch-mode operations, you can look at the foreachBatch() function from Spark Structured Streaming. It allows you to reuse the same batchDF for both operations, as sketched below.
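A minimal sketch of the foreachBatch branching, assuming the streaming DataFrame kafkaDf from the earlier sketch; writeToGreenplum and writeToSecondSink are hypothetical helpers standing in for your actual Greenplum/JDBC and second-sink writes:

import org.apache.spark.sql.DataFrame

// Hypothetical sink writers; replace with your actual batch writes (e.g. JDBC to Greenplum).
def writeToGreenplum(df: DataFrame): Unit = ()
def writeToSecondSink(df: DataFrame): Unit = ()

// Process each micro-batch once and reuse it for both sinks.
val writeBothSinks: (DataFrame, Long) => Unit = (batchDF, _) => {
  batchDF.persist()          // cache so both writes reuse the same micro-batch
  writeToGreenplum(batchDF)
  writeToSecondSink(batchDF)
  batchDF.unpersist()
}

val query = kafkaDf.writeStream
  .foreachBatch(writeBothSinks)
  .start()

query.awaitTermination()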
My team uses different databases, say MongoDB and Cassandra.
I need to know if it is possible to integrate a single Spark cluster with both MongoDB and Cassandra clusters.
Or, in other words, is it possible to create DataFrames from MongoDB and Cassandra in the same Spark application?
Spark only sees DataFrames and RDDs. It doesn't really matter which database you're using, as long as a connector exists. You can make as many external connections as needed within a single SparkContext.
Any data source that's read into those formats can be combined.
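A minimal sketch of doing both in one application; it assumes the MongoDB Spark Connector 10.x (format "mongodb", spark.mongodb.read.* options) and the Spark Cassandra Connector, and all hosts, database/keyspace, table, and join-key names are placeholders:

import org.apache.spark.sql.SparkSession

// One SparkSession, two connectors, two DataFrames that can then be joined or combined.
val spark = SparkSession.builder()
  .appName("mongo-and-cassandra")
  .config("spark.mongodb.read.connection.uri", "mongodb://mongo-host:27017")  // placeholder
  .config("spark.mongodb.read.database", "mydb")                              // placeholder
  .config("spark.mongodb.read.collection", "users")                           // placeholder
  .config("spark.cassandra.connection.host", "cassandra-host")                // placeholder
  .getOrCreate()

val mongoDf = spark.read.format("mongodb").load()   // MongoDB Spark Connector 10.x short name

val cassandraDf = spark.read
  .format("org.apache.spark.sql.cassandra")                       // Spark Cassandra Connector
  .options(Map("keyspace" -> "mykeyspace", "table" -> "events"))  // placeholders
  .load()

val joined = mongoDf.join(cassandraDf, Seq("user_id"))   // hypothetical join key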
We have a MongoDB database which keeps getting data from different sources. I want to keep pushing this data to Kafka as a producer in real time so that I can use the Spark-Kafka integration for my analytics. Let me know if anyone has done anything on this or if there is any probable solution to it. Flume does not support MongoDB as a source, and Sqoop is for RDBMSs.
You can use Kafka Connect for that:
https://www.confluent.io/product/connectors/
As per the above, there are at least two source connectors for MongoDB available:
https://github.com/DataReply/kafka-connect-mongodb
https://github.com/teambition/kafka-connect-mongo