How to implement a variable similar to Spark's Accumulators in Apache Beam - apache-beam

I am currently using Apache Beam 2.29.0 on Spark. My pipeline consumes data from Kafka, for which I have a custom KafkaConsumer that Beam creates through a call to a ConsumerFactoryFn. I need to share a piece of persistent data among the custom Kafka consumers for the duration of the run. This would have been very simple in Spark: I would have created an Accumulator variable that all the executors, as well as the driver, have access to.
Since Beam is designed to run on multiple platforms (Spark, Flink, Google Dataflow), it does not provide this functionality. Does anybody know a way to implement this?

I believe Side Inputs should work. You can read about Side Inputs here. A side input is an additional input that your DoFn can access each time it processes an element in the input PCollection.
Here is an example of how to use it.
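A minimal sketch with the Beam Java SDK, for illustration only; the threshold value, element values, and transform names are invented, and the shared value is exposed as a singleton side input via View.asSingleton():

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

public class SideInputExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Shared, read-only value materialized as a singleton side input.
    // In a real pipeline this could be loaded from a file, a config store, etc.
    final PCollectionView<Integer> maxLength =
        p.apply("CreateMaxLength", Create.of(42)).apply(View.asSingleton());

    PCollection<String> words = p.apply("CreateWords", Create.of("a", "bb", "ccc"));

    words.apply("FilterBySideInput", ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        // Every worker sees the same side-input value here.
        int limit = c.sideInput(maxLength);
        if (c.element().length() <= limit) {
          c.output(c.element());
        }
      }
    }).withSideInputs(maxLength));

    p.run().waitUntilFinish();
  }
}

Note that a side input is read-only from inside the DoFn: all workers see the same materialized value, but they cannot update it during the run the way Spark executors add to an Accumulator.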

Related

Nifi and Spark Integration

I want to create a Spark session within a NiFi custom processor written in Scala. So far I can create my Spark session in a Scala project, but when I add it inside the onTrigger method of the NiFi custom processor, the Spark session is never created. Is there any way to achieve this? So far I have imported the spark-core and spark-sql libraries.
Any feedback is appreciated.
Not possible with a FlowFile. Period.
You need Kafka in between, feeding Spark Streaming or Spark Structured Streaming. Here is a good read, by the way: https://community.cloudera.com/t5/Community-Articles/Spark-Structured-Streaming-with-NiFi-and-Kafka-using-PySpark/ta-p/245068
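For illustration, a minimal Spark Structured Streaming sketch of the consuming side, assuming NiFi publishes its records to a hypothetical Kafka topic named nifi-events on a local broker:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class NifiKafkaSparkExample {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder()
        .appName("nifi-kafka-spark")
        .getOrCreate();

    // Read the stream that NiFi published into Kafka.
    Dataset<Row> events = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
        .option("subscribe", "nifi-events")                  // hypothetical topic name
        .load();

    // Kafka values arrive as bytes; cast them to strings before processing.
    StreamingQuery query = events.selectExpr("CAST(value AS STRING) AS value")
        .writeStream()
        .format("console")
        .start();

    query.awaitTermination();
  }
}

On the NiFi side, a processor such as PublishKafka would write the FlowFile content to that topic.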

Stream CDC change with Kafka and Spark still processes it in batches, whereas we wish to process each record

I'm still new to Spark and I want to learn more about it. I want to build a data pipeline architecture with Kafka and Spark. Here is my proposed architecture, where PostgreSQL provides the data for Kafka. The condition is that PostgreSQL is not empty, and I want to catch any CDC change in the database. In the end, I want to grab the Kafka messages and process them in a stream with Spark, so I can get analysis about what is happening at the moment the CDC events occur.
However, when I try to run a simple stream, it seems Spark receives the data as a stream but processes it in batches, which is not my goal. I have seen some articles where the source of data for this case came from an API we want to monitor, and there are few examples of database-to-database streaming processing. I have done this before with Kafka to another database, but now I need to transform and aggregate the data (I'm not using Confluent and rely on the generic Kafka + Debezium + JDBC connectors).
Given my case, can Spark and Kafka meet the requirement? Thank you.
I have designed such pipelines, and if you use Structured Streaming with Kafka, in continuous or non-continuous mode, you will always get a micro-batch. You can still process the individual records, so I am not sure what the issue is.
If you want to process record by record, then use the Spring Boot Kafka setup for consuming Kafka messages; that can work in various ways and fulfill your need. Spring Boot offers various modes of consumption.
Of course, Spark Structured Streaming can be done using Scala and has a lot of support, obviating extra work elsewhere.
https://medium.com/@contactsunny/simple-apache-kafka-producer-and-consumer-using-spring-boot-41be672f4e2b This article discusses the single-message processing approach.
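For reference, a minimal sketch of that per-record consumption with Spring Boot and spring-kafka; the topic and group id below are invented (Debezium topic names usually follow a server.schema.table pattern):

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@SpringBootApplication
public class CdcConsumerApplication {
  public static void main(String[] args) {
    SpringApplication.run(CdcConsumerApplication.class, args);
  }
}

@Component
class CdcRecordListener {
  // Each Debezium change event is delivered here as an individual Kafka message,
  // so the handling is genuinely per record rather than per micro-batch.
  @KafkaListener(topics = "dbserver1.public.orders", groupId = "cdc-processor") // hypothetical names
  public void onChangeEvent(String changeEvent) {
    // Parse the CDC payload and transform/aggregate it record by record.
    System.out.println("Received change event: " + changeEvent);
  }
}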

Beam / Cloud Dataflow: How to Add Kafka (or PubSub) topics to Running Stream

(How) is it possible to dynamically add or remove topics (Kafka or PubSub) as a source or sink of a running pipeline? Or to have a dynamic pattern as a sink, like it is possible with BigQuery table names?
Some background: we have different topics, one per customer, to better facilitate downstream aggregations, and we also clean up/add them on the fly. Kafka is used to be able to backfill calculations over periods longer than is possible with PubSub.
The options I have in mind right now are either extending KafkaIO to support this, or updating the pipeline each time a topic is added or removed (meaning there will be some lag in the stream while it's updated). Or maybe I have the wrong design pattern in my head and there are other solutions for this.
You are correct that right now the easiest solution is updating the pipeline.
However, a new API called Splittable DoFn (SDF) is currently in active development; it is already available in the Cloud Dataflow runner in streaming mode and in the Direct runner, and implementation is in progress in Flink and Apex runners.
It makes it possible to do things like "create a PCollection of Kafka topic names and read each of those topics", so you can have one pipeline stage produce names of topics to be read (e.g. the names themselves could arrive over Kafka or Pubsub every time a customer is added, or you could write an SDF to watch the result of a database query returning a list of customers and emit new ones), and another stage reading those topics.
See http://s.apache.org/splittable-do-fn for the design doc of the API, and http://s.apache.org/textio-sdf for an example proposed refactoring of TextIO using this API - you may want to try to modify KafkaIO yourself in a similar fashion.

Multiple Streams support in Apache Flink Job

My question is regarding the Apache Flink framework.
Is there any way to support more than one streaming source, like Kafka and Twitter, in a single Flink job? Is there any workaround? Can we process more than one streaming source at a time in a single Flink job?
I am currently working with Spark Streaming and this is a limitation there.
Is this achievable with other streaming frameworks like Apache Samza, Storm or NiFi?
Response is much awaited.
Yes, this is possible in Flink and Storm (no clue about Samza or NiFi...)
You can add as many source operators as you want and each can consume from a different source.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
Properties properties = ... // Kafka consumer configuration; see the Flink webpage for more details
// Source 1: a Kafka topic
DataStream<String> stream1 = env.addSource(new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties));
// Source 2: a text file
DataStream<String> stream2 = env.readTextFile("/tmp/myFile.txt");
// Union both sources into a single stream
DataStream<String> allStreams = stream1.union(stream2);
For Storm, the pattern is similar when using the low-level API. See "An Apache Storm bolt receive multiple input tuples from different spout/bolt".
Some solutions have already been covered; I just want to add that in a NiFi flow you can ingest many different sources and process them either separately or together.
It is also possible to ingest a source once and have multiple teams build flows on it, without needing to ingest the data multiple times.

Flume or Kafka's equivalent for MongoDB

In the Hadoop world, Flume or Kafka is used to stream or collect data and store it in Hadoop. I am just wondering: does MongoDB have similar mechanisms or tools to achieve the same?
MongoDB is just the database layer, not a complete solution like the Hadoop ecosystem. I actually use Kafka along with Storm to store data in MongoDB in cases where there is a very large flow of incoming data that needs to be processed and stored.
Although Flume is frequently used and treated as a member of the Hadoop ecosystem, it's not impossible to use it with other sources/sinks. MongoDB is no exception. In fact, Flume is flexible enough to be extended to create your own custom sources/sinks. See this project, for example: it is a custom Flume MongoDB sink.