Multiple Streams support in Apache Flink Job - frameworks

My question is regarding the Apache Flink framework.
Is there any way to support more than one streaming source, such as Kafka and Twitter, in a single Flink job? Is there any workaround? Can we process more than one streaming source at a time in a single Flink job?
I am currently working with Spark Streaming and this is a limitation there.
Is this achievable with other streaming frameworks like Apache Samza, Storm, or NiFi?
A response is much awaited.

Yes, this is possible in Flink and Storm (no clue about Samza or NiFi...)
You can add as many source operators as you want and each can consume from a different source.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
Properties properties = ... // see Flink webpage for more details
DataStream<String> stream1 = env.addSource(new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties));
DataStream<String> stream2 = env.readTextFile("/tmp/myFile.txt");
DataStream<String> allStreams = stream1.union(stream2);
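For the Kafka-plus-Twitter case asked about above, a minimal Scala sketch could look like the following. It assumes the flink-connector-kafka-0.8 and flink-connector-twitter dependencies; the broker addresses, topic name, and Twitter credentials are placeholders.
import java.util.Properties
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08
import org.apache.flink.streaming.connectors.twitter.TwitterSource
import org.apache.flink.streaming.util.serialization.SimpleStringSchema

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Source 1: Kafka (consumer properties are placeholders)
val kafkaProps = new Properties()
kafkaProps.setProperty("bootstrap.servers", "localhost:9092")
kafkaProps.setProperty("zookeeper.connect", "localhost:2181")
kafkaProps.setProperty("group.id", "flink-demo")
val kafkaStream: DataStream[String] =
  env.addSource(new FlinkKafkaConsumer08[String]("topic", new SimpleStringSchema(), kafkaProps))

// Source 2: Twitter via flink-connector-twitter (credentials are placeholders)
val twitterProps = new Properties()
twitterProps.setProperty(TwitterSource.CONSUMER_KEY, "<key>")
twitterProps.setProperty(TwitterSource.CONSUMER_SECRET, "<secret>")
twitterProps.setProperty(TwitterSource.TOKEN, "<token>")
twitterProps.setProperty(TwitterSource.TOKEN_SECRET, "<token-secret>")
val twitterStream: DataStream[String] = env.addSource(new TwitterSource(twitterProps))

// Both sources run in the same job; union them for joint downstream processing
kafkaStream.union(twitterStream).print()

env.execute("kafka-plus-twitter")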
For Storm using the low-level API, the pattern is similar; a single bolt can subscribe to multiple spouts. See "An Apache Storm bolt receive multiple input tuples from different spout/bolt".

Some solutions have already been covered; I just want to add that in a NiFi flow you can ingest many different sources and process them either separately or together.
It is also possible to ingest a source, and have multiple teams build flows on this without needing to ingest the data multiple times.

Related

Use cases for using multiple queries for Spark Structured Streaming

I have a requirement to stream from multiple Kafka topics [Avro based] and put them into Greenplum with a small modification to the payload.
The Kafka topics are defined as a list in a configuration file, and each Kafka topic will have one target table.
I am looking for a single Spark Structured Streaming application, with an update to the configuration file being enough to start listening to new topics or stop listening to a topic.
I am looking for help as I am confused about using a single query vs multiple queries:
val query1 = df.writeStream.start()
val query2 = df.writeStream.start()
spark.streams.awaitAnyTermination()
or
df.writeStream.start().awaitTermination()
Under which use cases should multiple queries be used over a single query?
Apparently, you can use a regex pattern for consuming the data from different Kafka topics.
Let's say you have topic names like "topic-ingestion1" and "topic-ingestion2" - then you can create a regex pattern such as "topic-ingestion.*" for consuming data from all matching topics.
Once a new topic gets created that matches your regex pattern, Spark will automatically start streaming data from the newly created topic.
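A minimal Scala sketch of this, assuming the spark-sql-kafka-0-10 source and its subscribePattern option; the broker address and the pattern are placeholders:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("multi-topic-ingestion").getOrCreate()

// subscribePattern subscribes to every topic matching the regex; newly
// created matching topics are picked up automatically.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribePattern", "topic-ingestion.*")
  .load()
  .selectExpr("topic", "CAST(value AS STRING) AS value")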
Reference:
[https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#consumer-caching]
You can use the "spark.kafka.consumer.cache.timeout" parameter to specify your cache timeout.
From the Spark documentation:
spark.kafka.consumer.cache.timeout - The minimum amount of time a consumer may sit idle in the pool before it is eligible for eviction by the evictor.
Let's say you have multiple sinks: you read from Kafka and write to two different locations, such as HDFS and HBase. Then you can branch out your application logic into two writeStreams, as sketched below.
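A minimal Scala sketch of the two-branch approach; the broker address, topic, paths, and checkpoint locations are placeholders, and a console sink stands in for the HBase writer to keep the example self-contained:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("two-sinks").getOrCreate()

// Read once from Kafka...
val source = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .load()

// ...branch 1: raw records to HDFS as Parquet
val toHdfs = source.writeStream
  .format("parquet")
  .option("path", "hdfs:///data/events")
  .option("checkpointLocation", "hdfs:///chk/events-hdfs")
  .start()

// ...branch 2: stand-in for the HBase (or other) writer
val toConsole = source.selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .format("console")
  .start()

spark.streams.awaitAnyTermination()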
If the sink (Greenplum) supports batch mode of operations, then you can look at the foreachBatch() function from Spark Structured Streaming. It allows you to reuse the same batchDF for both operations; a sketch follows after the reference below.
Reference:
[https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#consumer-caching]
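A minimal Scala sketch of foreachBatch (available since Spark 2.4), assuming the Kafka source DataFrame named source from the sketch above; the JDBC URL, table, credentials, and output path are placeholders:
import org.apache.spark.sql.DataFrame

// Each micro-batch is handed to ordinary batch code, so the same batchDF
// can be written to two sinks (e.g. Greenplum via JDBC and a Parquet copy).
def writeBatch(batchDF: DataFrame, batchId: Long): Unit = {
  batchDF.persist()
  batchDF.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://greenplum-host:5432/db")
    .option("dbtable", "public.events")
    .option("user", "gpadmin")
    .option("password", "<password>")
    .mode("append")
    .save()
  batchDF.write.mode("append").parquet("hdfs:///data/events-copy")
  batchDF.unpersist()
}

val query = source.selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .foreachBatch(writeBatch _)
  .option("checkpointLocation", "hdfs:///chk/events-foreachbatch")
  .start()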

Stream CDC change with Kafka and Spark still processes it in batches, whereas we wish to process each record

I'm still new to Spark and I want to learn more about it. I want to build a data pipeline architecture with Kafka and Spark. Here is my proposed architecture, where PostgreSQL provides the data for Kafka. The condition is that the PostgreSQL database is not empty, and I want to catch any CDC change in the database. In the end, I want to grab the Kafka message and process it as a stream with Spark, so I can get analysis of what is happening at the same time the CDC event happens.
However, when I try to run a simple stream, it seems Spark receives the data as a stream but processes it in batches, which is not my goal. I have seen some articles where the source of data for this case comes from an API that we want to monitor, and there are limited cases for database-to-database streaming processing. I have done the process before from Kafka to another database, but I need to transform and aggregate the data (I'm not using Confluent and rely on generic Kafka + Debezium + JDBC connectors).
Given my case, can Spark and Kafka meet the requirement? Thank you.
I have designed such pipelines, and if you use Structured Streaming with Kafka in continuous or non-continuous mode, you will always get a micro-batch. You can still process the individual records, so I am not sure what the issue is.
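To illustrate the record-level point within Structured Streaming itself, here is a minimal Scala sketch using the foreach sink; the broker, topic, and checkpoint path are placeholders, and the println stands in for whatever per-record handling you need:
import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}

val spark = SparkSession.builder.appName("per-record").getOrCreate()

val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "cdc-events")
  .load()
  .selectExpr("CAST(value AS STRING) AS value")

// Each row of every micro-batch is handed to process() one at a time.
class RecordHandler extends ForeachWriter[Row] {
  override def open(partitionId: Long, epochId: Long): Boolean = true    // e.g. open a connection
  override def process(record: Row): Unit = println(record.getString(0)) // handle one record
  override def close(errorOrNull: Throwable): Unit = ()                  // e.g. close the connection
}

val query = kafkaDf.writeStream
  .foreach(new RecordHandler)
  .option("checkpointLocation", "/tmp/chk/per-record")
  .start()

query.awaitTermination()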
If you want to process per record, then use the Spring Boot Kafka setup for consumption of Kafka messages; that can work in various ways and fulfill your need. Spring Boot offers various modes of consumption.
Of course Spark Structured Streaming can be done using Scala and has a lot of support, obviating extra work elsewhere.
https://medium.com/@contactsunny/simple-apache-kafka-producer-and-consumer-using-spring-boot-41be672f4e2b This article discusses the single-message processing approach.

Kafka streams vs Kafka connect for Kafka HBase ETL pipeline

I have a straightforward scenario for the ETL job: take data from a Kafka topic and put it into an HBase table. In the future I'm going to add support for some logic after reading data from a topic.
I am considering two scenarios:
use Kafka Streams for reading data from a topic and writing each record via the native HBase driver
use a Kafka -> HBase connector
I have the following concerns about my options:
Is it a good idea to write data each time it arrives in a Kafka Streams window? I suspect it will degrade performance.
The Kafka HBase connector is supported only by a third-party developer; I'm not sure about the code quality of this solution or about the option to add custom aggregation logic over data from a topic.
I myself have been trying to search for ETL options for Kafka to HBase; however, so far my research tells me that it's not a good idea to have external system interaction within a Kafka Streams application (check the answers here and here). Kafka Streams is super powerful and great if you have a Kafka -> transform message -> Kafka kind of use case, and eventually you can have Kafka Connect take your data from a Kafka topic and write it to a sink.
Since you do not want to use the third-party Kafka Connect connector for HBase, one option is to write something yourself using the Connect API; the other option is to use the Kafka consumer/producer API and write the app the traditional way: poll the messages, write to the sink, commit the batch, and move on, as sketched below.
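A minimal Scala sketch of that consumer/producer-API approach, assuming the plain Kafka consumer and HBase client libraries; broker, group, topic, table, and column family names are placeholders, and the row key is derived from the record coordinates to keep the example self-contained:
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092")
props.put("group.id", "hbase-writer")
props.put("enable.auto.commit", "false")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("events"))

val hbase = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = hbase.getTable(TableName.valueOf("events_table"))

while (true) {
  val records = consumer.poll(500)  // poll(Duration) in newer client versions
  records.asScala.foreach { r =>
    val rowKey = s"${r.topic()}-${r.partition()}-${r.offset()}"
    val put = new Put(Bytes.toBytes(rowKey))
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"), Bytes.toBytes(r.value()))
    table.put(put)
  }
  consumer.commitSync()  // commit only after the batch is in HBase (at-least-once)
}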

How to use Kafka consumer in spark

I am using Spark 2.1 and Kafka 0.10.1.
I want to process the data by reading the entire data of specific topics in Kafka on a daily basis.
For spark streaming, I know that createDirectStream only needs to include a list of topics and some configuration information as arguments.
However, I realized that createRDD would have to include all of the topic, partitions, and offset information.
I want to make batch processing as convenient as streaming in Spark.
Is it possible?
I suggest you read this post from Cloudera.
This example shows you how to get the data from Kafka exactly once, by persisting the offsets in Postgres and relying on its ACID properties.
So I hope that will solve your problem.
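A minimal Scala sketch of the batch-style read with createRDD from spark-streaming-kafka-0-10; the broker, topic, and offset values are placeholders, and in practice the offsets would come from wherever you persist them (e.g. the Postgres table from the Cloudera post):
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.kafka010.{KafkaUtils, LocationStrategies, OffsetRange}
import scala.collection.JavaConverters._

val spark = SparkSession.builder.appName("daily-kafka-batch").getOrCreate()
val sc = spark.sparkContext

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "daily-batch"
)

// One OffsetRange per topic-partition: (topic, partition, fromOffset, untilOffset)
val offsetRanges = Array(
  OffsetRange("my-topic", 0, 0L, 1000L),
  OffsetRange("my-topic", 1, 0L, 1000L)
)

val rdd = KafkaUtils.createRDD[String, String](
  sc, kafkaParams.asJava, offsetRanges, LocationStrategies.PreferConsistent)

// ConsumerRecord itself is not serializable, so extract what you need first
rdd.map(_.value()).take(10).foreach(println)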

Kafka -> Flink DataStream -> MongoDB

I want to setup Flink so it would transform and redirect the data streams from Apache Kafka to MongoDB. For testing purposes I'm building on top of flink-streaming-connectors.kafka example (https://github.com/apache/flink).
Kafka streams are being properly read by Flink, I can map them etc., but the problem occurs when I want to save each received and transformed message to MongoDB. The only example I've found about MongoDB integration is flink-mongodb-test from GitHub. Unfortunately it uses a static data source (database), not the Data Stream.
I believe there should be some DataStream.addSink implementation for MongoDB, but apparently there's not.
What would be the best way to achieve it? Do I need to write a custom sink function, or am I maybe missing something? Maybe it should be done in a different way?
I'm not tied to any solution, so any suggestion would be appreciated.
Below there's an example of what exactly I'm getting as an input and what I need to store as an output.
Apache Kafka Broker <-------------- "AAABBBCCCDDD" (String)
Apache Kafka Broker --------------> Flink: DataStream<String>
Flink: DataStream.map({
return ("AAABBBCCCDDD").convertTo("A: AAA; B: BBB; C: CCC; D: DDD")
})
.rebalance()
.addSink(MongoDBSinkFunction); // store the row in MongoDB collection
As you can see in this example I'm using Flink mostly for Kafka's message stream buffering and some basic parsing.
As an alternative to Robert Metzger's answer, you can write your results back to Kafka and then use one of the maintained Kafka connectors to drop the content of a topic into your MongoDB database.
Kafka -> Flink -> Kafka -> Mongo/Anything
With this approach you can maintain the "at-least-once semantics" behaviour; a sketch of the Flink-to-Kafka leg follows.
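A minimal Scala sketch of that Flink-to-Kafka leg, assuming the 0.8 Kafka connector to match the example above (newer connectors have analogous producers); the broker address, topic name, and the parsing step are placeholders:
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer08
import org.apache.flink.streaming.util.serialization.SimpleStringSchema

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Stand-in for the Kafka source and the parsing shown in the question
val transformed: DataStream[String] = env
  .fromElements("AAABBBCCCDDD")
  .map(s => s"A: ${s.slice(0, 3)}; B: ${s.slice(3, 6)}; C: ${s.slice(6, 9)}; D: ${s.slice(9, 12)}")

// Write the transformed records to a staging topic; a Kafka -> MongoDB
// connector then owns the final write into the collection.
transformed.addSink(
  new FlinkKafkaProducer08[String]("broker1:9092", "mongo-staging", new SimpleStringSchema()))

env.execute("flink-to-kafka-staging")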
There is currently no Streaming MongoDB sink available in Flink.
However, there are two ways for writing data into MongoDB:
Use the DataStream.write() call of Flink. It allows you to use any OutputFormat (from the Batch API) with streaming. Using the HadoopOutputFormatWrapper of Flink, you can use the official MongoDB Hadoop connector
Implement the Sink yourself. Implementing sinks is quite easy with the Streaming API, and I'm sure MongoDB has a good Java client library; a minimal sketch follows at the end of this answer.
Both approaches do not provide any sophisticated processing guarantees. However, when you're using Flink with Kafka (and checkpointing enabled) you'll have at-least-once semantics: In an error case, the data is streamed again to the MongoDB sink.
If you're doing idempotent updates, redoing these updates shouldn't cause any inconsistencies.
If you really need exactly-once semantics for MongoDB, you should probably file a JIRA in Flink and discuss with the community how to implement this.
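For option 2, a minimal Scala sketch of a hand-rolled sink, assuming the mongo-java-driver; the host, database, and collection names are placeholders, and it offers no guarantees beyond what the surrounding job (checkpointing plus replay) already provides:
import com.mongodb.MongoClient
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction
import org.bson.Document

// One Mongo client per parallel sink instance; one document per record.
class MongoSink(host: String, db: String, coll: String) extends RichSinkFunction[String] {

  @transient private var client: MongoClient = _

  override def open(parameters: Configuration): Unit = {
    client = new MongoClient(host)
  }

  override def invoke(value: String): Unit = {
    client.getDatabase(db).getCollection(coll).insertOne(new Document("raw", value))
  }

  override def close(): Unit = {
    if (client != null) client.close()
  }
}

// usage: stream.addSink(new MongoSink("localhost", "flinkdb", "messages"))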