How can I push data that matches between topic 1 and topic 2 into another topic 3 when sending messages from producer to consumer?
I have not worked with Spark, but I can give you some direction from an Apache Storm perspective:
Build a topology with two Kafka spouts, one consuming from topic1 and one from topic2.
Consume this data in a bolt and compare it. You may use a single bolt or a series of successive bolts. You may need some persistence, e.g. MongoDB, or something such as Redis or Memcached, depending on your comparison logic.
Push the common data to a new Kafka topic. You can send data to Kafka from Storm using the Kafka bolt.
This is a very Storm-specific solution; it may not be the most ideal, suitable, or efficient one, but it is aimed at giving you the general idea.
Here is a link to the basic concepts in Storm: Storm Concepts
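To make those steps concrete, here is a minimal sketch of such a topology using the storm-kafka-client spout and bolt. The topic names and broker address are placeholders, and the comparison state is kept in in-memory maps purely for illustration; a real deployment would use Redis/MongoDB as suggested above:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.kafka.bolt.KafkaBolt;
import org.apache.storm.kafka.bolt.mapper.FieldNameBasedTupleToKafkaMapper;
import org.apache.storm.kafka.bolt.selector.DefaultTopicSelector;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class MatchingTopology {

    /** Emits a record once the same key has been seen on both input topics. */
    public static class MatchBolt extends BaseBasicBolt {
        private final Map<String, String> seenOnTopic1 = new HashMap<>();
        private final Map<String, String> seenOnTopic2 = new HashMap<>();

        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            // The storm-kafka-client spout's default translator emits
            // "topic", "partition", "offset", "key", "value" fields.
            String topic = input.getStringByField("topic");
            String key = input.getStringByField("key");
            String value = input.getStringByField("value");
            Map<String, String> mine = "topic1".equals(topic) ? seenOnTopic1 : seenOnTopic2;
            Map<String, String> other = "topic1".equals(topic) ? seenOnTopic2 : seenOnTopic1;
            mine.put(key, value);
            if (other.containsKey(key)) {
                // "Matching" here simply means the key exists on both topics;
                // plug in your own comparison logic (and Redis/MongoDB) here.
                collector.emit(new Values(key, value));
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Field names expected by the default FieldNameBasedTupleToKafkaMapper.
            declarer.declare(new Fields("key", "message"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        builder.setSpout("spout1", new KafkaSpout<>(
                KafkaSpoutConfig.builder("localhost:9092", "topic1").build()));
        builder.setSpout("spout2", new KafkaSpout<>(
                KafkaSpoutConfig.builder("localhost:9092", "topic2").build()));

        // Group by key so the same key always reaches the same MatchBolt task.
        builder.setBolt("match", new MatchBolt())
                .fieldsGrouping("spout1", new Fields("key"))
                .fieldsGrouping("spout2", new Fields("key"));

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        builder.setBolt("kafka-out", new KafkaBolt<String, String>()
                        .withProducerProperties(producerProps)
                        .withTopicSelector(new DefaultTopicSelector("topic3"))
                        .withTupleToKafkaMapper(new FieldNameBasedTupleToKafkaMapper<>()))
                .shuffleGrouping("match");

        StormSubmitter.submitTopology("matching-topology", new Config(), builder.createTopology());
    }
}
```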
I've been working with Spark for over six months now, and yes, it is absolutely possible. To be honest, it is fairly simple. But bringing in Spark is a bit of overkill for this problem. What about Kafka Streams? I have never worked with it, but shouldn't it solve exactly this problem?
If you want to use Spark:
Use the Spark Kafka integration (I used spark-streaming-kafka-0-10) to consume and to produce the data; it should be very simple. Then look at the Spark Streaming API in the documentation.
A simple join of the two DStreams should solve the problem. If you want to keep data that doesn't match yet, you can window it or use the updateStateByKey function. I hope it helps someone. Good luck :)
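A minimal Java sketch of that approach, assuming String keys and values, spark-streaming-kafka-0-10, topics named topic1/topic2/topic3, and a broker on localhost:9092. Note the join is per micro-batch, so keys arriving in different batches need the windowing or updateStateByKey handling mentioned above:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

import scala.Tuple2;

public class TopicJoinJob {

    // Creates a (key, value) pair stream for one topic; assumes producers set String keys.
    private static JavaPairDStream<String, String> pairStream(
            JavaStreamingContext jssc, Map<String, Object> baseParams, String topic) {
        Map<String, Object> params = new HashMap<>(baseParams);
        params.put("group.id", "topic-join-" + topic); // separate consumer group per stream
        JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(Collections.singletonList(topic), params));
        return stream.mapToPair(r -> new Tuple2<>(r.key(), r.value()));
    }

    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("topic-join").setMaster("local[*]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("auto.offset.reset", "latest");

        JavaPairDStream<String, String> t1 = pairStream(jssc, kafkaParams, "topic1");
        JavaPairDStream<String, String> t2 = pairStream(jssc, kafkaParams, "topic2");

        // Inner join per micro-batch: only keys present in BOTH topics in the same
        // batch survive. Keys that arrive in different batches need windowing or
        // updateStateByKey/mapWithState, as noted above.
        JavaPairDStream<String, Tuple2<String, String>> matched = t1.join(t2);

        // Write each matched pair back to a third topic, one producer per partition.
        matched.foreachRDD(rdd -> rdd.foreachPartition(records -> {
            Properties p = new Properties();
            p.put("bootstrap.servers", "localhost:9092");
            p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
                records.forEachRemaining(r -> producer.send(
                        new ProducerRecord<>("topic3", r._1, r._2._1 + "|" + r._2._2)));
            }
        }));

        jssc.start();
        jssc.awaitTermination();
    }
}
```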
Related
I am trying out the sample Kafka Streams code from Chapter 4 of the book Kafka Streams in Action. I pretty much copied the code from GitHub - https://github.com/bbejeck/kafka-streams-in-action/blob/master/src/main/java/bbejeck/chapter_4/ZMartKafkaStreamsAddStateApp.java This is an example using a StateStore. When I run the code as is, no data flows through the topology. I verified that mock data is being generated, as I can see the offsets in the input topic, transactions, go up. However, nothing appears in the output topics, and nothing is printed to the console.
However, when I comment out lines 81-88 (https://github.com/bbejeck/kafka-streams-in-action/blob/master/src/main/java/bbejeck/chapter_4/ZMartKafkaStreamsAddStateApp.java#L81-L88), basically avoiding the creation of the "through()" processor node, the code works. I see data being generated to the "patterns" topics, and output printed to the console.
I am using Kafka broker and client version 2.4. I would appreciate any help or pointers to debug the issue.
Thank you,
Ahmed.
It is well documented that you need to create the intermediate topic you use via through() manually, upfront, before you start your application. Intermediate topics, similar to input and output topics, are not managed by Kafka Streams; it is the user's responsibility to manage them.
Cf: https://docs.confluent.io/current/streams/developer-guide/manage-topics.html
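For example, the intermediate topic can be created with the kafka-topics CLI or programmatically via the AdminClient before starting the Streams application. A minimal sketch, where "customer_transactions" is only a placeholder for whatever topic name the through() call on lines 81-88 actually uses, and where the partition/replication counts are set for a local single-broker test:

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateIntermediateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Partition count should match the upstream topic so the key-based
            // repartitioning behaves as expected; 1 partition / replication 1
            // is only suitable for a local test.
            NewTopic intermediate = new NewTopic("customer_transactions", 1, (short) 1);
            admin.createTopics(Collections.singletonList(intermediate)).all().get();
        }
    }
}
```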
Btw: there is work in progress to add a new repartition() operator that allows you to repartition via a topic that will be managed by Kafka Streams (cf. https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+DSL+with+Connecting+Topic+Creation+and+Repartition+Hint)
I'm new to Kafka and I'd like to know if what I'm planning is possible and reasonable to implement.
Suppose we have two sources, s1 and s2 that emit some messages to topics t1 and t2 respectively. Now, I'd like to have a sink which listens to both topics and I'd like it to process tuples of messages <m1, m2> where m1.key == m2.key.
If m1.key was never found in some message of s2, then the sink completely ignores m1.key (will never process it).
In summary, the sink will work only on keys that s1 and s2 worked on.
A traditional and maybe naive solution would be to have some sort of cache or storage and to process an item only when both of the messages are in the cache.
I'd like to know if Kafka offers a solution to this problem.
Most modern stream processing engines, such as Apache Flink, Kafka Streams, or Spark Streaming, can solve this problem for you. All three have battle-tested Kafka consumers built for use cases like this.
Even within those frameworks, there are multiple different ways to achieve a streaming join like the above.
In Flink for example, one could use the Table API which has a SQL-like syntax.
What I have used in the past looks a bit like the example in this SO answer (you can just replace fromElements with a Kafka Source).
One thing to keep in mind when working with streams is that you do NOT have any ordering guarantees when consuming data from two Kafka topics t1 and t2. Your code needs to account for messages arriving in any order.
Edit - Just realised your question was probably about how you can implement the join using Kafka Streams as opposed to a stream of data from Kafka. In this case you will probably find relevant info here
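For completeness, here is a minimal sketch of such a join in the Kafka Streams DSL, assuming String keys and values, input topics t1 and t2 (which must have the same number of partitions), a placeholder output topic name, and an assumed 5-minute join window. Records whose key appears on only one side within the window are simply dropped, which matches the requirement above:

```java
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class KeyJoinApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "t1-t2-join");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> s1 = builder.stream("t1");
        KStream<String, String> s2 = builder.stream("t2");

        // Inner join: a <m1, m2> pair is produced only when both topics carry a
        // record with the same key within the join window; keys seen on only one
        // side are dropped, which is exactly the behaviour asked for above.
        KStream<String, String> matched = s1.join(
                s2,
                (m1, m2) -> m1 + "|" + m2,                // combine the two payloads
                JoinWindows.of(Duration.ofMinutes(5)));   // how far apart m1 and m2 may arrive

        matched.to("t1-t2-matched", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

If only the latest value per key matters rather than every message, a KTable-KTable join would avoid the window entirely.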
I am using Kafka 0.10 and Flume 1.8. I am trying to find information on the points below but could not, so can anybody please help me?
Is there any way to send events to a particular Kafka topic partition?
And if so, can we read such events (coming to a specific partition) with Flume using the Hive sink?
I'm not sure I understand your motivation... I'm pretty sure you can create a Kafka topic with a single partition if you wish to.
By doing this, you would know which partition and topic you were reading from. It is also possible to have multiple sources in Flume, so if you wish for a single service to read from multiple topics, but for each topic to only have a single partition, you can easily do this.
Apologies, I would have written this as a comment, as it should really be a comment, but I don't yet have that privilege on Stack Overflow. Anyway, I hope this helps.
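On the first question: the plain Kafka producer API also lets you target an explicit partition by passing the partition number to the ProducerRecord constructor. A minimal sketch, where the topic name "events", partition number 2, and the payload are just placeholders:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PartitionedSend {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // ProducerRecord(topic, partition, key, value): the second argument pins
            // this record to partition 2 of "events", bypassing the default
            // key-hash partitioner.
            producer.send(new ProducerRecord<>("events", 2, "sensor-42", "{\"temp\": 21.5}"));
        }
    }
}
```

Alternatively, with the default partitioner all records sharing the same key land in the same partition anyway, which achieves the same routing without hard-coding partition numbers.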
I really want to get an architectural solution for the scenario below.
I have a source of events (say sensors in oil wells, around 50,000) that produce events to a server. At the server side I want to process all these events in such a way that the information from the sensors about the latest humidity, temperature, pressure, etc. is stored/updated in a database.
I am confused between Flume and Kafka.
Can somebody please address my simple scenario in architectural terms?
I don't want to store the events anywhere else, since I am already updating the database with the latest values.
Do I really need Spark, i.e. (Flume/Kafka) + Spark, to cover the processing side?
Can we do any kind of processing using Flume without a sink?
Sounds like you need to use the Kafka producer API to publish the events to a topic, then simply read those events either with the Kafka consumer API (writing to your database yourself) or with the Kafka JDBC sink connector.
Also if you need just the latest data inside Kafka take a look at log compaction.
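As a rough illustration of the consumer-API option, here is a minimal sketch that upserts the latest reading per sensor into a relational table. The connection URL, table name, and ON CONFLICT upsert syntax (PostgreSQL) are assumptions; the JDBC sink connector would replace all of this with configuration:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SensorSink {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "sensor-db-writer");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             Connection db = DriverManager.getConnection("jdbc:postgresql://localhost/sensors", "user", "pass")) {

            consumer.subscribe(Collections.singletonList("sensor-readings"));
            // Upsert keeps only the latest reading per sensor, mirroring the
            // "update, don't accumulate" requirement above.
            PreparedStatement upsert = db.prepareStatement(
                    "INSERT INTO latest_readings (sensor_id, payload) VALUES (?, ?) " +
                    "ON CONFLICT (sensor_id) DO UPDATE SET payload = EXCLUDED.payload");

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    upsert.setString(1, record.key());   // sensor id as the message key
                    upsert.setString(2, record.value()); // JSON payload with humidity/temp/pressure
                    upsert.executeUpdate();
                }
            }
        }
    }
}
```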
One way would be to push all the messages to a Kafka topic. Using Spark Streaming you can then ingest and process the data directly from that Kafka topic.
I know Storm doesn't give a total ordering guarantee for Kafka topics, but I see in many documents that Storm guarantees consuming/processing messages in order at the partition level.
I am looking for a sample Storm topology that consumes/processes the messages of a Kafka topic while maintaining the order of messages at the Kafka partition level. NOT total order!! ONLY a partition-level ordering guarantee.
Please share if you know of any sample application. Thanks a lot!!
Have you looked at Apache Storm examples here? https://github.com/apache/storm/tree/master/external/storm-kafka
You may want to consider the standard example and scale it based on your needs. Also, while defining the Scheme for the KafkaSpout, you may want to output some key as part of the tuple and later use FieldsGrouping.
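Not a full sample application, but here is a minimal sketch of that idea using the newer storm-kafka-client spout rather than the storm-kafka module linked above. Its default translator already emits the Kafka partition as a tuple field, so a fieldsGrouping on that field routes every record of a given partition to the same bolt task, preserving per-partition order within that task (replays of failed tuples still need idempotent handling). The broker address, topic name, and parallelism numbers are placeholders:

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;

public class PartitionOrderedTopology {

    /** Terminal bolt: because of the fieldsGrouping on "partition", each task
     *  receives whole Kafka partitions, so it sees records in partition order. */
    public static class OrderedBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            int partition = input.getIntegerByField("partition");
            long offset = input.getLongByField("offset");
            String value = input.getStringByField("value");
            System.out.printf("partition=%d offset=%d value=%s%n", partition, offset, value);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // nothing emitted downstream
        }
    }

    public static void main(String[] args) throws Exception {
        KafkaSpoutConfig<String, String> spoutConfig =
                KafkaSpoutConfig.builder("localhost:9092", "my-topic").build();

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout<>(spoutConfig), 2);
        // Route by the Kafka partition field emitted by the spout's default
        // translator, so one bolt task handles one partition's stream.
        builder.setBolt("ordered-bolt", new OrderedBolt(), 4)
                .fieldsGrouping("kafka-spout", new Fields("partition"));

        StormSubmitter.submitTopology("partition-ordered", new Config(), builder.createTopology());
    }
}
```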