Kafka to MongoDB using Spring Cloud Dataflow

I'm working on a project where I have to consume data coming from a Kafka cluster, process it, and send it to MongoDB. The application should be deployable on Pivotal Cloud Foundry. After doing some research on the internet, I found the Spring Cloud Data Flow toolkit interesting, since it can be deployed on PCF. I'm wondering how we can use it to create our real-time streaming pipeline. For the moment, I'm thinking about using Kafka Streams and Spring Cloud Stream to process and transform the streams of topics, but I don't know how to integrate that into SCDF, nor how we can send those streams to MongoDB. I'm sorry if my question is not clear; I'm entirely new to these frameworks.
Thanks in advance

You could use the named-destination support in SCDF to directly consume events from Kafka or any other message broker implementation supported by Spring Cloud Stream.
Now, for the write portion, you can use the out-of-the-box MongoDB-sink application that we build, maintain, and ship.
If you have to do some processing before you write to MongoDB, you can create a custom Spring Cloud Stream application with the desired binder implementation [see: dev-guide/docs].
To put this all together, assume you have events coming from a Kafka topic named Customers, a custom processor doing some transformation on each of the received payloads (let's call the processor CustomerTransformer), and finally the MongoDB sink doing the write.
Here's a take on this streaming data pipeline use case as designed from SCDF's Dashboard.
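For illustration, a minimal sketch of what such a custom processor could look like with Spring Cloud Stream's functional programming model (the class and function names here are hypothetical, and the Kafka binder is assumed to be on the classpath). Once registered in SCDF, the pipeline could then be expressed with the stream DSL roughly as :Customers > CustomerTransformer | mongodb.

    // Hypothetical CustomerTransformer: a Spring Cloud Stream processor using the
    // functional programming model; the Kafka binder is picked up from the classpath.
    import java.util.function.Function;

    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.SpringBootApplication;
    import org.springframework.context.annotation.Bean;

    @SpringBootApplication
    public class CustomerTransformerApplication {

        public static void main(String[] args) {
            SpringApplication.run(CustomerTransformerApplication.class, args);
        }

        // Each payload received from the Customers topic is transformed and passed
        // downstream; the output is what the MongoDB sink will persist.
        @Bean
        public Function<String, String> transform() {
            return payload -> payload.toUpperCase(); // placeholder transformation
        }
    }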

Related

How to operate a Kafka cluster and a streaming application 24/7 on budget?

I want to stream financial data (trades, order book) from an exchange websocket endpoint and store that data somewhere to build up my own data history for backtesting purposes. Furthermore, I might want to analyze the data in real time.
I found the idea of an event-driven system very interesting, so I ended up building my own Dockerized Confluent Kafka cluster (with the Avro Schema Registry) and a Python producer that sends the streaming data into a Kafka topic. Then I set up a Faust app to stream-process the data and store it as a new topic in Kafka.
It's working fine on my laptop, but now I'm wondering how I could put this into production. Obviously I cannot do it on my laptop, because I need this application to run 24/7 without interruptions.
When I look at fully managed Kafka cloud solutions like Confluent, I find them quite expensive, especially as I'm not running a business; it's rather a private hobby project. And maybe I don't even need that kind of highly scalable, professional service.
What could be a cost efficient approach for me to get my streaming and storage application to work?
Is there another Kafka cloud solution more reduced to my needs?
Should I set up my own server? Maybe a Raspberry Pi?
Or should I use a different approach?
I'm sorry if my problem description is not very specific; it's a reflection of me being overwhelmed with all these system architecture questions and cloud services.
Any advice and recommendations are appreciated!

Kafka Messages Processing

I am using Kafka as a distributed messaging system for message processing in a Spring Boot application. My application produces messages, on an event basis, to three different topics. There is a separate Spring Boot application that will be used by a data analysis team to analyze the data. This application is a simple report-type application with only one filter: the topic.
Now I have to implement this, but I am a little bit confused about how to show the data in the UI. I have written listeners (consumers) that consume the messages, but how do I show the data in the UI in real time? Should I store it in some database like Redis and then show that data in the UI? Is this the correct way to deal with consumers in Kafka? Won't it be slow, as the messages can grow drastically over time?
In a nutshell, I want to know how we can show messages in a UI efficiently and in real time.
Thanks
You can write a consumer that forwards to a WebSocket.
Or you can use Kafka Connect to write to a database, then write a REST API on top of it.
Or use the Kafka Streams Interactive Queries feature and add an RPC layer on top for JavaScript to call.
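As a rough sketch of the first option, assuming a Spring Boot application with spring-kafka and WebSocket/STOMP messaging configured (the topic and destination names below are made up):

    // Sketch: consume from Kafka and push each record to browsers over STOMP/WebSocket.
    // Requires an @EnableWebSocketMessageBroker configuration elsewhere in the app.
    import org.springframework.kafka.annotation.KafkaListener;
    import org.springframework.messaging.simp.SimpMessagingTemplate;
    import org.springframework.stereotype.Component;

    @Component
    public class ReportForwarder {

        private final SimpMessagingTemplate template;

        public ReportForwarder(SimpMessagingTemplate template) {
            this.template = template;
        }

        // Hypothetical topic name; each record is forwarded as it arrives.
        @KafkaListener(topics = "reports", groupId = "report-ui")
        public void onMessage(String message) {
            template.convertAndSend("/topic/reports", message);
        }
    }

The UI then subscribes to /topic/reports and renders records as they arrive; a store like Redis is only needed if you also want history or late-joining clients to catch up.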

Spring Cloud Stream vs Kafka Streams for Exactly Once feature

I was not able to find, here or on the Spring website and blogs, whether Spring Cloud Stream is able to provide the "Exactly Once" semantics offered by the Kafka Streams API.
Maybe there is no single configuration/annotation for it; in the thread "Is it possible to get exactly once processing with Spring Cloud Stream?" I can find something useful, but the answer from the expert is very high level.
Thanks for help
Spring Cloud Stream does not do anything particular regarding processing guarantees. You can delegate that to Kafka Streams by providing the processing.guarantee property and setting it to exactly_once. See this for more details. When using the Spring Cloud Stream Kafka Streams binder, you can provide this as a property to the Spring Boot application as below.
spring.cloud.stream.kafka.streams.binder.configuration.processing.guarantee=exactly_once
Keep in mind that Kafka Streams' exactly-once guarantee only works if you are writing the results back to Kafka.
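Outside of Spring, the same knob is plain Kafka Streams configuration. A minimal sketch of a standalone application with exactly-once enabled (topic names are placeholders; EXACTLY_ONCE_V2 needs a recent broker/client, older versions use StreamsConfig.EXACTLY_ONCE instead):

    // Sketch: enabling exactly-once processing on a plain Kafka Streams application.
    import java.util.Properties;

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;

    public class ExactlyOnceExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "eos-demo");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            // The guarantee covers Kafka-to-Kafka flows only, as noted above.
            props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);

            StreamsBuilder builder = new StreamsBuilder();
            builder.stream("input-topic").to("output-topic");

            new KafkaStreams(builder.build(), props).start();
        }
    }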

Kafka user - project design advice

I am new to Kafka and data streaming and need some advice for the following requirement.
Our system is expecting close to 1 million incoming messages per day. Each message carries a project identifier, and a message should be pushed only to users of that project. For our case, let's say we have projects A, B, and C. Users who open project A's dashboard only see/receive messages of project A.
This is my idea so far on implementing a solution for the requirement:
The messages should be pushed to a Kafka topic as they arrive; let's call this the Root Topic. Once pushed to the Root Topic, the messages can be read by a Kafka consumer/listener, which, based on the project identifier in the message, can push each message to a project-specific topic. So any message can end up in topic A, B, or C. I'm thinking of using WebSockets to push the messages to the project users' dashboards as they arrive. There will be N consumers/listeners for the N project topics, and these consumers will push the project-specific messages to the project-specific WebSocket endpoints.
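A minimal sketch of that fan-out step with Kafka Streams, assuming the project identifier is carried in the record key and the per-project topics already exist (all topic names here are made up):

    // Sketch: route each record from the root topic to a project-specific topic,
    // using the project identifier carried in the record key.
    import java.util.Properties;

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class ProjectRouter {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "project-router");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> root = builder.stream("root-topic");

            // A TopicNameExtractor picks the destination topic per record, so adding
            // a new project only requires creating a new topic, not new code.
            root.to((key, value, recordContext) -> "project-" + key);

            new KafkaStreams(builder.build(), props).start();
        }
    }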
Please advise if I can make any improvements to the above design.
I chose Kafka as the messaging system here because it is highly scalable and fault tolerant.
There is no complex transformation or data enrichment before the data gets sent to the client. Does it make sense to use Apache Flink or Hazelcast Jet for the streaming, or is Kafka Streams good enough for this simple requirement?
Also, when should I consider using Hazelcast Jet or Apache Flink in my project?
Should I use Flink, say, when I have to update a few properties in the message based on a web service call or database lookup before sending it to the users?
Should I use Hazelcast Jet only when I need the entire dataset in memory to arrive at a property value? Or would using Jet bring some benefits even for my simple use case specified above? Please advise.
Kafka Streams is a great tool for converting one Kafka topic to another Kafka topic.
What you need is a tool to move data from a Kafka topic to another system via WebSockets.
A stream processor gives you convenient tooling to build this data pipeline (among other things, connectors to Kafka and WebSockets, plus a scalable, fault-tolerant execution environment). So you might want to use a stream processor even if you don't transform the data.
The benefit of Hazelcast Jet is its embedded, scalable caching layer. You might want to cache your database/web service calls so that the enrichment is performed locally, reducing remote service calls.
See how to use Jet to read from Kafka and how to write data to a TCP socket (not a WebSocket).
I would like to give you another option. I'm not a Spark/Jet expert at all, but I've been studying them for a few weeks.
I would use Pentaho Data Integration (Kettle) to consume from Kafka, and I would write a Kettle step (or a User Defined Java Class step) to write the messages to a Hazelcast IMap.
Then I would use this approach http://www.c2b2.co.uk/middleware-blog/hazelcast-websockets.php to provide the WebSockets for the end users.
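For the Hazelcast part, a minimal sketch of loading consumed records into an IMap in plain Java (Hazelcast 5 package names; the map name, topic, and key scheme are made up):

    // Sketch: a plain Kafka consumer that copies each record into a Hazelcast IMap,
    // from which a WebSocket layer (as in the linked article) can serve dashboards.
    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.map.IMap;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class MapLoader {
        public static void main(String[] args) {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();
            IMap<String, String> messages = hz.getMap("project-messages");

            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "map-loader");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("project-a"));
                while (true) {
                    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                        // Keyed by topic and offset just for illustration; a real key
                        // scheme depends on how the dashboard queries the map.
                        messages.put(record.topic() + "-" + record.offset(), record.value());
                    }
                }
            }
        }
    }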

Kafka Connect or Kafka Client

I need to fetch messages from Kafka topics and notify other systems via HTTP-based APIs. That is, get a message from a topic, map it to the third-party APIs, and invoke them. I intend to write a Kafka Sink Connector for this.
For this use case, is Kafka Connect the right choice, or should I go with the Kafka client?
Use Kafka clients when you have full control over your code, you are an expert developer, you want to connect an application to Kafka, and you can modify the code of the application to:
push data into Kafka
pull data from Kafka.
https://cwiki.apache.org/confluence/display/KAFKA/Clients
Use Kafka Connect when you don't have control over third-party code, you are new to Kafka, or you have to connect Kafka to datastores whose code you can't modify.
Kafka Connect’s scope is narrow: it focuses only on copying streaming data to and from Kafka and does not handle other tasks.
http://docs.confluent.io/2.0.0/connect/
I am adding a few lines from other blogs to explain the differences:
Companies that want to adopt Kafka write a bunch of code to publish their data streams. What we’ve learned from experience is that doing this correctly is more involved than it seems. In particular, there are a set of problems that every connector has to solve:
• Schema management: The ability of the data pipeline to carry schema information where it is available. In the absence of this capability, you end up having to recreate it downstream. Furthermore, if there are multiple consumers for the same data, then each consumer has to recreate it. We will cover the various nuances of schema management for data pipelines in a future blog post.
• Fault tolerance: Run several instances of a process and be resilient to failures
• Parallelism: Horizontally scale to handle large scale datasets
• Latency: Ingest, transport and process data in real-time, thereby moving away from once-a-day data dumps.
• Delivery semantics: Provide strong guarantees when machines fail or processes crash
• Operations and monitoring: Monitor the health and progress of every data integration process in a consistent manner
These are really hard problems in their own right, and it just isn't feasible to solve them separately in each connector. Instead, you want a single infrastructure platform that connectors can build on, one that solves these problems in a consistent way.
Until recently, adopting Kafka for data integration required significant developer expertise; developing a Kafka connector required building on the client APIs.
https://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines/
Kafka Connect will work well for this purpose, but this would also be a pretty straightforward consumer application, because consumers also get the benefits of fault tolerance/scalability, and in this case you're probably just doing simple message-at-a-time processing within each consumer instance. You can also easily use enable.auto.commit for this application, so you will not encounter the tricky parts of using the consumer directly. The main thing Kafka Connect would give you over the consumer in this case is that the connector could be made generic for different input formats, but that may not be important to you for a custom connector.
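For comparison, a minimal sketch of that plain-consumer route with auto-commit (the topic name and endpoint URL below are placeholders):

    // Sketch: message-at-a-time consumer that maps each record to an HTTP call.
    // enable.auto.commit keeps offset management simple, at the cost of weaker
    // delivery guarantees around crashes.
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class HttpNotifier {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "http-notifier");
            props.put("enable.auto.commit", "true");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            HttpClient http = HttpClient.newHttpClient();
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("events"));
                while (true) {
                    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                        // Map the record to the third-party API payload and invoke it.
                        HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com/api/notify"))
                                .header("Content-Type", "application/json")
                                .POST(HttpRequest.BodyPublishers.ofString(record.value()))
                                .build();
                        http.send(request, HttpResponse.BodyHandlers.discarding());
                    }
                }
            }
        }
    }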
You should use a Kafka Connect sink when you are using a Kafka Connect source to produce messages to a specific topic.
For example, when you are using a file source, you should use a file sink to consume what the source has produced; when you are using a JDBC source, you should use a JDBC sink to consume what you have produced.
Because the schema of the producer and of the sink consumer should be compatible, you should use a compatible source and sink on both sides.
If in some cases the schemas are not compatible, you can use the SMT (Single Message Transform) capability, added in Kafka 0.10.2 onward, to write message transformers that move messages between incompatible producers and consumers.
Note: if you want to transfer messages faster, I suggest that you use Avro and the Schema Registry to transfer messages more efficiently.
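To illustrate the Avro suggestion, a minimal sketch of a producer wired to Confluent Schema Registry (the registry URL, topic, and schema are placeholders, and the kafka-avro-serializer dependency is assumed to be on the classpath):

    // Sketch: producer configured for Avro with Confluent Schema Registry.
    // Consumers and sinks then deserialize via the registered schema instead of raw bytes.
    import java.util.Properties;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class AvroProducerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
            props.put("schema.registry.url", "http://localhost:8081");

            // Placeholder record schema with a single string field.
            Schema schema = new Schema.Parser().parse(
                    "{\"type\":\"record\",\"name\":\"Event\",\"fields\":[{\"name\":\"id\",\"type\":\"string\"}]}");
            GenericRecord event = new GenericData.Record(schema);
            event.put("id", "42");

            try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("events", "42", event));
            }
        }
    }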
If you can code in Java, you can use the Kafka Streams API, the Spring for Apache Kafka project, or a stream processor to achieve what you want.
In the book Kafka in Action it is explained as follows:
The purpose of Kafka Connect is to help move data in or out of Kafka without having to deal with writing our own producers and clients. Connect is a framework that is already part of Kafka and that really can make it simple to use pieces that have already been built to start your streaming journey.
As for your problem: firstly, one of the simplest questions to ask is whether you can modify the application code of the systems with which you need to exchange data.
Secondly, if you would write a custom connector with the required in-depth knowledge and ability, and this connector will be used by others, it is worth it, because it may help others who may not be experts in those systems. Otherwise, if this Kafka connector would only be used by yourself, I think you should skip the connector and use the client directly, so you get more flexibility and an easier implementation.