Implement custom partitioner in Kafka .NET - apache-kafka

I want to use Kafka from a .NET application. I have created a producer and a consumer using .NET, and they are working fine.
But I have to use Kafka for real-time processing of files, so I am a little confused. What architecture should I follow, and how should I design the Kafka application so that it is fault tolerant?
I have multiple IDs for files, and files can arrive at any time and in any number. So I want to use a hashing algorithm to distribute the load across the Kafka partitions.
I have googled many times but could not find an application or sample showing how to do this in a .NET application.
So please help me design and implement the desired algorithm using .NET.
Basically, we have endpoints (web services) that are given to clients, and clients push files at regular intervals. We have a Windows service that regularly reads files from the database and parses them for further processing. All applications are in .NET. We want to put the files in Kafka, and the consumer (the Windows service) will read the files from Kafka. The producer will be our endpoints.
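A minimal sketch of the keyed-producer approach with the Confluent.Kafka client follows; the broker address, topic name and partition count are placeholders. The client's default partitioner already hashes a non-null key to pick a partition, so using the file ID as the message key spreads the load and keeps all messages for one ID on the same partition; picking the partition yourself is only needed if you want your own hashing scheme.

    using System;
    using System.Threading.Tasks;
    using Confluent.Kafka;

    class FileEventProducer
    {
        const string Topic = "file-events";   // hypothetical topic name
        const int PartitionCount = 6;         // must match the topic's actual partition count

        static async Task Main()
        {
            var config = new ProducerConfig { BootstrapServers = "localhost:9092" };
            using var producer = new ProducerBuilder<string, string>(config).Build();

            string fileId = "customer-42";     // the file's identifier
            string payload = "file metadata or a reference to the stored file";

            // Option 1: let the default partitioner hash the key.
            await producer.ProduceAsync(Topic,
                new Message<string, string> { Key = fileId, Value = payload });

            // Option 2: apply your own hash and target the partition explicitly.
            // Note: string.GetHashCode() is not stable across processes in .NET Core,
            // so a production version should use a stable hash instead.
            int partition = (fileId.GetHashCode() & 0x7fffffff) % PartitionCount;
            await producer.ProduceAsync(new TopicPartition(Topic, new Partition(partition)),
                new Message<string, string> { Key = fileId, Value = payload });

            producer.Flush(TimeSpan.FromSeconds(10));
        }
    }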

Related

Using Kafka for batch job replacement

I am a newbie with Kafka and want to explore the possibility of using Kafka to replace our current batch job system.
Current system:
We get lots of feeds every day in flat files (CSV, JSON, TXT and binary) from external vendors via FTPS, SFTP, email, file shares, etc. I am ashamed to say that currently all the logic resides in stored procedures and VBScript. I am trying to modernize the whole pipeline using Apache Kafka to ingest all these feeds. I have explored Kafka and found that I can use Kafka Connect, KSQL and the SpoolDir connector for this purpose; however, I am not very clear on how to go about it.
Question:
I want to devise a system in which I can ingest all the incoming flat files (mentioned earlier) using Kafka. I understand that we can use Kafka connectors and KSQL or the Streams API to achieve this. The part I am not clear on is how to turn this into a repetitive task using Kafka. For example, every morning I get a flat-file feed in a specific folder; how do I automate this process with Kafka, such as scheduling the reading of files at a specific time every day? Do I need some kind of service (Windows service or cron job) to constantly watch the folder for incoming files and process them? Is there a reasonable solution for this?
A reminder that Kafka is not meant for file transfers. You can ingest data about the files (locations and sizes, for example, or extract data from them to produce, rather than whole files), but you'll want to store and process their full contents elsewhere.
The SpoolDir connector will work for local filesystems, but not over FTP. For that, there's another project, kafka-connect-fs.
However, I generally recommend combining Apache NiFi's ListenFTP processor with a PublishKafka processor for something like this.
NiFi also has email (IMAP/POP3) and NFS/Samba (file share) getters that can be scheduled, and it'll handle large files much better than Kafka.
KSQL and the Streams API only work once the data is in Kafka.
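For the local-filesystem case, a SpoolDir source connector can simply be pointed at the drop folder: Kafka Connect keeps polling the directory itself, so no extra cron job or folder-watching service is needed just to notice new files. A rough sketch of such a configuration follows; the paths and topic name are made up, and the option names follow the kafka-connect-spooldir project's documentation, so verify them against the version you install.

    name=csv-feed-source
    connector.class=com.github.jcustenborder.kafka.connect.spooldir.SpoolDirCsvSourceConnector
    tasks.max=1
    topic=vendor-feeds
    input.path=/data/feeds/incoming
    finished.path=/data/feeds/finished
    error.path=/data/feeds/error
    input.file.pattern=^.*\.csv$
    csv.first.row.as.header=true
    schema.generation.enabled=true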

Kafka Message Processing

I am using a Kafka distributed system for message processing in a Spring Boot application. My application is producing messages, on an event basis, to three different topics. There is a separate Spring Boot application that will be used by a data analysis team to analyse the data. This application is a simple report-type application with only one filter, the topic.
Now I have to implement this, but I am a little confused about how I will show the data in the UI. I have written listeners (consumers) that consume the messages, but how do I show the data in the UI in real time? Should I store it in some database like Redis and then show that data in the UI? Is this the correct way to deal with a consumer in Kafka? Won't it be slow, since the number of messages can grow drastically over time?
In a nutshell, I want to know how we can show messages in a UI efficiently and in real time.
Thanks
You can write a consumer to forward to a websocket.
Or you can use Kafka Connect to write to a database, then write a REST API on top of it.
Or use the Kafka Streams Interactive Queries feature and add an RPC layer on top for JavaScript to call.
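A rough sketch of the first option (a consumer that forwards each record to connected WebSocket clients), shown here with the Confluent.Kafka .NET client; the same idea applies to a Spring @KafkaListener. The socket registry, topic name and group id are placeholders, and how dashboard clients register their sockets (for example through an ASP.NET Core endpoint) is left out.

    using System;
    using System.Collections.Concurrent;
    using System.Net.WebSockets;
    using System.Text;
    using System.Threading;
    using System.Threading.Tasks;
    using Confluent.Kafka;

    class DashboardForwarder
    {
        // Hypothetical registry of sockets opened by dashboard clients.
        static readonly ConcurrentBag<WebSocket> Sockets = new();

        static async Task Main()
        {
            var config = new ConsumerConfig
            {
                BootstrapServers = "localhost:9092",
                GroupId = "dashboard-forwarder",
                AutoOffsetReset = AutoOffsetReset.Latest
            };

            using var consumer = new ConsumerBuilder<string, string>(config).Build();
            consumer.Subscribe("report-topic");

            while (true)
            {
                var record = consumer.Consume(CancellationToken.None);
                var payload = Encoding.UTF8.GetBytes(record.Message.Value);

                // Fan the message out to every connected dashboard client.
                foreach (var socket in Sockets)
                {
                    if (socket.State == WebSocketState.Open)
                        await socket.SendAsync(new ArraySegment<byte>(payload),
                                               WebSocketMessageType.Text,
                                               endOfMessage: true, CancellationToken.None);
                }
            }
        }
    }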

Can Apache Kafka be used with a desktop application?

I am creating a new desktop application in WPF. The objective of the application is to read data from a device and show real-time charts in WPF clients. At the same time, I want to save the incoming data to a database. There are a few other operations to be performed on the incoming data in parallel. The connected device produces data at around 1000 messages per second. I was evaluating different message queuing systems for this scenario and came across Apache Kafka as one of the alternatives. I have a few questions about this:
Is this even a valid use case for Kafka? In other words, should we use Kafka for desktop applications at all?
Are there any example projects/POCs for this? I cannot find any.
What about the problem of shipping the application? Since we are not going to have any central server for the desktop applications, we would have to run/deploy Kafka on each user's system.

Kafka user - project design advice

I am new to Kafka and data streaming and need some advice for the following requirement.
Our system is expecting close to 1 million incoming messages per day. Each message carries a project identifier and should be pushed only to the users of that project. For our case, let's say we have projects A, B and C. Users who open project A's dashboard only see/receive messages for project A.
This is my idea so far for implementing a solution to this requirement:
The messages should be pushed to a Kafka topic as they arrive; let's call this topic the root topic. Once pushed to the root topic, the messages can be read by a Kafka consumer/listener, which, based on the project identifier in each message, pushes it to a project-specific topic. So any message can end up on topic A, B or C. I am thinking of using WebSockets to update the project users' dashboards as messages arrive. There will be N consumers/listeners for the N project topics, and these consumers will push the project-specific messages to the project-specific WebSocket endpoints.
Please advise if I can make any improvements to the above design.
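As a rough sketch of the fan-out step described above, a plain consumer-plus-producer pair can do the routing. It is written here with the Confluent.Kafka .NET client; the broker address, topic names and the projectId field are placeholders.

    using System;
    using System.Text.Json;
    using Confluent.Kafka;

    class ProjectRouter
    {
        static void Main()
        {
            var consumerConfig = new ConsumerConfig
            {
                BootstrapServers = "localhost:9092",
                GroupId = "project-router",
                AutoOffsetReset = AutoOffsetReset.Earliest
            };
            var producerConfig = new ProducerConfig { BootstrapServers = "localhost:9092" };

            using var consumer = new ConsumerBuilder<string, string>(consumerConfig).Build();
            using var producer = new ProducerBuilder<string, string>(producerConfig).Build();
            consumer.Subscribe("root-topic");

            while (true)
            {
                var record = consumer.Consume(TimeSpan.FromSeconds(1));
                if (record == null) continue;

                // Assume the payload is JSON with a "projectId" field such as "A".
                using var doc = JsonDocument.Parse(record.Message.Value);
                string projectId = doc.RootElement.GetProperty("projectId").GetString();

                // Re-publish to project-A, project-B, project-C, ...
                producer.Produce($"project-{projectId}",
                    new Message<string, string> { Key = projectId, Value = record.Message.Value });
            }
        }
    }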
I chose Kafka as the messaging system here as it is highly scalable and fault tolerant.
There is no complex transformation or data enrichment before the data gets sent to the client. Would it make sense to use Apache Flink or Hazelcast Jet for the streaming, or is Kafka Streams good enough for this simple requirement?
Also, when should I consider using Hazelcast Jet or Apache Flink in my project?
Should I use Flink, say, when I have to update a few properties in the message based on a web service call or database lookup before sending it to the users?
Should I use Hazelcast Jet only when I need the entire dataset in memory to arrive at a property value, or will using Jet bring some benefit even for my simple use case above? Please advise.
Kafka Streams is a great tool for converting one Kafka topic into another Kafka topic.
What you need is a tool to move data from a Kafka topic to another system via WebSockets.
A stream processor gives you convenient tooling to build this data pipeline (among other things, connectors to Kafka and WebSockets, and a scalable, fault-tolerant execution environment), so you might want to use a stream processor even if you don't transform the data.
The benefit of Hazelcast Jet is its embedded, scalable caching layer. You might want to cache your database/web service calls so that the enrichment is performed locally, reducing remote service calls.
See how to use Jet to read from Kafka and how to write data to a TCP socket (not a WebSocket).
I would like to give you another option. I'm not a Spark/Jet expert at all, but I've been studying them for a few weeks.
I would use Pentaho Data Integration (Kettle) to consume from Kafka, and I would write a Kettle step (or a User Defined Java Class step) to write the messages to a Hazelcast IMap.
Then I would use this approach http://www.c2b2.co.uk/middleware-blog/hazelcast-websockets.php to provide the WebSockets for the end users.

Kafka Connect or Kafka client

I need to fetch messages from Kafka topics and notify other systems via HTTP-based APIs. That is, get a message from a topic, map it to the third-party APIs, and invoke them. I intend to write a Kafka sink connector for this.
For this use case, is Kafka Connect the right choice, or should I go with a Kafka client?
Use Kafka clients when you have full control over your code, you are an expert developer, you want to connect an application to Kafka, and you can modify the application's code in order to:
push data into Kafka
pull data from Kafka.
https://cwiki.apache.org/confluence/display/KAFKA/Clients
Use Kafka Connect when you don't have control over third-party code, you are new to Kafka, or you have to connect Kafka to datastores whose code you can't modify.
Kafka Connect’s scope is narrow: it focuses only on copying streaming data to and from Kafka and does not handle other tasks.
http://docs.confluent.io/2.0.0/connect/
I am adding a few lines from other blogs to explain the differences:
Companies that want to adopt Kafka write a bunch of code to publish their data streams. What we’ve learned from experience is that doing this correctly is more involved than it seems. In particular, there are a set of problems that every connector has to solve:
• Schema management: The ability of the data pipeline to carry schema information where it is available. In the absence of this capability, you end up having to recreate it downstream. Furthermore, if there are multiple consumers for the same data, then each consumer has to recreate it. We will cover the various nuances of schema management for data pipelines in a future blog post.
• Fault tolerance: Run several instances of a process and be resilient to failures
• Parallelism: Horizontally scale to handle large scale datasets
• Latency: Ingest, transport and process data in real-time, thereby moving away from once-a-day data dumps.
• Delivery semantics: Provide strong guarantees when machines fail or processes crash
• Operations and monitoring: Monitor the health and progress of every data integration process in a consistent manner
These are really hard problems in their own right; it just isn't feasible to solve them separately in each connector. Instead, you want a single infrastructure platform that connectors can build on and that solves these problems in a consistent way.
Until recently, adopting Kafka for data integration required significant developer expertise; developing a Kafka connector required building on the client APIs.
https://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines/
Kafka Connect will work well for this purpose, but this would also be a pretty straightforward consumer application as well because consumers also have the benefits of fault tolerance/scalability and in this case you're probably just doing simple message-at-a-time processing within each consumer instance. You can also easily use enable.auto.commit for this application, so you will not encounter the tricky parts of using the consumer directly. The main thing using Kafka Connect would give you compared to using the consumer in this case would be that the connector could be made generic for different input formats, but that may not be important to you for a custom connector.
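For comparison, a minimal sketch of that plain-consumer alternative, written here with the Confluent.Kafka .NET client: one message at a time, offsets auto-committed, and each record mapped to a third-party HTTP call. The topic, group id and endpoint URL are placeholders.

    using System;
    using System.Net.Http;
    using System.Text;
    using System.Threading.Tasks;
    using Confluent.Kafka;

    class HttpNotifier
    {
        static async Task Main()
        {
            var config = new ConsumerConfig
            {
                BootstrapServers = "localhost:9092",
                GroupId = "http-notifier",
                EnableAutoCommit = true,
                AutoOffsetReset = AutoOffsetReset.Earliest
            };

            using var httpClient = new HttpClient();
            using var consumer = new ConsumerBuilder<string, string>(config).Build();
            consumer.Subscribe("notifications");

            while (true)
            {
                var record = consumer.Consume(TimeSpan.FromSeconds(1));
                if (record == null) continue;

                // Map the Kafka record to the third-party API's payload and invoke it.
                var content = new StringContent(record.Message.Value, Encoding.UTF8, "application/json");
                var response = await httpClient.PostAsync("https://third-party.example.com/api/events", content);
                response.EnsureSuccessStatusCode();
            }
        }
    }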
You should use a Kafka Connect sink when you are using a Kafka Connect source to produce messages to a specific topic.
For example, when you are using a file source, you should use a file sink to consume what the source has produced; or when you are using a JDBC source, you should use a JDBC sink to consume what you have produced.
Because the schemas of the producer and the sink consumer should be compatible, you should use a compatible source and sink on both sides.
If in some cases the schemas are not compatible, you can use the SMT (Single Message Transform) capability, available since Kafka 0.10.2, which lets you apply message transforms to move data between otherwise incompatible producers and consumers.
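As an illustration of the SMT capability mentioned above, a connector configuration can rewrite records in flight using one of the transforms that ship with Kafka. The connector below and its field names are made up for the example; only the ReplaceField transform and its renames option come from the standard Kafka Connect transforms.

    name=jdbc-sink-example
    connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
    topics=orders
    transforms=RenameFields
    transforms.RenameFields.type=org.apache.kafka.connect.transforms.ReplaceField$Value
    # Map the producer-side field name to the one the sink schema expects.
    transforms.RenameFields.renames=order_ts:created_at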
Note: if you want to transfer messages more efficiently, I suggest you use Avro and the Schema Registry.
If you can code in Java, you can use Kafka Streams, the Spring-Kafka project or other stream processing to achieve what you want.
In the book Kafka in Action this is explained as follows:
The purpose of Kafka Connect is to help move data in or out of Kafka without having to deal with writing our own producers and clients. Connect is a framework that is already part of Kafka and that really can make it simple to use pieces that have already been built to start your streaming journey.
As for your problem: firstly, one of the simplest questions you should ask is whether you can modify the application code of the systems you need to exchange data with.
Secondly, if you have the in-depth knowledge and ability to write a custom connector, and that connector will be used by others, it is worth it, because it may help people who are not experts in those systems. Otherwise, if the connector would be used only by yourself, I think you should write a Kafka client instead, since that gives you more flexibility and an easier implementation.