Using Kafka for batch job replacement

I am a newbie with Kafka and want to explore the possibility of using it to replace our current batch job system.
Current system:
We receive many feeds every day as flat files (CSV, JSON, TXT and binary) from external vendors via FTPS, SFTP, email, file shares, etc. I am ashamed to say that all of the logic currently resides in stored procedures and VBScript. I am trying to modernize the whole pipeline using Apache Kafka to ingest these feeds. I have explored Kafka and found that I can use Kafka Connect, KSQL and the SpoolDir connector for this purpose, but I am not very clear on how to go about it.
Question:
I want to devise a system in which I can ingest all of the incoming flat files mentioned above using Kafka. I understand that we can use Kafka connectors and KSQL or the Streams API to achieve this. The part I am not clear on is how to turn this into a repetitive task with Kafka. For example, every morning a flat file feed arrives in a specific folder; how do I automate this with Kafka, such as scheduling the reading of files at a specific time every day? Do I need some kind of service (Windows service or cron job) to constantly watch the folder for incoming files and process them? Is there a reasonable way to go about this?

A reminder that Kafka is not meant for file transfers. You can ingest data about the files (locations and sizes, for example, or records extracted from them) rather than whole files, but you'll want to store and process their full contents elsewhere.
The SpoolDir connector will work for local filesystems, but not over FTP; for that, there is the separate kafka-connect-fs project.
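For the locally dropped files, the SpoolDir connector keeps watching a directory and turns each new file into records, so no separate cron job or Windows service is needed for that part. A minimal config sketch for the CSV variant (paths and the topic name are placeholders; double-check the property names against the SpoolDir documentation):

    name=vendor-csv-feed
    connector.class=com.github.jcustenborder.kafka.connect.spooldir.SpoolDirCsvSourceConnector
    # topic the parsed rows are produced to (placeholder)
    topic=vendor-feed
    # folder the vendor files land in
    input.path=/data/feeds/incoming
    # successfully processed files are moved here
    finished.path=/data/feeds/processed
    # unparseable files are moved here
    error.path=/data/feeds/errors
    input.file.pattern=.*\.csv
    csv.first.row.as.header=true
    schema.generation.enabled=true

Note that most Kafka Connect sources poll continuously rather than on a cron-style schedule; scheduled pickup at a fixed time of day is more of a NiFi strength.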
However, I generally recommend combining Apache NiFi's ListenFTP processor with a PublishKafka processor for something like this.
NiFi also has email (IMAP/POP3) and NFS/Samba (file share) getters that can be scheduled, and it handles large files much better than Kafka.
KSQL and the Streams API only work once the data is already in Kafka.


Load testing a Kafka consumer

I'm trying to figure out how to load test a Kafka consumer.
In my application, the consumer reads messages from Kafka and does a lot of work, most of it writing to a database.
Since it's an important process for my team, I would like to load test the consumer and get some report on how the consumption performed.
The end goal is to generate the report in our CI so that we can see how consumption evolves for the same message load.
Sadly I really don't see how I can achieve such a thing.
Would you have any idea how I could do this?
For now, I'm thinking about duplicating the production topic in a dedicated environment, and every time I want to execute my load tests I would move the offsets.
This would not help me get a report on the consumption.
Thanks for reading.
Having a separate "load test" topic is a good idea. Depending on the topic's retention policy (size/time), you can simply delete the consumer offsets of the application you want to test and have it start consuming from "earliest".
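The reset itself can be done with the tooling shipped with Kafka, roughly like this (group and topic names are placeholders, and the consumer group must be inactive while you reset it):

    bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
      --group my-consumer-app --topic load-test-topic \
      --reset-offsets --to-earliest --execute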
I don't know your architecture, but I would highly recommend continually monitoring your application: write proper metrics and keep an eye on them. CloudWatch is the obvious choice when running on AWS, but there are plenty of other services you can publish metric data to (Grafana Cloud, New Relic, etc.).
If you want to run load tests continually as part of your CI pipeline, you should opt for a fixed input, for example a fixed set of 100k messages used only for testing. Otherwise your results won't be deterministic and will be hard to compare. It also helps a lot if your "core" data processing can run without depending on Kafka itself: be input agnostic, so messages can come from a file, a database or a Kafka topic.
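As a rough illustration of that last point, the core logic can be written against a small interface so the load test can feed it from a fixed file instead of a topic (all names here are made up):

    import java.util.List;

    // The "core" processing only sees plain messages, never Kafka records.
    interface MessageSource {
        List<String> nextBatch(); // empty list means no more input
    }

    class Processor {
        private final MessageSource source;

        Processor(MessageSource source) {
            this.source = source;
        }

        void run() {
            List<String> batch;
            while (!(batch = source.nextBatch()).isEmpty()) {
                batch.forEach(this::handle);
            }
        }

        private void handle(String message) {
            // business logic / database writes go here
        }
    }

In CI the MessageSource reads the fixed 100k-message set; in production it wraps a Kafka consumer, so the reported numbers describe your processing rather than the broker.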

Implement a custom partitioner in Kafka .NET

I want to use Kafka from a .NET application. I have created a producer and a consumer using .NET, and they are working fine.
But I have to use Kafka for real-time processing of files, so I am a little confused. What architecture should I follow, and how do I design the Kafka application so that it is fault tolerant?
I have multiple IDs for the files, and files can arrive at any time and in any number. So I want to use a hashing algorithm to distribute the load across the Kafka partitions.
I have googled many times but have not found an application or sample that uses this functionality in a .NET application.
So please help me design and implement the desired algorithm using .NET.
Basically, we have endpoints (web services) that are given to clients, and the clients push files at regular intervals. We have a Windows service that regularly reads the files from the database and parses them for further processing. All applications are in .NET. We want to put the files into Kafka, and the consumer (the Windows service) will read them from Kafka. The producers will be our endpoints.
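For reference, the partitioning contract is easiest to show with the Java client's Partitioner interface; the sketch below hashes the key (your file ID) to a partition. In the Confluent .NET client, which wraps librdkafka, you would normally get the same effect by using the file ID as the message key and selecting a hash-based partitioner through the producer configuration, so check that client's documentation for the exact option.

    import java.util.Map;
    import org.apache.kafka.clients.producer.Partitioner;
    import org.apache.kafka.common.Cluster;
    import org.apache.kafka.common.utils.Utils;

    // Records with the same key (file ID) always go to the same partition.
    public class FileIdPartitioner implements Partitioner {

        @Override
        public int partition(String topic, Object key, byte[] keyBytes,
                             Object value, byte[] valueBytes, Cluster cluster) {
            int numPartitions = cluster.partitionsForTopic(topic).size();
            if (keyBytes == null) {
                return 0; // no key: fall back to a fixed partition
            }
            // murmur2 is the hash the default partitioner uses for keyed records
            return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
        }

        @Override
        public void close() {}

        @Override
        public void configure(Map<String, ?> configs) {}
    }

In the Java client this class is registered on the producer via the partitioner.class property.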

Kafka user - project design advice

I am new to Kafka and data streaming and need some advice on the following requirement.
Our system expects close to 1 million incoming messages per day. Each message carries a project identifier, and a message should be pushed only to users of that project. For our case, let's say we have projects A, B and C: users who open project A's dashboard only see/receive messages for project A.
This is my idea so far for implementing a solution:
The messages are pushed to a Kafka topic as they arrive; let's call it the root topic. Once on the root topic, the messages are read by a Kafka consumer/listener which, based on the project identifier in each message, pushes it to a project-specific topic, so any message ends up on topic A, B or C. I'm thinking of using WebSockets to push the messages to the project users' dashboards as they arrive. There will be N consumers/listeners for the N project topics, and these consumers will push the project-specific messages to the project-specific WebSocket endpoints.
Please advise if I can make any improvements to the above design.
I chose Kafka as the messaging system here because it is highly scalable and fault tolerant.
There is no complex transformation or data enrichment before the data is sent to the client. Does it make sense to use Apache Flink or Hazelcast Jet for the streaming, or is Kafka Streams good enough for this simple requirement?
Also, when should I consider using Hazelcast Jet or Apache Flink in my project?
Should I use Flink, say, when I have to update a few properties in the message based on a web service call or database lookup before sending it to the users?
Should I use Hazelcast Jet only when I need the entire dataset in memory to derive a property value, or would Jet bring some benefit even for the simple use case above? Please advise.
Kafka Streams is a great tool for converting one Kafka topic into another Kafka topic.
What you need is a tool that moves data from a Kafka topic to another system via WebSockets.
A stream processor gives you convenient tooling for building this data pipeline (among other things, connectors to Kafka and WebSockets, plus a scalable, fault-tolerant execution environment), so you might want to use a stream processor even if you don't transform the data.
The benefit of Hazelcast Jet is its embedded, scalable caching layer. You might want to cache your database/web service calls so that the enrichment is performed locally, reducing remote service calls.
See how to use Jet to read from Kafka and how to write data to a TCP socket (not a WebSocket).
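If plain Kafka Streams turns out to be enough for the routing part, it can fan out from the root topic to the project topics with a dynamic topic extractor. A minimal sketch, assuming the project identifier is the record key and the per-project topics already exist (all names are illustrative):

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;

    public class ProjectRouter {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "project-router");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            builder.<String, String>stream("root-topic")
                   // e.g. key "A" -> topic "project-A"; Streams does not create these topics
                   .to((key, value, recordContext) -> "project-" + key);

            new KafkaStreams(builder.build(), props).start();
        }
    }

The WebSocket push itself still happens in the consumers of those project topics, as described in the question.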
I would like to give you another option. I'm not a Spark/Jet expert at all, but I've been studying them for a few weeks.
I would use Pentaho Data Integration (Kettle) to consume from Kafka, and I would write a Kettle step (or a User Defined Java Class step) to write the messages to a Hazelcast IMap.
Then I would use this approach http://www.c2b2.co.uk/middleware-blog/hazelcast-websockets.php to provide the WebSockets for the end users.

Real Time Streaming With Multiple Data Sources Using Kafka

We are planning to build a real-time monitoring system with Apache Kafka. The overall idea is to push data from multiple data sources to Kafka and perform data quality checks. I have a few questions about this architecture:
What are the best approaches for streaming data to Apache Kafka from multiple sources, which mainly include Java applications, an Oracle database, REST APIs and log files? Note that each client deployment includes each of these data sources, so the number of data sources pushing data to Kafka would equal the number of customers * x, where x is the number of data source types listed above. Ideally a push approach would suit us better than a pull approach; with a pull approach, the target system would have to be configured with the credentials of the various source systems, which would not be practical.
How do we handle failures?
How do we perform data quality checks on the incoming messages? For example, if a message does not have all the required attributes, it could be discarded and an alert raised for the maintenance team to check.
Kindly share your expert input. Thanks!
I think the best approach here is to use Kafka Connect: link
But it's a pull approach:
Kafka Connect sources are pull-based for a few reasons. First, although connectors should generally run continuously, making them pull-based means that the connector/Kafka Connect decides when data is actually pulled, which allows for things like pausing connectors without losing data, brief periods of unavailability as connectors are moved, etc. Second, in distributed mode the tasks that pull data may need to be rebalanced across workers, which means they won't have a consistent location or address. While in standalone mode you could guarantee a fixed network endpoint to work with (and point other services at), this doesn't work in distributed mode where tasks can be moving around between workers. Ewen
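On the third point (data quality), a common pattern once the data is in Kafka is to run a small validation step that forwards good records and sends the rest to a dead-letter topic that your alerting watches. A minimal Kafka Streams sketch, where the topic names and the validity rule are only placeholders:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class QualityCheck {
        // Placeholder rule: "valid" means the required attributes are present.
        static boolean isValid(String value) {
            return value != null && value.contains("\"id\"") && value.contains("\"timestamp\"");
        }

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "quality-check");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> raw = builder.stream("raw-events");
            raw.filter((k, v) -> isValid(v)).to("clean-events");            // good records
            raw.filterNot((k, v) -> isValid(v)).to("dead-letter-events");   // alert on these

            new KafkaStreams(builder.build(), props).start();
        }
    }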

Kafka Connect or Kafka client

I need to fetch messages from Kafka topics and notify other systems via HTTP-based APIs. That is, get a message from a topic, map it to the third-party API and invoke it. I intend to write a Kafka sink connector for this.
For this use case, is Kafka Connect the right choice, or should I go with a Kafka client?
Use Kafka clients when you have full control over your code, you are an experienced developer, you want to connect an application to Kafka, and you can modify the application's code to:
push data into Kafka
pull data from Kafka.
https://cwiki.apache.org/confluence/display/KAFKA/Clients
Use Kafka Connect when you don't have control over third-party code, you are new to Kafka, or you have to connect Kafka to datastores whose code you cannot modify.
Kafka Connect’s scope is narrow: it focuses only on copying streaming data to and from Kafka and does not handle other tasks.
http://docs.confluent.io/2.0.0/connect/
I am adding a few lines from other blogs to explain the differences:
Companies that want to adopt Kafka write a bunch of code to publish their data streams. What we’ve learned from experience is that doing this correctly is more involved than it seems. In particular, there are a set of problems that every connector has to solve:
• Schema management: The ability of the data pipeline to carry schema information where it is available. In the absence of this capability, you end up having to recreate it downstream. Furthermore, if there are multiple consumers for the same data, then each consumer has to recreate it. We will cover the various nuances of schema management for data pipelines in a future blog post.
• Fault tolerance: Run several instances of a process and be resilient to failures
• Parallelism: Horizontally scale to handle large scale datasets
• Latency: Ingest, transport and process data in real-time, thereby moving away from once-a-day data dumps.
• Delivery semantics: Provide strong guarantees when machines fail or processes crash
• Operations and monitoring: Monitor the health and progress of every data integration process in a consistent manner
These are really hard problems in their own right, it just isn’t feasible to solve them separately in each connector. Instead you want a single infrastructure platform connectors can build on that solves these problems in a consistent way.
Until recently, adopting Kafka for data integration required significant developer expertise; developing a Kafka connector required building on the client APIs.
https://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines/
Kafka Connect will work well for this purpose, but this would also be a pretty straightforward consumer application as well because consumers also have the benefits of fault tolerance/scalability and in this case you're probably just doing simple message-at-a-time processing within each consumer instance. You can also easily use enable.auto.commit for this application, so you will not encounter the tricky parts of using the consumer directly. The main thing using Kafka Connect would give you compared to using the consumer in this case would be that the connector could be made generic for different input formats, but that may not be important to you for a custom connector.
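A rough sketch of that consumer-based approach (topic, group and endpoint are made-up values), using auto-commit and simple one-call-per-record processing:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class NotifierConsumer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "http-notifier");      // illustrative
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true");     // as suggested above
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            HttpClient http = HttpClient.newHttpClient();
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("notifications"));                // assumed topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        // Map the record to the third-party API call (endpoint is an assumption).
                        HttpRequest request = HttpRequest.newBuilder()
                                .uri(URI.create("https://third-party.example/api/events"))
                                .header("Content-Type", "application/json")
                                .POST(HttpRequest.BodyPublishers.ofString(record.value()))
                                .build();
                        http.send(request, HttpResponse.BodyHandlers.ofString());
                    }
                }
            }
        }
    }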
You should use a Kafka Connect sink when you are using a Kafka Connect source to produce messages to a specific topic.
For example, when you use a file source you would use a file sink to consume what the source has produced, or when you use a JDBC source you would use a JDBC sink to consume what you have produced.
Because the schemas of the producing source and the consuming sink should be compatible, you should use a compatible source and sink on both sides.
If in some cases the schemas are not compatible, you can use the SMT (Single Message Transforms) capability, available since Kafka 0.10.2, to write message transforms that translate records between incompatible producers and consumers.
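For a feel of what an SMT looks like, this is roughly how a field rename is declared in a connector's configuration (the transform alias and field names are just examples):

    # rename a field on every record as it passes through the connector
    transforms=renameFields
    transforms.renameFields.type=org.apache.kafka.connect.transforms.ReplaceField$Value
    transforms.renameFields.renames=old_name:new_name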
Note: if you want to transfer messages more efficiently, I suggest using Avro and Schema Registry.
If you can code in Java, you can use Kafka Streams, the Spring-Kafka project, or another stream processing framework to achieve what you want.
The book Kafka in Action explains it as follows:
The purpose of Kafka Connect is to help move data in or out of Kafka without having to deal with writing our own producers and clients. Connect is a framework that is already part of Kafka and that really can make it simple to use pieces that have already been built to start your streaming journey.
As for your problem: first, one of the simplest questions to ask is whether you can modify the application code of the systems you need to exchange data with.
Second, if you have the in-depth knowledge and the ability to write a custom connector, and that connector will be used by others, it is worth it, because it may help people who are not experts in those systems. Otherwise, if the connector would only ever be used by you, I think you are better off with the Kafka client, since you get more flexibility and an easier implementation.