I need to send file data to an FTP server using a Kafka Connect sink. After an entire file is received by the server, I also have to send an acknowledgment to change the status of those tasks in the DB.
I would like to know the best way to go here. I initially thought about creating a custom FTP Kafka Connect sink that would also change the task status in the DB.
Is that the best way to go, or are there other options?
What's the requirement driving the need to update the status in the task database?
Off the top of my head, you could write a custom application that tracks the offsets of the FTP sink's consumer group against the available messages (the end offsets of the topic) and updates the database as needed.
It's also worth noting that Confluent Control Center (https://www.confluent.io/product/control-center/) can track the delivery of messages through a Kafka Connect pipeline and alert on latencies.
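Here's a rough sketch of that offset-tracking idea in Python with kafka-python. The consumer group name (Connect sinks default to connect-<connector name>), the topic name, and the update_task_status helper are assumptions, not your actual names:

```python
# Rough sketch (not production code): compare the FTP sink connector's committed
# offsets with the topic's end offsets and, once the sink has caught up, update
# the task status in the DB.
from kafka import KafkaAdminClient, KafkaConsumer

BOOTSTRAP = "localhost:9092"
SINK_GROUP = "connect-ftp-sink"   # Connect sinks default to "connect-<connector name>"
TOPIC = "ftp-files"               # topic the sink reads from (assumption)

def update_task_status(partition, status):
    # Placeholder for the real DB update (e.g. an UPDATE against your tasks table).
    print(f"partition {partition}: tasks -> {status}")

admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP)
consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP)

# Offsets the sink's consumer group has committed so far
committed = admin.list_consumer_group_offsets(SINK_GROUP)

# Latest available offsets for the same partitions
partitions = [tp for tp in committed if tp.topic == TOPIC]
end_offsets = consumer.end_offsets(partitions)

for tp in partitions:
    lag = end_offsets[tp] - committed[tp].offset
    if lag == 0:
        # Everything produced for this partition has been written out by the sink,
        # so acknowledge the corresponding tasks in the database.
        update_task_status(tp.partition, status="DELIVERED")
```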
I am a newbie with Kafka and want to explore the possibility of using Kafka to replace the batch job system we currently have in place.
Current system:
We get lots of feeds every day as flat files (CSV, JSON, TXT, and binary) from external vendors via FTPS, SFTP, email, fileshares, etc. I am ashamed to say that currently all of the logic resides in stored procedures and VBScript. I am trying to modernize the whole pipeline using Apache Kafka to ingest all of these feeds. I have explored Kafka and found that I could use Kafka Connect, KSQL, and the SpoolDir connector for this purpose; however, I am not very clear on how to go about it.
Question:
I want to devise a system in which I can ingest all of the incoming flat files (the ones mentioned above) using Kafka. I understand that we can use Kafka connectors and KSQL or the Streams API to achieve this. The part I am not clear on is how to turn this into a repetitive task with Kafka. For example, every morning I get a flat-file feed in a specific folder; how do I automate that with Kafka, i.e. schedule the reading of files at a specific time every day? Do I need some kind of service (a Windows service or cron job) to constantly watch the folder for incoming files and process them? Is there a reasonable way to go about this?
A reminder that Kafka is not meant for file transfers. You can ingest data about the files (locations and sizes, for example, or data extracted from them to produce, rather than whole files), but you'll want to store and process their full contents elsewhere (see the sketch below).
The SpoolDir connector will work for local filesystems, but not over FTP. For that, there's a separate kafka-connect-fs project.
However, I generally recommend combining Apache NiFi's ListenFTP processors with its PublishKafka processors for something like this.
NiFi also has email (IMAP/POP3) and NFS/Samba (fileshare) getters that can be scheduled, and it handles large files much better than Kafka.
KSQL and the Streams API only work once the data is already in Kafka.
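As a minimal sketch of the "data about the files, not the files" idea (and of the scheduling question): a small Python script run from cron or a Windows scheduled task that publishes one metadata record per file in the drop folder. The folder path and topic name are assumptions:

```python
# Publish one metadata record per file in the drop folder; the file contents
# stay on disk / object storage. Run from cron or a Windows scheduled task.
import json
from pathlib import Path

from kafka import KafkaProducer

DROP_DIR = Path("/data/incoming")      # folder the vendors deliver into (assumption)
TOPIC = "incoming-file-metadata"

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for path in DROP_DIR.glob("*"):
    if not path.is_file():
        continue
    stat = path.stat()
    # Only locations, sizes and timestamps go into Kafka, not the file itself.
    producer.send(TOPIC, {
        "path": str(path),
        "size_bytes": stat.st_size,
        "modified_at": stat.st_mtime,
    })

producer.flush()
```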
I have a system that uses MongoDB for persistence and RabbitMQ as the message broker. My challenge is that I want to implement the transactional outbox only for RabbitMQ publish-failure scenarios. I'm not sure whether that is possible, because I also have consumers that use the same MongoDB persistence: when I write code that covers the outbox for publish-failure scenarios, published messages reach the consumers before the MongoDB commitTransaction, so my consumer can't find the message in MongoDB because of the latency.
My code looks something like this:
1- start a session transaction
2- insert the document with the session (so it doesn't persist until I call commit)
3- publish to RabbitMQ
4- if the publish succeeds, commitTransaction
5- if it errors, insert into the outbox document with the session, then commitTransaction
6- if something goes wrong on MongoDB, abortTransaction (if the publish succeeded and MongoDB failed, my consumers first check that the document exists in MongoDB, and if it doesn't they do nothing)
So the problem is that messages reach the consumer earlier than the MongoDB persistence. Can you advise a solution that covers my problem?
As far as I can tell, the architecture outlined in the picture at https://microservices.io/patterns/data/transactional-outbox.html maps directly onto MongoDB change streams:
keep the transaction around
insert into the outbox collection in the transaction
set up a message relay process which requests a change stream on the outbox collection and, for every inserted document, publishes a message to the message broker
The publication to the message broker can be retried, and the change stream reads can also be retried in case of errors. You need to track resume tokens correctly; see e.g. https://docs.mongodb.com/ruby-driver/master/reference/change-streams/#resuming-a-change-stream.
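A minimal sketch of such a relay process in Python (pymongo + pika); the database, collection, and queue names are assumptions, and the resume token is persisted in a small state collection:

```python
# Watch the outbox collection with a MongoDB change stream, publish each inserted
# document to RabbitMQ, and persist the resume token so the relay can pick up
# where it left off after a restart.
import json

import pika
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
outbox = client["app"]["outbox"]
tokens = client["app"]["relay_state"]      # where we persist the resume token

channel = pika.BlockingConnection(pika.ConnectionParameters("localhost")).channel()
channel.queue_declare(queue="events", durable=True)

saved = tokens.find_one({"_id": "outbox-relay"})
resume_after = saved["token"] if saved else None

with outbox.watch(pipeline=[{"$match": {"operationType": "insert"}}],
                  resume_after=resume_after) as stream:
    for change in stream:
        doc = change["fullDocument"]
        # Publishing can be retried; delivery is at-least-once, so consumers
        # need to be idempotent.
        channel.basic_publish(exchange="", routing_key="events",
                              body=json.dumps(doc, default=str))
        # Only after a successful publish do we advance the resume token.
        tokens.replace_one({"_id": "outbox-relay"},
                           {"_id": "outbox-relay", "token": change["_id"]},
                           upsert=True)
```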
Limitations of this approach:
only one message relay process, so no scalability and no redundancy: if it dies, you won't get notifications until it comes back
Your proposed solution has a different set of issues: for example, by publishing notifications before committing, you open yourself up to the possibility of the notification processor not being able to find the document it got from the message broker, as you said.
So I would like to share my solution.
Unfortunately, it's not possible to implement the transactional outbox pattern only for failure scenarios.
What I decided is to build the architecture around high availability:
MongoDB as highly available persistence and RabbitMQ as a highly available message broker.
I removed all of the session transactions I had written before and implemented an immediate write and publish.
In the worst-case scenario:
1- insert the document (success)
2- RabbitMQ publish (failed)
3- insert into the outbox (failed)
What I will have then is unpublished documents in MongoDB. Even in this worst case I could re-publish the messages from MongoDB with another application, but I won't write that application until I actually face that case, because we cannot cover every failure scenario in code. So our message brokers and persistence must be highly available.
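For reference, the immediate write-and-publish flow described above looks roughly like this in Python (pymongo + pika); the collection and queue names are assumptions:

```python
# Immediate write and publish, with the outbox insert only as a fallback when
# the publish fails.
import json

import pika
from pika.exceptions import AMQPError
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["app"]

def save_and_publish(channel, doc):
    # 1- insert the document first, so consumers can always find it
    db["documents"].insert_one(doc)
    try:
        # 2- publish to RabbitMQ
        channel.basic_publish(exchange="", routing_key="events",
                              body=json.dumps(doc, default=str))
    except AMQPError:
        # 3- publish failed: record it in the outbox so it can be re-published later.
        #    If this insert also fails, the document stays unpublished, which is the
        #    worst case described above.
        db["outbox"].insert_one({"payload": doc, "published": False})

channel = pika.BlockingConnection(pika.ConnectionParameters("localhost")).channel()
channel.queue_declare(queue="events", durable=True)
save_and_publish(channel, {"task_id": 42, "status": "NEW"})
```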
I am working on a personal project in which I want to be able to send one message from a producer to an end-user.
Each message will have a key that identifies the user that has to receive the message.
This is the overall structure I have imagined:
I cannot figure out how to tell the load balancer that whenever a user with, for example, key 2 contacts it, a connection (possibly a WebSocket) has to be set up with the consumer handling the partitions that contain key 2. Something could probably be done by using the same technique Kafka uses to assign a key to a partition, or by keeping track of the keys each consumer manages.
I do not know whether this is possible, but even if it were, the technique I described would probably couple the code too tightly to the architecture.
Could you please help me figure out how to achieve this? I do not want to store messages in a remote data store and retrieve them from a random consumer. I want the consumer to be able to serve the user as soon as possible, whenever a connection is established with it. If there is no connection for that user, then I can store the message and deliver it when the connection is ready.
I eventually found the push messaging technique used at Netflix helpful. The trick is to add another level of indirection, made up of web servers. Whenever a new client connects to one of the web servers, a tuple <client_id, webserver_id> is saved in an external data store. When a consumer needs to send a message to the client with that specific key, it looks the key up in the external registry to find where the client is connected. Once it finds it, it sends the message to the right web server, which pushes the message to the client over the open connection.
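A rough sketch of the consumer side of that registry lookup in Python; Redis as the external data store and the /push endpoint on the web servers are assumptions, not part of the Netflix design itself:

```python
# A Kafka consumer reads messages keyed by client_id, looks up which web server
# the client is connected to in the external registry, and forwards the message
# to that web server, which pushes it over the open WebSocket.
import requests
import redis
from kafka import KafkaConsumer

registry = redis.Redis(host="localhost", port=6379)
consumer = KafkaConsumer("user-messages", bootstrap_servers="localhost:9092",
                         group_id="pushers")

for record in consumer:
    client_id = record.key.decode()
    webserver = registry.get(f"client:{client_id}")   # the <client_id, webserver_id> entry
    if webserver is None:
        # Client not currently connected: store the message for later delivery instead.
        continue
    # Hypothetical push endpoint exposed by each web server.
    requests.post(f"http://{webserver.decode()}/push",
                  json={"client_id": client_id, "payload": record.value.decode()})
```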
There are several applications that have to be integrated and need to exchange Issues. One of them gets an Issue, does something, and later changes the Status of that Issue. The other applications involved with the Issue should then get the new information. This continues until the Issue reaches its final Status, Closed. The problem is that the Issues have to be mapped, because the applications do not all support the same data format.
I'm not sure whether to always send the whole Issue or just the new Status as an event.
How does Kafka support data transformation?
What if my Issue has an attachment (>5 MB)?
Thanks for your advice.
Yes, it does make sense.
Kafka can do transformations through both the Kafka Streams API and KSQL, which is a streaming SQL engine built on top of Kafka Streams.
Typically Kafka is used for smaller messages; one pattern to consider for larger content is to store it in an object store (e.g. S3, or similar depending on your chosen architecture) and include a pointer to it in your Kafka message.
I'm not sure whether to always send the whole Issue or just the new Status as an event.
You can do this either way. If you send the whole Issue and then publish all subsequent updates to the same Issue as Kafka messages with a common message key (perhaps a unique Issue ID), then you can configure your Kafka topic as a compacted topic, and the brokers will automatically delete older copies of the data to save disk space. A sketch of this keyed-publish approach follows below.
If you choose to only send deltas (changes), then you need to be careful to set a retention period long enough that the initial complete record never expires while the Issue is still open and receiving updates. The default retention period is 7 days.
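A minimal sketch of the keyed, full-state publish in Python (kafka-python); the topic would be created with cleanup.policy=compact, and the topic and field names here are assumptions:

```python
# Always publish the full Issue state, keyed by the Issue ID, to a compacted topic.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

issue = {"id": "ISSUE-1234", "status": "IN_PROGRESS", "assignee": "team-b"}

# Every update to ISSUE-1234 uses the same key, so log compaction eventually keeps
# only the most recent full copy of the Issue on the "issues" topic.
producer.send("issues", key=issue["id"], value=issue)
producer.flush()
```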
How does Kafka support data transformation?
Yes. In Kafka Connect via Single Message Transforms (SMT), or in Kafka Streams using native Streams code (in Java).
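Neither of those is shown here (SMTs are Connect configuration and Kafka Streams is Java), but to illustrate the shape of the mapping step itself, here is a plain consume-transform-produce loop in Python; the topics and field names are hypothetical:

```python
# Not an SMT or Kafka Streams: just a plain consume-transform-produce loop showing
# a format mapping between two applications' Issue schemas.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer("issues-app-a", bootstrap_servers="localhost:9092",
                         group_id="issue-mapper",
                         value_deserializer=lambda v: json.loads(v.decode("utf-8")))
producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda v: json.dumps(v).encode("utf-8"))

for record in consumer:
    issue = record.value
    # Map application A's schema to application B's schema.
    mapped = {"issueKey": issue["id"], "state": issue["status"].upper()}
    producer.send("issues-app-b", mapped)
```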
What if my Issue has an attachment (>5 MB)?
You can configure Kafka for large messages, but if they are much larger than 5 or 10 MB, it's usually better to follow the claim check pattern: store them externally to Kafka and just publish a reference link to the externally stored data, so the consumer can retrieve the attachment out of band from Kafka.
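A rough sketch of that claim check pattern in Python (boto3 + kafka-python); the bucket, topic, and file names are assumptions:

```python
# Upload the attachment to object storage and publish only a reference in the
# Kafka message, so the message itself stays small.
import json

import boto3
from kafka import KafkaProducer

s3 = boto3.client("s3")
producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda v: json.dumps(v).encode("utf-8"))

attachment_key = "issues/ISSUE-1234/design.pdf"
s3.upload_file("design.pdf", "issue-attachments", attachment_key)

# Consumers fetch the attachment out of band from S3 using this reference.
producer.send("issues", {
    "id": "ISSUE-1234",
    "status": "IN_PROGRESS",
    "attachment": {"bucket": "issue-attachments", "key": attachment_key},
})
producer.flush()
```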
I know how to get data into Kafka either with some file agent or programmatically using any of the clients, but I'm asking from an architectural point of view...
It can't just be collecting HTTP logs.
I'm assuming that when someone clicks a link or does something of interest, we can use some kind of AJAX/JavaScript call to a microservice to capture the extra info that we want? That's not always "reliable" per se, but do we care?
Or, while the given "action" posts back to the server, do we simultaneously write to Kafka and perform the other action?
It's not clear from your question whether you are trying to collect all the clickstream logs from a set of web servers, or trying to selectively publish some data to Kafka from your web app, so I will answer both.
The easiest way to collect every web click is to configure your web servers to use syslog (see http://archive.oreilly.com/pub/a/sysadmin/2006/10/12/httpd-syslog.html) and configure your syslog server to send data to Kafka (see https://www.balabit.com/documents/syslog-ng-ose-latest-guides/en/syslog-ng-ose-guide-admin/html/configuring-destinations-kafka.html). Alternatively, there are some more advanced features available in this Kafka connector for syslog-ng (see https://github.com/jcustenborder/kafka-connect-syslog). You can also write httpd logs to a file and use the Kafka FileStream connector to publish them to Kafka (see https://docs.confluent.io/current/connect/connect-filestream/filestream_connector.html).
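Not the FileStream connector itself, but as a minimal stand-in for the "write logs to a file and publish them" idea, a few lines of Python that follow an access log and produce each new line to Kafka (paths and topic name are assumptions):

```python
# Follow an httpd access log (like tail -f) and produce each new line to Kafka.
import time
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

with open("/var/log/httpd/access_log", "r") as log:
    log.seek(0, 2)                      # start at the end of the file
    while True:
        line = log.readline()
        if not line:
            time.sleep(0.5)             # wait for new log lines
            continue
        producer.send("httpd-access-logs", line.strip().encode("utf-8"))
```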
If you just want to enable your apps to send certain log data to Kafka directly, you can use the Kafka REST Proxy and publish with a simple HTTP POST from either your client-side JavaScript or your server-side logic (see https://docs.confluent.io/current/kafka-rest/docs/index.html).
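For example, with the REST Proxy's v2 API a click event can be published like this from server-side code (the proxy host/port, topic name, and event fields are assumptions):

```python
# Publish a click event through the Kafka REST Proxy (v2 API) with a plain HTTP POST.
import requests

event = {"user_id": "u-123", "action": "clicked_signup", "ts": 1700000000}

resp = requests.post(
    "http://rest-proxy:8082/topics/click-events",
    headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
    json={"records": [{"value": event}]},
)
resp.raise_for_status()
```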