I'm interested in what happens when Logstash fails to send events to its output destination (for example, to a Kafka topic).
Will the event be lost or not? If so, how can I prevent losses?
Sounds like you might be interested in Persistent Queues.
I'd like to highlight one section of it, which seems extreme:
In order to protect against data loss during abnormal termination, Logstash has a persistent queue feature which will store the message queue on disk. Persistent queues provide durability of data within Logstash.
Persistent queues are also useful for Logstash deployments that need large buffers. Instead of deploying and managing a message broker, such as Redis, RabbitMQ, or Apache Kafka, to facilitate a buffered publish-subscriber model, you can enable persistent queues to buffer events on disk and remove the message broker.
But note that once the queue is full, Logstash exerts back-pressure on its inputs rather than accepting more events, so upstream sources that cannot buffer or retry (UDP, for example) may still lose messages.
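For reference, a minimal logstash.yml sketch enabling the persistent queue; the size and path values here are illustrative assumptions, not recommendations:

```yaml
# logstash.yml -- persistent queue sketch; values are examples only
queue.type: persisted                  # default is "memory"
queue.max_bytes: 4gb                   # on-disk capacity before back-pressure kicks in
path.queue: /var/lib/logstash/queue    # hypothetical location for the queue files
```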
The alternative is to remove Logstash from the picture and use the Kafka Connect framework to move data between external systems and Kafka, Filebeat/Fluent Bit for watching local file changes, and Metricbeat/Telegraf for system-metric monitoring.
My question is rather specific, so I will be OK with a general answer that points me in the right direction.
Description of the problem:
I want to deliver specific task data from multiple producers to a particular consumer working on the task (both are Docker containers running in Kubernetes). The relation is many-to-many: any producer can create a data packet for any consumer. Each consumer is processing ~10 streams of data at any given moment, while each data stream consists of 100 messages of 160 bytes per second (from different producers).
Current solution:
In our current solution, each producer keeps a cache of task: (IP:PORT) pairs for the consumers and uses UDP packets to send the data directly. It is nicely scalable but rather messy in deployment.
Question:
Could this be realized in the form of a message queue of sorts (Kafka, Redis, RabbitMQ...)? E.g., having a channel for each task where producers send data while the consumer, well, consumes it? How many streams would be feasible for the MQ to handle? (I know it would differ; suggest your best estimate.)
Edit: Would 1,000 streams, which equal 100,000 messages per second, be feasible? (The throughput for 1,000 streams is 16 MB/s.)
Edit 2: Fixed packet size to 160 bytes (typo).
Unless you need disk persistence, do not even look in the message broker direction. You would just be adding one problem to another. Direct network code is a proper way to solve audio broadcasting. Now, if your code is messy and you want a simplified programming model, a good alternative to sockets is the ZeroMQ library. It will give you all the message-broker functionality you care about, namely a) discrete messaging instead of streams and b) client discoverability, without going overboard with another software layer.
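If you go the ZeroMQ route, here is a minimal PUB/SUB sketch using the JeroMQ Java binding; the endpoint, topic name, and in-process pairing of both sockets are assumptions for illustration:

```java
import org.zeromq.SocketType;
import org.zeromq.ZContext;
import org.zeromq.ZMQ;

public class ZmqBroadcastSketch {
    public static void main(String[] args) throws InterruptedException {
        try (ZContext ctx = new ZContext()) {
            // Producer side: publish discrete messages tagged with a per-task topic.
            ZMQ.Socket pub = ctx.createSocket(SocketType.PUB);
            pub.bind("tcp://*:5556");                      // hypothetical endpoint

            // Consumer side: subscribe only to the tasks this worker handles.
            ZMQ.Socket sub = ctx.createSocket(SocketType.SUB);
            sub.connect("tcp://localhost:5556");
            sub.subscribe("task-42".getBytes(ZMQ.CHARSET));
            Thread.sleep(100);                             // let the subscription propagate

            pub.sendMore("task-42");                       // topic frame
            pub.send(new byte[160]);                       // 160-byte payload frame

            String topic = sub.recvStr();                  // "task-42"
            byte[] payload = sub.recv();
            System.out.println(topic + ": " + payload.length + " bytes");
        }
    }
}
```

This covers exactly the two features mentioned above: discrete messages rather than streams, and topic-based discovery of who should receive what.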
When it comes to "feasible": 100,000 messages per second at 160 kb per message is a lot of data; it comes to 16 Gb/sec even without any messaging protocol on top of it. In general, Kafka shines at message throughput with small messages, as it batches messages on many layers. Knowing this, the sustained performance of Kafka is usually constrained by disk speed, as Kafka is intentionally written this way (the slowest component is the disk). However, your messages are very large, and you need to both write and read messages at the same time, so I don't see it happening without a large cluster installation, as your problem is actual data throughput, not the number of messages.
Because you are data-limited, even other classic MQ software like ActiveMQ, IBM MQ, etc. is actually able to cope very well with your situation. In general, classic brokers are much more "chatty" than Kafka and cannot hit Kafka's message throughput when handling small messages. But as long as you are using large non-persistent messages (and a proper broker configuration), you can expect decent performance in MB/sec from them too. Classic brokers will, with proper configuration, directly connect a producer's socket to a consumer's socket without hitting a disk. In contrast, Kafka always persists to disk first. So they even have some latency advantages over Kafka.
However, this direct socket-to-socket "optimisation" is just a full-circle turn back to the start of this answer. Unless you need audio-stream persistence, all you are doing with a broker in the middle is finding an indirect way of binding producing sockets to consuming ones and then sending discrete messages over this connection. If that is all you need, ZeroMQ is made for this.
There is also a messaging protocol called MQTT, which may be of interest if you choose to pursue a broker solution, as it is meant to be an extremely scalable solution with low overhead.
A basic approach
From the Kafka perspective, each stream in your problem can map to one topic, so there would be one producer-consumer pair per topic.
Con: If you have lots of streams, you will end up with lots of topics, and IMO the solution can get messy here too, as you keep increasing the number of topics.
An alternative approach
Alternatively, the best way is to map multiple streams to one topic, where each stream is separated by a key (like the IP:Port combination you use), and then have multiple consumers, each subscribing to a specific set of partitions as determined by the key (see the sketch below). Partitions are the unit of scalability in Kafka.
Con: Though you can increase the number of partitions, you cannot decrease them.
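A hedged sketch of such keyed producing; the topic name, broker address, and IP:Port key format are illustrative assumptions:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class StreamProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumption
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            String streamKey = "10.0.0.7:9000";  // one key per stream (hypothetical)
            byte[] payload = new byte[160];      // the ~160-byte data packet
            // Records with the same key hash to the same partition,
            // so each stream stays ordered within its partition.
            producer.send(new ProducerRecord<>("task-data", streamKey, payload));
        }
    }
}
```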
Type of data matters
If your streams are heterogeneous, in the sense that it would not be apt for all of them to share a common topic, you can create more topics.
Usually, topics are determined by the data they host and/or what their consumers do with the data in them. If all of your consumers do the same thing, i.e., have the same processing logic, it is reasonable to go for one topic with multiple partitions.
Some points to consider:
Unlike in your current solution (I suppose), a message doesn't get lost once it is received and processed; rather, it stays in the topic until the configured retention period expires.
Take proper care in determining the keying strategy, i.e., which messages land in which partitions. As said earlier, if all of your consumers do the same thing, all of them can be placed in one consumer group to share the workload.
Consumers belonging to the same group perform a common task and will subscribe to a set of partitions determined by the partition assignor. Each consumer will then get a set of keys; in other words, a set of streams or, in terms of your current solution, a set of one or more IP:Port pairs.
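A minimal sketch of such a group member, assuming the same hypothetical "task-data" topic and a group named "task-workers":

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class StreamConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumption
        props.put("group.id", "task-workers");             // shared group = shared workload
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("task-data"));
            while (true) {
                // The partition assignor hands each group member a subset of
                // the partitions, i.e. a subset of the stream keys.
                ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, byte[]> record : records) {
                    System.out.printf("stream %s: %d bytes%n", record.key(), record.value().length);
                }
            }
        }
    }
}
```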
Lately I've been looking into real-time data processing using Storm, Flink, etc.
All the architectures I have come across use Kafka as a layer between the data sources and the stream processor. Why should this layer exist?
I think there are three main reasons to use Apache Kafka for real-time processing:
Distribution
Performance
Reliability
In real-time processing, there is a requirement for fast and reliable delivery of data from the data sources to the stream processor. If you do not do this well, it can easily become the bottleneck of your real-time processing system. This is where Kafka can help.
Traditional messaging systems such as ActiveMQ and RabbitMQ were not particularly good at handling huge amounts of data in real time. For that reason, LinkedIn engineers developed their own messaging system, Apache Kafka, to cope with this issue.
Distribution: Kafka is natively distributed, which fits the distributed nature of stream processing. Kafka divides incoming data into partitions, ordered by offset, which are physically distributed over the cluster. These partitions can then feed the stream processor in a distributed manner.
Performance:
Kafka was designed to be simple, sacrificing advanced features for the sake of performance. Kafka outperforms traditional messaging systems by a big margin, as can also be seen in this paper. The main reasons are listed below, with a configuration sketch after the list:
The Kafka producer does not wait for acknowledgments from the broker and sends data as fast as the broker can handle.
Kafka has a more efficient storage format with less metadata.
The Kafka broker is stateless: it does not need to keep track of the state of its consumers.
Kafka exploits the UNIX sendfile API to deliver data efficiently from a broker to a consumer by reducing the number of data copies and system calls.
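Several of these trade-offs surface as ordinary producer settings. A hedged configuration sketch favoring throughput over delivery guarantees (the broker address and all values are illustrative assumptions):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import java.util.Properties;

public class FastProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        // acks=0 is the fire-and-forget mode described above: the producer
        // does not wait for broker acknowledgments. acks=1 or acks=all trade
        // throughput back for stronger durability.
        props.put(ProducerConfig.ACKS_CONFIG, "0");
        // Batch many small messages per request instead of sending one at a time.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        // ... send records here ...
        producer.close();
    }
}
```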
Reliability: Kafka serves as a buffer between the data sources and the stream processor to handle big loads of data. Kafka simply stores all the incoming data, and the consumers are responsible for deciding how much and how fast they want to process it. This ensures reliable load balancing, so that the stream processor is not overwhelmed by too much data.
Kafka's retention policy also allows easy recovery from failures during processing (Kafka retains all data for 7 days by default). Each consumer keeps track of the offset of the last processed message, so if a consumer fails, it is easy to roll back to the point right before the failure and start processing again without losing information or needing to reprocess the whole stream from the beginning.
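A minimal sketch of that rollback, assuming a hypothetical "events" topic and an offset you persisted yourself:

```java
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import java.util.Collections;
import java.util.Properties;

public class RewindSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumption
        props.put("group.id", "recovering-processor");     // hypothetical group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        long lastProcessedOffset = 41_999L;                // loaded from your own store
        TopicPartition tp = new TopicPartition("events", 0);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp));
            // Resume right after the last record known to be fully processed.
            consumer.seek(tp, lastProcessedOffset + 1);
            // ... poll() from here as usual ...
        }
    }
}
```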
How is Logstash different from Kafka?
And if both are the same, which is better? And how?
I found that both are pipelines into which one can push data for further processing.
Kafka is much more powerful than Logstash. For syncing data from a source such as PostgreSQL to Elasticsearch, Kafka connectors can do similar work to Logstash.
One key difference is that Kafka is a cluster, while Logstash is basically a single instance. You can run multiple Logstash instances, but these instances are not aware of each other. For example, if one instance goes down, the others will not take over its work. Kafka handles a node going down automatically, and if you set up Kafka connectors to work in distributed mode, the other connectors can take over the work of the failed one.
Kafka and Logstash can also work together. For example, you can run a Logstash instance on every node to collect logs and send them to Kafka, and then write Kafka consumer code to do any handling you want.
Logstash is a tool that can be used to collect, process, and forward events and log messages. Collection is accomplished through a number of input plugins. You can use Kafka as an input plugin, in which case Logstash will read events from a Kafka topic. Once an input plugin has collected data, it can be processed by any number of filters, which modify and annotate the event data. Finally, events are routed to output plugins, which can forward the events to a variety of external programs, including Elasticsearch.
Kafka, on the other hand, is messaging software that persists messages, has a TTL (retention period), and has the notion of consumers that pull data out of Kafka. Some of its uses include:
Stream Processing
Website Activity Tracking
Metrics Collection and Monitoring
Log Aggregation
So, simply put, both of them have their own advantages and disadvantages; it depends solely on your requirements.
In addition, I want to add something through two scenarios:
Scenario 1: Event Spikes
The app you deployed has a bad bug where information is logged excessively, flooding your logging infrastructure. This spike or a burst of data is fairly common in other multi-tenant use cases as well, for example, in the gaming and e-commerce industries. A message broker like Kafka is used in this scenario to protect Logstash and Elasticsearch from this surge.
Scenario 2: Elasticsearch not reachable
If Elasticsearch is not reachable, you have a number of data sources streaming into it, and you can't afford to stop the original data sources, a message broker like Kafka can be of help here! If you use the Logstash shipper-and-indexer architecture with Kafka, you can continue to stream your data from the edge nodes and hold it temporarily in Kafka. As and when Elasticsearch comes back up, Logstash will continue where it left off and help you catch up on the backlog of data.
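A hedged sketch of that shipper-and-indexer layout as two Logstash pipeline configs; all paths, hosts, and the "logs" topic are assumptions:

```
# shipper.conf -- runs on the edge node, ships into Kafka
input  { file  { path => "/var/log/app/*.log" } }           # hypothetical path
output { kafka { bootstrap_servers => "localhost:9092"      # assumption
                 topic_id => "logs" } }

# indexer.conf -- drains Kafka into Elasticsearch at its own pace
input  { kafka { bootstrap_servers => "localhost:9092"
                 topics => ["logs"] } }
output { elasticsearch { hosts => ["localhost:9200"] } }    # assumption
```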
The whole blog post about the use cases of Logstash and Kafka is here.
I am working on an application that processes very few records in a minute. The request rate would be around 2 calls per minute. These requests are creates and updates made to a set of data. The requirements are a delivery guarantee, reliable delivery, an ordering guarantee, and the prevention of any message loss.
Our team has decided to use Kafka, and I think it does not fit the use case, since Kafka is best suited for streaming data. Instead, we might have been better off with a traditional messaging model. Though Kafka does provide ordering per partition, the same can be achieved on a traditional messaging system if the number of messages is low and the number of data sources is also low. Would that be a fair statement?
We are using Kafka Streams for processing the data, and the processing requires lookups to external systems. If the external systems are not available, we stop processing and automatically deliver the messages to the target systems once the external lookup systems become available again.
At the moment, we stop processing by continuously looping in the middle of processing and checking whether the systems are available.
a) Is that the best way to stop a stream midway through processing so that it doesn't pick up any more messages?
b) Are data stream frameworks even designed to be stopped or paused midway so that they stop consuming the stream completely for some time?
Regarding your point 2:
a) Is that the best way to stop a stream midway through processing so that it doesn't pick up any more messages?
If, as in your case, you have a very low incoming data rate (a few records per minute), then it might be OK to pause processing of an input stream when the required dependency systems are currently unavailable.
In Kafka Streams, the preferable API to implement such behavior -- which, as you are alluding to yourself, is not really a recommended pattern -- is the Processor API.
Even so, there are a couple of important questions you need to answer for yourself, such as:
What is the desired/required behavior of your stream processing application if the external systems are down for extended periods of time?
Could the incoming data rate increase at some point, which could mean that you would need to abandon the pausing approach above?
But again, if pausing is what you want or need to do, then you can give it a try.
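A minimal sketch of such a pausing processor built on the Processor API; the availability check and lookup are hypothetical placeholders, and note that blocking inside process() must stay well below max.poll.interval.ms, or the task's underlying consumer will be evicted from its group:

```java
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;

// Blocks the stream task while the external system is down, which in effect
// pauses consumption of this task's input partitions.
public class PausingLookupProcessor implements Processor<String, String, String, String> {
    private ProcessorContext<String, String> context;

    @Override
    public void init(ProcessorContext<String, String> context) {
        this.context = context;
    }

    @Override
    public void process(Record<String, String> record) {
        // Wait until the dependency is reachable again.
        while (!externalSystemAvailable()) {
            try {
                Thread.sleep(5_000L);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
        context.forward(record.withValue(lookup(record.value())));
    }

    // Hypothetical placeholders for your health check and external lookup.
    private boolean externalSystemAvailable() { return true; }
    private String lookup(String value) { return value; }
}
```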
b) Are data stream frameworks even designed to be stopped or paused midway so that they stop consuming the stream completely for some time?
Some stream processing tools allow you to do that. Whether it's the best pattern to use is a different question.
For instance, you could also consider the following alternative: You could automatically ingest the external systems' data into Kafka, too, for example via Kafka's built-in Kafka Connect framework. Then, in Kafka Streams, you could read this exported data into a KTable (think of this KTable as a continuously updated cache of the latest data from your external system), and then perform a stream-table join between your original, low-rate input stream and this KTable. Such stream-table joins are a common (and recommended) pattern to enrich an incoming data stream with side data (disclaimer: I wrote this article); for example, to enrich a stream of user click events with the latest user profile information. One of the advantages of this approach -- compared to your current setup of querying external systems combined with a pausing behavior -- is that your stream processing application would be decoupled from the availability (and scalability) of your external systems.
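A hedged sketch of that stream-table join; the application id, broker address, and all topic names are assumptions:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import java.util.Properties;

public class StreamTableJoinSketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // The low-rate input stream, keyed by the lookup key.
        KStream<String, String> events = builder.stream("input-events");
        // The external system's data, exported into Kafka (e.g. via Kafka
        // Connect) and read as a continuously updated table.
        KTable<String, String> sideData = builder.table("external-system-data");

        // Enrich each event with the latest side data for its key.
        events.join(sideData, (event, side) -> event + " | " + side)
              .to("enriched-events");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "enrichment-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        new KafkaStreams(builder.build(), props).start();
    }
}
```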
That is only a fair statement for traditional message brokers when there is a single consumer (i.e., an exclusive queue). As soon as the queue is shared by more than one consumer, there is the possibility of out-of-order delivery of messages. This is because any one consumer might fail to process and ACK a message, resulting in that message being put back at the head of the shared queue and subsequently delivered (out of order) to another consumer. Kafka guarantees in-order parallel consumption across multiple consumers by using topic partitions (which are not present in traditional message brokers).
I have been reading a lot of articles that explain implementations of Apache Storm for ingesting data from either Apache Flume or Apache Kafka. My main question remains unanswered after reading several of them: what is the main benefit of using Apache Kafka or Apache Flume? Why not collect data from a source directly into Apache Storm?
To understand this, I looked into these frameworks. Correct me if I am wrong.
Apache Flume is about collecting data from a source and pushing it to a sink, the sink in this case being Apache Storm.
Apache Kafka is about collecting data from a source and storing it in a message queue until Apache Storm processes it.
I am assuming you are dealing with the use case of continuous computation algorithms or real-time analytics.
Given below is what you will have to take care of if you DO NOT use Kafka or another message queue:
(1) You will have to implement functionality like data consistency yourself.
(2) You will have to implement replication on your own.
(3) You will have to tackle a variety of failures and build a fault-tolerant system.
(4) You will need a good design so that your producer and consumer are completely decoupled.
(5) You will have to implement persistence. What happens if your consumer fails?
(6) What about fault resilience? Do you want to take the entire system down when your consumer fails?
(7) You will have to implement delivery guarantees as well as ordering guarantees.
All of the above are inherent features of a message queue (Kafka, etc.), and you will of course not want to reinvent the wheel here.
I think the reason for having these different configurations could be a matter of how the source data is obtained. Storm spouts (the first elements in a Storm topology) are meant to synchronously poll for the data, while Flume agents (agent = source + channel + sink) are meant to asynchronously receive the data at the source. Thus, if you have a system that notifies you of certain events, then a Flume agent is required; this agent would be in charge of receiving the data and putting it into any queue-management system (ActiveMQ, RabbitMQ...) in order to be consumed by Storm. The same would apply to Kafka.