How is Logstash different from Kafka?
And if both do the same job, which is better, and why?
From what I have found, both are pipelines into which one can push data for further processing.
Kafka is much more powerful than Logstash. For syncing data from a source such as PostgreSQL to Elasticsearch, Kafka connectors can do work similar to what Logstash does.
One key difference: Kafka runs as a cluster, while Logstash is basically a single instance. You can run multiple Logstash instances, but they are not aware of each other; if one goes down, the others will not take over its work. Kafka handles a node failure automatically. And if you run Kafka Connect in distributed mode, the remaining workers can take over the connectors and tasks of a failed worker.
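The takeover behaviour can be pictured with a toy model. This is only a minimal sketch, not Kafka Connect's actual rebalance protocol: the worker names, the task names, and the round-robin `assign` function are all hypothetical.

```python
def assign(tasks, workers):
    """Spread tasks round-robin over the workers that are currently alive."""
    if not workers:
        raise ValueError("no live workers to run the tasks")
    assignment = {w: [] for w in workers}
    for i, task in enumerate(sorted(tasks)):
        assignment[workers[i % len(workers)]].append(task)
    return assignment


# Hypothetical task and worker names, for illustration only.
tasks = ["pg-sync-0", "pg-sync-1", "pg-sync-2", "pg-sync-3"]
workers = ["worker-a", "worker-b", "worker-c"]

before = assign(tasks, workers)

# worker-b goes down; the survivors are handed all of the tasks again.
after = assign(tasks, [w for w in workers if w != "worker-b"])
```

With independent Logstash instances there is no equivalent of the second `assign` call: the work of the instance that went down simply stops.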
Kafka and Logstash can also work together. For example, run a Logstash instance on every node to collect logs and send them to Kafka. Then you can write Kafka consumer code to do whatever handling you want.
Logstash is a tool that can be used to collect, process, and forward events and log messages. Collection is accomplished through a number of input plugins. You can use Kafka as an input plugin, in which case Logstash reads events from a Kafka topic. Once an input plugin has collected data, it can be processed by any number of filters, which modify and annotate the event data. Finally, events are routed to output plugins, which can forward them to a variety of external programs, including Elasticsearch.
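The input, filter, output flow can be modelled in a few lines. This is a rough sketch of the idea only, not Logstash's plugin API; `run_pipeline`, `parse_level`, and `drop_debug` are hypothetical names, and the events are made up.

```python
def run_pipeline(events, filters, outputs):
    """Pass each event through every filter in order, then to every output.

    A filter returns a (possibly modified) event, or None to drop it.
    """
    for event in events:
        for apply_filter in filters:
            event = apply_filter(event)
            if event is None:
                break  # dropped by a filter, skip the outputs
        else:
            for send in outputs:
                send(event)


def parse_level(event):
    """Annotate the event with the log level found at the start of the message."""
    level, _, text = event["message"].partition(" ")
    return {**event, "level": level, "text": text}


def drop_debug(event):
    """Discard DEBUG events, keep everything else."""
    return None if event["level"] == "DEBUG" else event


collected = []  # stands in for an output such as Elasticsearch
run_pipeline(
    events=[{"message": "INFO started"}, {"message": "DEBUG noisy"}, {"message": "ERROR boom"}],
    filters=[parse_level, drop_debug],
    outputs=[collected.append],
)
```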
Kafka, on the other hand, is messaging software that persists messages for a configurable retention period (a TTL) and has the notion of consumers that pull data out of Kafka. Some of its uses are:
Stream Processing
Website Activity Tracking
Metrics Collection and Monitoring
Log Aggregation
So, simply put, both have their own advantages and disadvantages, and the choice depends entirely on your requirements.
In addition, I want to add a few things through scenarios:
Scenario 1: Event Spikes
The app you deployed has a bad bug where information is logged excessively, flooding your logging infrastructure. This spike or a burst of data is fairly common in other multi-tenant use cases as well, for example, in the gaming and e-commerce industries. A message broker like Kafka is used in this scenario to protect Logstash and Elasticsearch from this surge.
Scenario 2: Elasticsearch not reachable
If you have a number of data sources streaming into Elasticsearch and you can't afford to stop them when Elasticsearch is unreachable, a message broker like Kafka can help. If you use the Logstash shipper and indexer architecture with Kafka, you can continue to stream your data from edge nodes and hold it temporarily in Kafka. When Elasticsearch comes back up, Logstash will continue where it left off and help you catch up on the backlog of data.
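The "continue where it left off" part comes down to the broker keeping the records and the reader keeping an offset. Here is a minimal in-memory sketch of that idea, assuming a single partition; `TopicLog` and `Indexer` are hypothetical stand-ins, not Kafka or Logstash classes.

```python
class TopicLog:
    """Append-only list standing in for one Kafka partition."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)


class Indexer:
    """Stand-in for the Logstash indexer: reads the log, writes to a sink."""

    def __init__(self, log, sink):
        self.log = log
        self.sink = sink
        self.offset = 0  # position of the next record to deliver

    def poll(self, sink_available):
        """Deliver everything past the offset, unless the sink is down."""
        if not sink_available:
            return 0  # nothing delivered, offset unchanged, records stay in the log
        delivered = 0
        while self.offset < len(self.log.records):
            self.sink.append(self.log.records[self.offset])
            self.offset += 1
            delivered += 1
        return delivered


log = TopicLog()
elasticsearch = []  # stands in for the real index
indexer = Indexer(log, elasticsearch)

log.append("e1")
first = indexer.poll(sink_available=True)

# Outage: the edge nodes keep producing, the indexer cannot deliver.
log.append("e2")
log.append("e3")
during_outage = indexer.poll(sink_available=False)

# Recovery: the backlog is delivered in order, starting at the saved offset.
log.append("e4")
after_recovery = indexer.poll(sink_available=True)
```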
The whole blog post about these Logstash and Kafka use cases is here.
Related
I have a Spring Boot application which consumes from a few Kafka topics for further processing and eventually pushes the results into a DB. It also has Kafka Streams which filter specific types of events from each topic and count how many of each specific type of event have been produced.
I also need to know how many of these events (of each specific type) have been consumed by my application. The Streams work perfectly for counting how many events are in a topic (that is, have been produced), but I need the same functionality for consumed messages too. How would I go about implementing something like that?
As you're using Spring Boot, I would recommend enabling Micrometer, which will automatically collect the client metrics and lets you integrate easily with your preferred monitoring infrastructure, as it supports many of the most widely used ones.
The count is a metric that is already there, so you can easily expose it and create a visualization chart with the count, calculate the rate, and so on, including getting insights per date range.
Spring kafka monitoring
Kafka client metrics.
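If a per-event-type breakdown is needed on the consuming side, one option is to increment a counter inside the consuming code itself. The sketch below shows only that idea in a language-neutral way; it is not Micrometer's API (which is Java), and `handle`, `consumed_by_type`, and the record layout are hypothetical.

```python
from collections import Counter

# One counter per event type, incremented after a record has been processed.
consumed_by_type = Counter()


def handle(record):
    """Process one consumed record, then count it by its event type."""
    # The real processing and DB write would go here.
    consumed_by_type[record["type"]] += 1


# Made-up records standing in for what the listener would receive.
for record in [{"type": "login"}, {"type": "purchase"}, {"type": "login"}]:
    handle(record)
```

Counting after the processing step means a record that fails midway is not reported as consumed.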
We have the Confluent Platform in our infrastructure. At its core, we use Kafka brokers to distribute events. Dozens of devices produce events to Kafka topics (there is one topic for each type of event), and the events are serialized with Google's protobuf. We use Confluent's Schema Registry to keep track of the protobuf schemas.
For several events, we need to apply some transformation and then publish the output to another Kafka topic. Of course, Kafka Streams is one way to accomplish that, like in this example. However, we don't want a Java application for each transformation (which increases the complexity of the project and the development/deployment effort), and it doesn't feel right to put all streams in one application (modifying one would require stopping and restarting all of them).
At this point, we thought that Confluent's Kafka Connect might be a better approach. We can have several workers, and we can deploy them into one Kafka Connect instance or cluster. The question is:
Does it make sense to use Kafka Connect to get messages from one Kafka topic and send them to another Kafka topic? I ask because all the use cases and examples aim to get data from outside (a database, a file, etc.) into Kafka, or from Kafka to outside.
To clarify, Kafka Connect is not "Confluent's", it's part of Apache Kafka.
While you could use MirrorMaker2/Confluent Replicator with transforms, it honestly wouldn't be much different than extracting the transformation logic into a shared library, then bundling a deployable Kafka Streams application that accepts configuration parameters for input and output topics with the transformation in-between.
You make a good point about a single point of administration, but that's also a single point of failure. If you use Connect, changing your transform plugin will also require you to stop and restart the Connect server. And if all topics are part of the same connector, then any task failure would stop some percentage of the topic transformations.
Kafka Streams (or KSQL) is preferred for inter-cluster translations, anyway
You could also look at solutions like Apache NiFi for more complex event management and routing.
In a microservices-based architecture, who writes to Kafka: the services themselves or the microservices' databases? I've been thinking about this and see pros and cons to both approaches, but I'm leaning towards having the database write to Kafka topics because:
Database and data in the Kafka topic won't go out of sync in case write to Kafka fails for whatever reason
Application teams won't have to have one more step to worry about
Applications can keep focusing on the core function rather than worrying about Kafka.
Thanks for your inputs
As cricket_007 has been saying, databases typically cannot write to Apache Kafka themselves; instead, you'd need a change data capture (CDC) service such as Debezium in order to stream data changes from the database into Kafka (disclaimer: I'm the lead of Debezium).
Such an approach lets you ensure (eventual) consistency between a service's own database and the Kafka messages sent to other services. One specific CDC application I'd recommend looking into is the outbox pattern. The idea is not to capture changes to the service's actual business tables, but instead to work with a separate "outbox table", into which the service writes specific messages meant for consumption by other services. CDC is then used to send these events from that table to Kafka.
This approach avoids exposing internal data structures to outside consumers while also avoiding the issues of "dual writes", which a service would suffer from when writing directly to both its database and Kafka. Debezium has built-in support for the outbox pattern via a message transformation that helps route the events from the outbox table into event-type-specific Kafka topics.
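To make the outbox idea concrete, here is a minimal sketch using SQLite. The table layout, `place_order`, and `relay_outbox` are assumptions made for illustration. In particular, `relay_outbox` polls the table and is only a stand-in for the CDC step; it is not how Debezium reads changes, and `publish` stands in for a Kafka producer.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        event_type TEXT NOT NULL,
        payload TEXT NOT NULL
    );
""")


def place_order(order_id, item):
    """Write the business row and the outgoing event in ONE transaction."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute("INSERT INTO orders (id, item) VALUES (?, ?)", (order_id, item))
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderPlaced", json.dumps({"order_id": order_id, "item": item})),
        )


def relay_outbox(publish):
    """Stand-in for CDC: forward outbox rows in insertion order, then clear them."""
    rows = conn.execute("SELECT id, event_type, payload FROM outbox ORDER BY id").fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, json.loads(payload))
        with conn:
            conn.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
    return len(rows)


published = []  # stands in for a Kafka topic
place_order(1, "book")
try:
    place_order(1, "duplicate id")  # fails, so neither row is written
except sqlite3.IntegrityError:
    pass

relayed = relay_outbox(lambda event_type, payload: published.append((event_type, payload)))
```

Because the service only ever writes to its own database, there is no moment at which the order exists but the event does not, or the other way round.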
Not all services need a database, they just emit data (logs, metrics, sensors, etc)
So, the answer would be either.
Plus, I'm not sure which databases can export directly to Kafka, so you'd have some other service like Debezium deployed, which would be polling those CDC records off the database
Application developers still have to "worry" about how to deserialize their data, how many partitions are in the topic so they can scale out consumption, manage offsets, among other things
I'm interested in what happens when Logstash fails to send events to an output destination (for example, to a Kafka topic).
Will the event be lost or not? If so, how can losses be prevented?
Sounds like you might be interested in Persistent Queues.
Would like to highlight one section of it, which seems extreme
In order to protect against data loss during abnormal termination, Logstash has a persistent queue feature which will store the message queue on disk. Persistent queues provide durability of data within Logstash.
Persistent queues are also useful for Logstash deployments that need large buffers. Instead of deploying and managing a message broker, such as Redis, RabbitMQ, or Apache Kafka, to facilitate a buffered publish-subscriber model, you can enable persistent queues to buffer events on disk and remove the message broker
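But once the queue is full, as far as I know Logstash applies back pressure to its inputs rather than dropping messages, so whether anything is lost depends on whether the sources can wait or retry.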
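What a buffer does at capacity is the key question. Below is a minimal sketch of a bounded buffer that refuses new events when it is full (back pressure) instead of silently discarding them. It is in-memory and purely illustrative, not Logstash's persistent queue implementation; `BoundedBuffer`, `offer`, and `drain` are hypothetical names.

```python
from collections import deque


class BoundedBuffer:
    """Fixed-capacity FIFO buffer that pushes back instead of dropping."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def offer(self, event):
        """Accept the event if there is room; otherwise refuse it."""
        if len(self.items) >= self.capacity:
            return False  # the caller must hold the event and retry later
        self.items.append(event)
        return True

    def drain(self, n):
        """Remove and return up to n events, oldest first (the output catching up)."""
        return [self.items.popleft() for _ in range(min(n, len(self.items)))]


buf = BoundedBuffer(capacity=2)
accepted = [e for e in ["a", "b", "c"] if buf.offer(e)]  # "c" is refused

drained = buf.drain(1)      # the output makes room
retried = buf.offer("c")    # the source retries the refused event
```

Nothing is lost here as long as the source keeps the refused event and retries; a source that cannot wait is where loss would occur.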
The alternative is to remove Logstash from the picture and use the Kafka Connect framework for moving data from external systems into Kafka, Filebeat/Fluent Bit for watching local file changes, and Metricbeat/Telegraf for system metric monitoring.
I am completely new to Big Data. For the last few weeks I have been trying to build a log analysis application.
I read many articles and found that Kafka + Spark Streaming is the most reliable configuration.
I am now able to process data sent from my simple Kafka Java producer in Spark Streaming.
Can someone please suggest a few things:
1) How can I read server logs in real time and pass them to a Kafka broker?
2) Are there any frameworks available to push data from logs to Kafka?
3) Any other suggestions?
Thanks,
Chowdary
There are many ways to collect logs and send them to Kafka. If you are looking to send log files as a stream of events, I would recommend looking at Logstash/Filebeat: just set up your input as a file input and your output as Kafka.
You may also push data to Kafka using log4j KafkaAppender or pipe logs to Kafka using many CLI tools already available.
In case you need to guarantee ordering, pay attention to the partition configuration and the partition selection logic. For example, the log4j appender will distribute messages across all partitions. Since Kafka guarantees ordering only within a partition, your Spark Streaming jobs may start processing events out of sequence.
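The usual way to keep related events in order is to give them the same key, so they always land in the same partition. The sketch below illustrates that with a simple hash; CRC32 is a stand-in chosen for illustration and is not the hash Kafka's own partitioner uses, and the host names and `route` function are hypothetical.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a key to a partition deterministically (CRC32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(events, num_partitions):
    """Append each (key, value) event to the partition chosen by its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in events:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


# Log lines keyed by the host that produced them; values are sequence numbers.
events = [("host-a", 1), ("host-b", 1), ("host-a", 2), ("host-b", 2), ("host-a", 3)]
partitions = route(events, num_partitions=3)
```

Every event for a given host ends up in one partition in its original order, whereas spreading events across partitions without a key gives no such guarantee between them.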