I have a question regarding the tiered storage feature in Kafka. I like this feature since it means, in my case, that I can use Kafka as infinite storage (with a GCS backend, for example). However, let's suppose that, for whatever reason, the Kafka cluster gets deleted and the Kafka data is lost.
Is the data in the GCS/S3 store still useful?
I mean, can I plug the old logs into a new Kafka cluster, or are they totally useless now (terabytes of logs)?
By the way, I know I can analyse the segments in the GCS/S3 store and extract the data, but that's a bit hacky, which is why I'm trying to see if there is a cleaner solution.
As of right now, if the cluster, or specifically the topic that has tiered storage enabled, gets deleted, the data in GCS/S3 will not be "reloaded" when you connect the bucket to another cluster.
If you want to keep the data that is in GCS/S3, you will need to stream it to a new topic that does not have tiered storage enabled, or use Kafka Connect to independently write it out to a usable format, before deleting the original.
We do plan on improving this use case in the future.
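A minimal sketch of the "stream to a new topic" approach, using the plain Java consumer and producer. The bootstrap address and the topic names events-tiered and events-archive are placeholders, and offset/retention tuning is omitted; a Connect sink or MirrorMaker 2 could do the same copy depending on the target format you need:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class TieredTopicCopy {
    public static void main(String[] args) {
        // Placeholder addresses and topic names -- adjust for your cluster.
        String bootstrap = "localhost:9092";
        String tieredTopic = "events-tiered";   // topic with tiered storage enabled
        String plainTopic = "events-archive";   // topic without tiered storage

        Properties cProps = new Properties();
        cProps.put("bootstrap.servers", bootstrap);
        cProps.put("group.id", "tiered-topic-copy");
        cProps.put("auto.offset.reset", "earliest"); // start from the oldest (tiered) segments
        cProps.put("key.deserializer", ByteArrayDeserializer.class.getName());
        cProps.put("value.deserializer", ByteArrayDeserializer.class.getName());

        Properties pProps = new Properties();
        pProps.put("bootstrap.servers", bootstrap);
        pProps.put("key.serializer", ByteArraySerializer.class.getName());
        pProps.put("value.serializer", ByteArraySerializer.class.getName());

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(cProps);
             KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(pProps)) {
            consumer.subscribe(List.of(tieredTopic));
            int emptyPolls = 0;
            // Stop after a few empty polls, i.e. once the copy has caught up.
            while (emptyPolls < 5) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(2));
                if (records.isEmpty()) {
                    emptyPolls++;
                    continue;
                }
                emptyPolls = 0;
                for (ConsumerRecord<byte[], byte[]> r : records) {
                    // Re-publish each record to the non-tiered topic, preserving the key.
                    producer.send(new ProducerRecord<>(plainTopic, r.key(), r.value()));
                }
                consumer.commitSync();
            }
            producer.flush();
        }
    }
}
```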
I would like to transfer data from one database system to other database systems. Which messaging system (Kafka, ActiveMQ, RabbitMQ, and the like) would be better for achieving this with high throughput and performance?
I guess the answer to this type of question is "it depends".
You can probably find a lot of information on the internet comparing these message brokers.
As far as I can share from our experience and knowledge, Kafka and its ecosystem tools, such as Kafka Connect, provide the behaviour you are asking for: source connectors and sink connectors with Kafka in the middle.
Kafka Connect is a framework that allows adding plugins called connectors:
Sink connectors read from Kafka and send that data to a target system.
Source connectors read from a source store and write to Kafka.
Using Kafka Connect is "no code": you call a REST API to set the configuration of the connectors.
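For example, a source connector can be registered with a single HTTP call to the Connect REST API. The sketch below assumes a Connect worker at localhost:8083 and that the Confluent JDBC source connector plugin (io.confluent.connect.jdbc.JdbcSourceConnector) is installed on the worker; the connector name, database connection settings, and table names are placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterJdbcSource {
    public static void main(String[] args) throws Exception {
        // Placeholder Connect worker URL.
        String connectUrl = "http://localhost:8083/connectors";

        // Configuration for a hypothetical JDBC source connector reading the "orders" table.
        String body = """
            {
              "name": "orders-source",
              "config": {
                "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
                "connection.url": "jdbc:postgresql://db-host:5432/shop",
                "connection.user": "replicator",
                "connection.password": "secret",
                "mode": "incrementing",
                "incrementing.column.name": "id",
                "table.whitelist": "orders",
                "topic.prefix": "db-"
              }
            }
            """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(connectUrl))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

A sink connector for the target database is registered the same way, just with a different connector.class and configuration.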
Kafka is a distributed system that supports very high throughput with low latency, and it supports near-real-time streaming of data.
Kafka is widely adopted by the biggest companies around the world.
There are many tools and vendors that support your use case; they vary in price and support. It depends on which sources you need to take data from, which targets you wish to write to, and whether it should be CDC/near real time or a "batch" copy.
I'm very naive about data engineering, but it seems to me that a popular data pipeline used to be Kafka to Storm to something.... As I understand it, Kafka now has data processing capabilities that may often render Storm unnecessary. So my question is simply: in what scenarios might it be true that Kafka can do it all, and in what scenarios might Storm still be useful?
EDIT:
The question was flagged as "opinion-based".
This question tries to understand what capabilities Apache Storm offers that Apache Kafka Streams does not (now that Kafka Streams exists). The accepted answer touches on that. No opinions are requested by this question, nor are they necessary to address it. The question title has been edited to be more objective.
You still need to deploy your processing code somewhere, e.g. to YARN if you are using Storm.
Plus, Kafka Streams can only process data between topics in the same Kafka cluster; Storm has spouts and bolts for other systems. Kafka Connect is one alternative to that, though.
Kafka has no external dependency on a cluster scheduler, and while you can deploy Kafka clients in almost any popular programming language, they still require external instrumentation, whether that's a Docker container or a bare-metal deployment.
If anything, I'd say Heron or Flink are truer comparative replacements for Storm.
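To illustrate the "same Kafka cluster" point: a Kafka Streams application is just a regular JVM program that reads from one topic and writes to another in that cluster, so there is no separate processing cluster to schedule onto. A minimal sketch, with a placeholder bootstrap address and topic names:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class UppercaseApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Read from one topic, transform each value, and write to another topic
        // in the *same* cluster -- no YARN or other scheduler involved.
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("raw-events"); // placeholder topic
        input.mapValues(value -> value.toUpperCase())
             .to("processed-events", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```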
I have to decide on a NoSQL database for a web app that should keep track of user input and update the corresponding record as frequently as possible. To give an idea of the frequency: let's say a blank record is generated on start, and it should be updated on every key-up event coming from the user.
The methods I have seen for this kind of work are:
Write-ahead logging/journaling for the user data (not the internal data-consistency mechanisms like MongoDB's journaling or CouchDB's write-ahead logging): I don't know whether this is even implemented for user data, or whether the existing mechanisms can be utilized for this purpose.
Versioning for MongoDB, or a less implicit, cell-versioning way of doing it in Cassandra.
I tended towards Cassandra at the beginning, but I want to know the best-fit methods to achieve that kind of scenario.
In Cassandra, frequent updates to a cell can (but do not have to) lead to problems with compaction (to be more specific, when updated data is flushed from memtables to SSTables because of too many concurrent updates).
If you do not need this data persisted, an in-memory solution (or one used in addition to a database) could help; I have used Hazelcast for this.
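As a rough sketch of that in-memory approach (Hazelcast, with a placeholder map name and a hypothetical DraftStore wrapper): every key-up event simply overwrites the user's current entry in a distributed map, and a MapStore could later persist it to a database if durability is required.

```java
// Hazelcast 4+/5 import path; in 3.x, IMap lives in com.hazelcast.core.
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

public class DraftStore {
    private final IMap<String, String> drafts;

    public DraftStore(HazelcastInstance hz) {
        // A distributed, in-memory map; entries can optionally be persisted
        // to a backing database via a MapStore if durability is needed.
        this.drafts = hz.getMap("user-drafts"); // placeholder map name
    }

    // Called on every key-up event: overwrite the user's current draft.
    public void onKeyUp(String userId, String currentText) {
        drafts.put(userId, currentText);
    }

    public String currentDraft(String userId) {
        return drafts.getOrDefault(userId, "");
    }

    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        DraftStore store = new DraftStore(hz);
        store.onKeyUp("user-42", "hello wor");
        store.onKeyUp("user-42", "hello world");
        System.out.println(store.currentDraft("user-42")); // prints "hello world"
        hz.shutdown();
    }
}
```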
Among the different options for a write journal with which to implement event sourcing, Kafka seems, from the outside, a very reasonable choice:
It has a great ecosystem
It is well documented
It naturally supports streaming and listeners
However, looking into Akka Persistence, it appears that the Kafka journal is supported only through a community-contributed package, which has not been modified for the last two years. Is Kafka not a good option, are there better options, and if it is the best option, how are people using it with akka-persistence?
The problems with using Kafka as the event journal for akka-persistence (the lack of atomic writes) are mentioned in this comment, which also lists them as a reason the plugin hasn't been maintained:
https://github.com/krasserm/akka-persistence-kafka/issues/28#issuecomment-138933868
In this thread, however, there is evidence that people are working on forks that work with the latest Kafka and Akka versions:
https://github.com/krasserm/akka-persistence-kafka/issues/20
You should have a look right here.
It's a pull request against the fork maintained right here.
This version uses Kafka 1.0 and the new producer API with transactions. We try to respect the Akka Persistence specification as closely as possible. We are staying with Kafka because, for us, it is the best solution for event sourcing.
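For context on what "the new producer API with transactions" refers to: the Kafka transactional producer lets a batch of records be written atomically, which speaks to the atomic-writes concern raised in the issue above. The sketch below is a generic illustration of that API with a placeholder broker address, topic, and IDs, not the plugin's actual implementation:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalJournalWrite {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder
        props.put("transactional.id", "journal-writer-1");  // must be unique per producer
        props.put("enable.idempotence", "true");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            try {
                // All events in one batch are committed atomically, or not at all.
                producer.send(new ProducerRecord<>("journal-topic", "persistenceId-1", "event-1"));
                producer.send(new ProducerRecord<>("journal-topic", "persistenceId-1", "event-2"));
                producer.commitTransaction();
            } catch (RuntimeException e) {
                producer.abortTransaction();
                throw e;
            }
        }
    }
}
```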
I just started learning about IoT and data streaming. Apologies if this question seems too obvious or generic.
I am working on a school project which involves streaming data from hundreds (maybe thousands) of IoT sensors, storing that data in a database, and then retrieving it for display in a web-based UI.
Things to note are:
fault-tolerance and the ability to accept incomplete data entries
the database has to have the ability to load and query data by stream
I've looked around on Google for some ideas on how to build an architecture that can support these requirements. Here's what I have in mind:
Sensor data is collected by FluentD and converted into a stream
Apache Spark manages a cluster of MongoDB servers
a. the MongoDB servers are connected to the same storage
b. Spark will handle fault-tolerance and load balancing between MongoDB servers
BigQuery will be used for handling queries from UI/web application.
My current idea of an IoT streaming architecture is the one outlined above.
The question now is whether this architecture is feasible, or whether it would work at all. I'm open to any ideas and suggestions.
Thanks in advance!
Note that you could stream your device data directly into BigQuery and avoid an intermediate buffering step.
See:
https://cloud.google.com/bigquery/streaming-data-into-bigquery
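A minimal sketch of such a streaming insert with the Java BigQuery client, using application-default credentials; the dataset, table, and schema fields (sensor_id, ts, value) are assumed placeholders:

```java
import java.util.HashMap;
import java.util.Map;

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;

public class SensorStreamer {
    public static void main(String[] args) {
        // Uses application-default credentials; dataset, table, and schema are placeholders.
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        TableId table = TableId.of("iot_dataset", "sensor_readings");

        // One sensor reading as a row for the streaming-insert API.
        Map<String, Object> row = new HashMap<>();
        row.put("sensor_id", "sensor-001");
        row.put("ts", "2020-01-01T12:00:00Z");
        row.put("value", 23.4);

        InsertAllRequest request = InsertAllRequest.newBuilder(table)
                .addRow(row)
                .build();

        InsertAllResponse response = bigquery.insertAll(request);
        if (response.hasErrors()) {
            // Errors are reported per row, so a partially bad batch can be retried
            // without dropping every reading.
            response.getInsertErrors().forEach((rowIndex, errors) ->
                    System.err.println("row " + rowIndex + ": " + errors));
        }
    }
}
```

Because insert errors are reported per row, this also fits the requirement of tolerating incomplete data entries: bad rows can be logged or retried individually.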