Closed 19 days ago.
I have to implement a solution that processes a large amount of data by applying business rules. The input and the output will each be a file.
I haven't used Kafka before, and I am wondering whether I can use Kafka Streams to apply these rules, or whether I should use Spring Batch combined with Kafka Streams.
Are there any other frameworks/technologies in Java that could be used for this?
Thank you
Kafka Streams is a stream processing solution; what you're talking about is more of a batch workload. The difficulties you will encounter using KStreams are:
Kafka Streams doesn't have a good way of working with files as input and output.
In Stream Processing, there's no real concept of "beginning" and "end," whereas I gather from the nature of your question that you do indeed have a beginning and end in your use-case.
As such, I would recommend a batch solution instead, such as the Spring Batch option you already mention.
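To make that concrete, here is a minimal sketch of the file-in/file-out case with Spring Batch (assuming Spring Batch 4.x; the file paths and the rule inside the processor are placeholders for your actual business rules):

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.FlatFileItemWriter;
import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
import org.springframework.batch.item.file.builder.FlatFileItemWriterBuilder;
import org.springframework.batch.item.file.mapping.PassThroughLineMapper;
import org.springframework.batch.item.file.transform.PassThroughLineAggregator;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.FileSystemResource;

@Configuration
@EnableBatchProcessing
public class RuleBatchJobConfig {

    // Reads the input file line by line (hypothetical path).
    @Bean
    public FlatFileItemReader<String> reader() {
        return new FlatFileItemReaderBuilder<String>()
                .name("inputReader")
                .resource(new FileSystemResource("input.txt"))
                .lineMapper(new PassThroughLineMapper())
                .build();
    }

    // Stand-in for the business rules: transform or filter each record here.
    @Bean
    public ItemProcessor<String, String> ruleProcessor() {
        return line -> line.toUpperCase();
    }

    // Writes the processed records to the output file (hypothetical path).
    @Bean
    public FlatFileItemWriter<String> writer() {
        return new FlatFileItemWriterBuilder<String>()
                .name("outputWriter")
                .resource(new FileSystemResource("output.txt"))
                .lineAggregator(new PassThroughLineAggregator<>())
                .build();
    }

    @Bean
    public Step processFileStep(StepBuilderFactory steps) {
        return steps.get("processFileStep")
                .<String, String>chunk(1000)   // records per transaction
                .reader(reader())
                .processor(ruleProcessor())
                .writer(writer())
                .build();
    }

    @Bean
    public Job processFileJob(JobBuilderFactory jobs, Step processFileStep) {
        return jobs.get("processFileJob")
                .start(processFileStep)
                .build();
    }
}
```

The reader/processor/writer split maps naturally onto "read a record, apply the rules, write the result", and Spring Batch gives you restartability and chunked transactions for free, which Kafka Streams would not give you for a file-to-file job.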
Related
Closed 1 year ago.
I have a bunch of scripts collecting data from internet and local services, writing them to disk, scripts transforming the data and writing it into a database, scripts reading data from the database and generating new data, etc, written in bash, Python, SQL, ... (Linux).
Apart from a few time-triggered scripts, the glue between the scripts is currently me, running the scripts now and then in a particular order to update everything.
What is the simplest way to replace me by a tool that observes dependencies and triggers the next step as soon as the preconditions are met?
I've found many ETL and data warehousing tools, but these seem too heavyweight for my simple setup. I'd prefer a CLI solution with text-based configuration (ideally one that can visualise the dependency graph). Any suggestions?
Try Airflow: airflow.apache.org
Closed 1 year ago.
I want to learn about the differences between the two approaches. I developed a project that aggregates some data using the Apache Kafka Streams API. After that, I came across some solutions written with KSQL.
I have no experience with KSQL, so I would like to learn when each approach should be chosen for this kind of aggregation. Could I use KSQL instead of Kafka Streams?
There's a blog post somewhere that talks about the "Kafka abstraction funnel".
KSQL doesn't provide as much flexibility as Kafka Streams, which in turn abstracts many details of the core consumer/producer API.
If your team is more familiar with SQL than with the client libraries, use KSQL. If you run into a feature not supported by KSQL (think custom, complex data types), or you need to embed the streaming logic into a larger application without remotely querying the ksqlDB REST API, use Kafka Streams.
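To make the comparison concrete, here is a rough sketch of a simple count aggregation with Kafka Streams (topic names and the application id are made up), with a roughly equivalent KSQL statement shown in a comment:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class OrderCountApp {
    public static void main(String[] args) {
        // Roughly equivalent KSQL:
        //   CREATE TABLE order_counts AS
        //   SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id;
        StreamsBuilder builder = new StreamsBuilder();

        // Hypothetical input topic keyed by customer id.
        KStream<String, String> orders = builder.stream("orders");
        KTable<String, Long> counts = orders
                .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                .count();
        counts.toStream()
              .to("order-counts", Produced.with(Serdes.String(), Serdes.Long()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-count-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        new KafkaStreams(builder.build(), props).start();
    }
}
```

The KSQL version is a single statement you submit to a ksqlDB server; the Kafka Streams version is a JVM application you deploy and operate yourself, which is exactly the trade-off described above.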
Closed 3 years ago.
How easy is it to switch from RabbitMQ to Kafka in an existing solution, i.e. to replace one implementation (RabbitMQ) with the other (Kafka)? We are about to use RabbitMQ in our implementation, but we want to know whether it would be possible to replace it with Kafka in the future.
It is possible, and I've seen people do it - but it is a big project.
Not only are the APIs different, but the semantics are different as well. You will need to rethink your data model, scaling model, error handling, and so on. And then there's testing.
If you don't have tons of code to update, the code is localized, and you have both RabbitMQ and Kafka experts on the team, you may be able to get it done in a month or two.
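To give a feel for how different the two client APIs are, here is a rough side-by-side sketch of publishing one message with each client (host, exchange, topic, and key names are made up):

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PublishComparison {
    public static void main(String[] args) throws Exception {
        // RabbitMQ: publish to an exchange with a routing key; the broker
        // routes the message to queues and deletes it once acknowledged.
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {
            channel.basicPublish("events-exchange", "order.created", null,
                    "hello".getBytes(StandardCharsets.UTF_8));
        }

        // Kafka: publish to a topic; the key determines the partition, and
        // consumers track their own offsets in an append-only log.
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", "order-42", "hello"));
        }
    }
}
```

The publish calls look superficially similar, but the concepts behind them (exchanges and queues versus topics, partitions, and offsets) do not map one-to-one, which is where most of the migration effort goes.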
Closed 3 years ago.
I want to debug some Kafka topics so I know if the consumer or producer is at fault here.
Is there a UI for Kafka where I can see what messages a topic contains?
A dumper would also be nice so I can search for stuff on my own.
We use Landoop's Kafka Topics UI, which is pretty good. You can see topic contents and information (e.g. number of partitions, configuration, etc.) and also export topic contents.
I'll second Yoni Gibb's suggestion of the Landoop product. I also use it in development and find it very useful, although you may need to tweak a few settings around timeouts and sizes in order to see all messages. It's easy to install: just pull the Docker image.
Kafkacat is useful too, but it's not quite as good at monitoring many topics at once or being left running.
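If you also want a simple dumper you can grep through yourself, a minimal Java consumer that prints a topic from the beginning is an option alongside the UI tools (the topic name, broker address, and group id below are placeholders; kafka-console-consumer from the Kafka distribution does much the same from the command line):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class TopicDumper {
    public static void main(String[] args) {
        // Hypothetical default topic; pass the real one as the first argument.
        String topic = args.length > 0 ? args[0] : "my-topic";

        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "topic-dumper");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // dump from the beginning
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList(topic));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```

This assumes string keys and values; if your messages are Avro or JSON you would swap in the matching deserializers.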
Closed 6 years ago.
I am working with Spark.
I want the Spark application to be a long-running application that doesn't exit after it finishes its computation, but instead listens for HTTP requests and returns the computed data.
How can I do this out of the box? Right now, I can only write a while loop to keep the program running.
Spark doesn't have such a feature out of the box. Spark Streaming does, via the awaitTermination() method that you call on the StreamingContext. You then just need to implement an HTTP endpoint in your Spark application.
Using the Spark Streaming functionality would be the easiest: you can still leave your Spark jobs working with regular RDDs rather than DStreams, and use the StreamingContext just for awaitTermination().
If you don't want to use Spark Streaming, you can still have a look at how it is implemented internally with locks, in ContextWaiter#waitForStopOrError().
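As a rough sketch of the overall shape, here is one way to keep the driver alive and serve the computed value over HTTP. It uses the JDK's built-in HttpServer and a simple blocking join instead of a web framework or Spark Streaming's awaitTermination(); the port, master URL, and the computation itself are made-up placeholders:

```java
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import com.sun.net.httpserver.HttpServer;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class LongRunningSparkApp {
    public static void main(String[] args) throws Exception {
        // local[*] is only for trying this out locally.
        SparkConf conf = new SparkConf().setAppName("long-running-app").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Run the Spark computation once and keep the result on the driver.
        int sum = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5)).reduce(Integer::sum);

        // Minimal HTTP endpoint on the driver returning the computed value.
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/result", exchange -> {
            byte[] body = String.valueOf(sum).getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();

        // Block the main thread so the driver (and its SparkContext) stays alive,
        // similar in effect to StreamingContext.awaitTermination().
        Thread.currentThread().join();
    }
}
```

You could also trigger new Spark computations from inside the HTTP handler, since the SparkContext stays available for as long as the driver process is running.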