How to implement short-lived queues for ETL - apache-kafka

I am looking for a suggestion on how we can implement short-lived queues (topics) to perform an ETL job; after the ETL is completed, the queue (topic) and its data are not needed anymore.
Here is the scenario: when a particular job runs, it executes a query to extract data from a database (assume Teradata) and loads it into a topic. Then a Spark job is kicked off, processes all the records in that topic, and stops. After that, the topic and the data in it are not needed anymore.
For this I see Kafka and Redis Streams as two options. Redis Streams looks like the more appropriate tool because of how easy it is to create and destroy streams; with Kafka it seems I would need additional custom handling to create and drop topics (sketched below), and I also don't want to overload Kafka with too many topics.
I am open and happy to hear from you if there is another, better solution out there.
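For context, the kind of custom handling I mean on the Kafka side would look roughly like the following sketch using the AdminClient (the broker address, topic name, and partition/replication sizing are made up for illustration, not part of the original setup):

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class ShortLivedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumed broker address; adjust for your cluster
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Create the staging topic before the extract step
            admin.createTopics(Collections.singleton(
                    new NewTopic("etl-staging-job-42", 6, (short) 1))).all().get();

            // ... run the Teradata extract and the Spark job against the topic ...

            // Drop the topic once the Spark job has finished
            admin.deleteTopics(Collections.singleton("etl-staging-job-42")).all().get();
        }
    }
}
```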

Related

Streaming CDC changes with Kafka and Spark still processes them in batches, whereas we wish to process each record

I'm still new to Spark and I want to learn more about it. I want to build a data pipeline architecture with Kafka and Spark. Here is my proposed architecture, where PostgreSQL provides the data for Kafka. The condition is that the PostgreSQL database is not empty, and I want to capture any CDC change in the database. At the end, I want to grab the Kafka messages and process them as a stream with Spark, so I can get analysis about what is happening at the same time the CDC events happen.
However, when I try to run a simple stream, it seems that Spark receives the data as a stream but processes it in batches, which is not my goal. I have seen some articles where the data source for this kind of case is an API that we want to monitor, and there are few examples of database-to-database stream processing. I have done this before from Kafka to another database, but now I need to transform and aggregate the data (I'm not using Confluent; I rely on generic Kafka + Debezium + JDBC connectors).
For my case, can Spark and Kafka meet the requirement? Thank you.
I have designed such pipelines, and if you use Spark Structured Streaming with Kafka in continuous or non-continuous mode, you will always get a micro-batch. You can process the individual records within it, so I'm not sure what the issue is.
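For illustration, here is a minimal Java sketch of a Structured Streaming read from Kafka where each record of the micro-batch is handed to per-record logic via foreach (the broker address, topic name, and processing body are placeholders, not from the original post):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.ForeachWriter;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class CdcStream {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("cdc-stream").getOrCreate();

        // Read the CDC events from Kafka as a stream
        Dataset<Row> df = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
                .option("subscribe", "cdc.public.orders")            // assumed topic
                .load();

        // Spark delivers micro-batches, but foreach() invokes process()
        // once per individual record.
        StreamingQuery query = df.selectExpr("CAST(value AS STRING)")
                .as(Encoders.STRING())
                .writeStream()
                .foreach(new ForeachWriter<String>() {
                    @Override public boolean open(long partitionId, long epochId) { return true; }
                    @Override public void process(String json) {
                        // per-record transformation/analysis goes here
                    }
                    @Override public void close(Throwable errorOrNull) { }
                })
                .start();

        query.awaitTermination();
    }
}
```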
If you want to process per record, then use a Spring Boot Kafka setup for consuming Kafka messages; that can work in various ways and fulfill your need. Spring Boot offers various modes of consumption.
Of course, Spark Structured Streaming can be done using Scala and has a lot of support, obviating extra work elsewhere.
This article discusses the single-message processing approach: https://medium.com/@contactsunny/simple-apache-kafka-producer-and-consumer-using-spring-boot-41be672f4e2b
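As a rough sketch of that single-message approach with Spring Boot and Spring Kafka (the class, topic, and group names are made up, and Spring Boot's auto-configuration of the listener container is assumed):

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class CdcRecordListener {

    // Spring Kafka calls this method once per message, so each CDC event
    // can be transformed/analysed individually as it arrives.
    @KafkaListener(topics = "cdc.public.orders", groupId = "cdc-analytics")
    public void onEvent(ConsumerRecord<String, String> record) {
        String json = record.value();
        // per-record processing goes here
    }
}
```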

Apache NiFi & Kafka Integration

I am not sure if this question has already been addressed somewhere, but I couldn't find a helpful answer anywhere on the internet.
I am trying to integrate Apache NiFi with Kafka, i.e. consuming data from Kafka using Apache NiFi. Below are a few questions that come to my mind before proceeding with this.
Q-1) The use case that we have is to read data from Kafka in real time, parse the data, do some basic validations on it, and later push the data to HBase. I know Apache NiFi is the right candidate for doing this kind of processing, but how easy is it to build the workflow if the JSON that we are processing is a complex one? We were initially thinking of doing the same using Java code, but later realised this can be done with minimal effort in NiFi. Please note that 80% of the data we are processing from Kafka will be simple JSONs, but 20% will be complex ones (involving arrays).
Q-2) The trickiest part of writing a Kafka consumer is handling the offsets properly. How will Apache NiFi handle offsets while consuming from Kafka topics? How will offsets be committed properly in case a rebalance is triggered in the middle of processing? Frameworks like Spring Kafka provide options to commit the offsets (to some extent) in that situation. How does NiFi handle this?
I have deployed a number of pipelines on a 3-node NiFi cluster in production, one of which is similar to your use case.
Q-1) It's very simple and easy to build a pipeline for your use case. Since you didn't mention the types of tasks involved in processing a JSON, I'm assuming generic tasks. Generic tasks involving JSONs include schema validation, which can be achieved using the ValidateRecord processor; transformation using the JoltTransformRecord processor; extraction of attribute values using EvaluateJsonPath; conversion of JSON to some other format, say Avro, using the ConvertJSONToAvro processor; etc.
NiFi gives you the flexibility to scale each stage/processor of the pipeline independently. For example, if the transformation using JoltTransformRecord is time-consuming, you can scale it to run N concurrent tasks on each node by configuring Concurrent Tasks under the Scheduling tab.
Q-2) As far as the ConsumeKafka_2_0 processor is concerned, offset management is handled by committing the NiFi processor session first and then the Kafka offsets, which means we have an at-least-once guarantee by default.
When Kafka triggers a rebalancing of consumers for a given partition, the processor quickly commits whatever it has got (processor session and Kafka offsets) and returns the consumer to the pool for reuse.
ConsumeKafka_2_0 handles committing offsets when the members of the consumer group change or the subscription of the members changes. This can occur when processes die, new process instances are added, or old instances come back to life after a failure. It also takes care of cases where the number of partitions of the subscribed topic is administratively adjusted.

Beam / Cloud Dataflow: How to Add Kafka (or PubSub) topics to Running Stream

(How) is it possible to dynamically add or remove topics from a running pipeline as a source or sink (Kafka or PubSub)? Or to have a dynamic pattern as a sink, as is possible with BigQuery table names?
Some background: we have different topics, one per customer, to better facilitate downstream aggregations and to add and clean them up on the fly. Kafka is used to be able to backfill calculations over periods longer than what is possible with PubSub.
The options I have in mind right now are either extending KafkaIO to support this, or updating the pipeline each time a topic is added or removed (meaning there will be some lag in the stream while it's updated). Or maybe I have the wrong design pattern in my head and there are other solutions for this.
You are correct that right now the easiest solution is updating the pipeline.
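To make that concrete: the topic list is fixed when the KafkaIO read transform is constructed, which is why adding or removing a customer currently requires a pipeline update. A minimal sketch (the broker address and topic names are placeholders):

```java
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PerCustomerTopics {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // The topic list is baked into the transform at construction time,
        // so adding a customer means building and updating a new pipeline.
        p.apply(KafkaIO.<String, String>read()
                .withBootstrapServers("broker:9092")                    // assumed broker
                .withTopics(Arrays.asList("customer-a", "customer-b"))  // assumed topics
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .withoutMetadata());

        p.run();
    }
}
```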
However, a new API called Splittable DoFn (SDF) is currently in active development; it is already available in the Cloud Dataflow runner in streaming mode and in the Direct runner, and implementation is in progress in Flink and Apex runners.
It makes it possible to do things like "create a PCollection of Kafka topic names and read each of those topics", so you can have one pipeline stage produce names of topics to be read (e.g. the names themselves could arrive over Kafka or Pubsub every time a customer is added, or you could write an SDF to watch the result of a database query returning a list of customers and emit new ones), and another stage reading those topics.
See http://s.apache.org/splittable-do-fn for the design doc of the API, and http://s.apache.org/textio-sdf for an example of a proposed refactoring of TextIO using this API; you may want to try modifying KafkaIO yourself in a similar fashion.

Kafka Streaming Concurrency?

I have some basic Kafka Streaming code that reads records from one topic, does some processing, and outputs records to another topic.
How does Kafka streaming handle concurrency? Is everything run in a single thread? I don't see this mentioned in the documentation.
If it's single threaded, I would like options for multi-threaded processing to handle high volumes of data.
If it's multi-threaded, I need to understand how this works and how to handle resources, e.g. how SQL database connections should be shared across the different processing threads.
Is Kafka's built-in streaming API not recommended for high volume scenarios relative to other options (Spark, Akka, Samza, Storm, etc)?
Update Oct 2020: I wrote a four-part blog series on Kafka fundamentals that I'd recommend reading for questions like these. For this question in particular, take a look at part 3 on processing fundamentals.
To your question:
How does Kafka streaming handle concurrency? Is everything run in a single thread? I don't see this mentioned in the documentation.
This is documented in detail at http://docs.confluent.io/current/streams/architecture.html#parallelism-model. I don't want to copy-paste this here verbatim, but I want to highlight that IMHO the key element to understand is that of partitions (cf. Kafka's topic partitions, which in Kafka Streams is generalized to "stream partitions" as not all data streams that are being processed will be going through Kafka) because a partition is currently what determines the parallelism of both Kafka (the broker/server side) and of stream processing applications that use the Kafka Streams API (the client side).
If it's single threaded, I would like options for multi-threaded processing to handle high volumes of data.
Processing a partition will always be done by a single "thread" only, which ensures you are not running into concurrency issues. But, fortunately, ...
If it's multi-threaded, I need to understand how this works and how to handle resources, e.g. how SQL database connections should be shared across the different processing threads.
...because Kafka allows a topic to have many partitions, you still get parallel processing. For example, if a topic has 100 partitions, then up to 100 stream tasks (or, somewhat over-simplified: up to 100 different machines each running an instance of your application) may process that topic in parallel. Again, every stream task would get exclusive access to 1 partition, which it would then process.
Is Kafka's built-in streaming API not recommended for high volume scenarios relative to other options (Spark, Akka, Samza, Storm, etc)?
Kafka's stream processing engine is definitely recommended and also actually being used in practice for high-volume scenarios. Work on comparative benchmarking is still being done, but in many cases a Kafka Streams based application turns out to be faster. See LINE engineer's blog: Applying Kafka Streams for internal message delivery pipeline for an article by LINE Corp, one of the largest social platforms in Asia (220M+ users), where they describe how they are using Kafka and the Kafka Streams API in production to process millions of events per second.
The Kafka Streams config num.stream.threads allows you to override the default of one thread. However, it may be preferable to simply run multiple instances of your streaming app, all of them in the same consumer group. That way you can spin up as many instances as you need to get optimal partitioning.
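A minimal sketch of where that setting goes in a Kafka Streams application (the application id, broker address, and topic names are placeholders):

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class MyStreamsApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");    // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // Run 4 stream threads in this instance instead of the default 1;
        // total parallelism is still capped by the number of input partitions.
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("input-topic")    // placeholder topic
               .mapValues(value -> value.toUpperCase())  // stand-in for real processing
               .to("output-topic");                      // placeholder topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```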

Can someone please suggest the best way of doing log analysis using Spark Streaming

I am completely new to Big Data; for the last few weeks I have been trying to build a log analysis application.
I have read many articles, and I found that Kafka + Spark Streaming is the most reliable configuration.
Now, I am able to process data sent from my simple Kafka Java producer to Spark Streaming.
Can someone please suggest a few things, such as:
1) How can I read server logs in real time and pass them to a Kafka broker?
2) Are there any frameworks available to push data from logs to Kafka?
3) Any other suggestions?
Thanks,
Chowdary
There are many ways to collect logs and send them to Kafka. If you are looking to send log files as a stream of events, I would recommend reviewing Logstash/Filebeat: just set up your input as a file input and your output as Kafka.
You may also push data to Kafka using the log4j KafkaAppender, or pipe logs to Kafka using one of the many CLI tools already available.
In case you need to guarantee ordering, pay attention to the partition configuration and the partition selection logic. For example, the log4j appender will distribute messages across all partitions. Since Kafka guarantees ordering per partition only, your Spark Streaming jobs may start processing events out of sequence.
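To illustrate that last point, here is a hedged sketch of producing log lines keyed by host name (all names and values are placeholders), so that all events from a given host land in the same partition and keep their order:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class LogShipper {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String host = "web-01";                       // placeholder source host
            String logLine = "GET /index.html 200 12ms";  // placeholder log line

            // Using the host as the record key routes every event from that host
            // to the same partition, so Kafka preserves its ordering and the
            // downstream Spark job sees each host's events in sequence.
            producer.send(new ProducerRecord<>("server-logs", host, logLine));
        }
    }
}
```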