Is it possible to use Flume Kafka Source by itself? - apache-kafka

Let's say there are a bunch of producers writing Avro records (that all have the same schema) into a Kafka topic.
Can I use the Flume Kafka Source to read those records and write them to HDFS, even though the records were not published using a Flume sink?

Yes, you can. In general, what a Kafka consumer can or cannot do is fully independent of who produced the data; it depends only on the format in which the data has been encoded.
That is the whole point of Kafka and of an enterprise service bus in general.
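For example, here is a minimal sketch of a plain Java consumer reading such records. It assumes the writer schema is available locally as record.avsc, the topic is called events, and the producers wrote plain Avro binary without any Schema Registry framing; all of those names are placeholders, not details from the question.

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.DatumReader;
    import org.apache.avro.io.Decoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.ByteArrayDeserializer;

    import java.io.File;
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class AvroTopicReader {
        public static void main(String[] args) throws Exception {
            // The consumer only needs to know how the bytes were encoded,
            // not which producer (Flume or otherwise) wrote them.
            Schema schema = new Schema.Parser().parse(new File("record.avsc")); // hypothetical schema file
            DatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);

            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "avro-reader");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("events")); // hypothetical topic name
                while (true) {
                    for (ConsumerRecord<byte[], byte[]> rec : consumer.poll(Duration.ofSeconds(1))) {
                        // Decode each value with the known schema, regardless of who produced it.
                        Decoder decoder = DecoderFactory.get().binaryDecoder(rec.value(), null);
                        GenericRecord value = reader.read(null, decoder);
                        System.out.println(value);
                    }
                }
            }
        }
    }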

Related

Kafka Streams vs Flink

I have used Flink for sending data from a source to a sink.
My Flink app consumes data from Kafka and sends it to the destination.
The destination is also a Kafka topic, just with a different topic name.
Flink is used only for delivery, without any business logic.
In this case, I think that switching from Flink to Kafka Streams will increase throughput, because Flink contributes nothing except delivering data from source to sink. Also, since both my source and sink are Kafka, I think Kafka Streams will be faster at delivering the data.
I would appreciate any opinion on my question.
Thanks.
There's no guarantee one will be faster than the other. You still need to do JVM and network tuning.
Either will work, but the limitation of Kafka Streams is that the data must remain in the same Kafka cluster. Flink has no such limitation.
Or you can simply use MirrorMaker for moving data between Kafka topics of different clusters.
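For the pure delivery case, the Kafka Streams side is only a few lines. A minimal sketch, assuming placeholder topic names source-topic and destination-topic in the same cluster:

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;

    import java.util.Properties;

    public class PassThroughApp {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "pass-through");      // hypothetical app id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.ByteArray().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.ByteArray().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            // No business logic: just forward every record to the destination topic.
            builder.stream("source-topic").to("destination-topic");

            new KafkaStreams(builder.build(), props).start();
        }
    }

Whether this outperforms the equivalent Flink job still comes down to the tuning mentioned above, not the framework choice itself.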

How to monitor 'bad' messages written to a Kafka topic with no schema

I use Kafka Connect to take data from RabbitMQ into a Kafka topic. The data comes without a schema, so in order to associate a schema I use a ksql stream. On top of the stream I create a new topic that now has a defined schema. At the end I take the data into a BigQuery database. My question is: how do I monitor messages that have not passed the stream stage? Doing it this way, do I support schema evolution? And if not, how can I use the Schema Registry functionality?
Thanks
use Kafka Connect to take data ... data comes without schema
I'm not familiar with the RabbitMQ connector specifically, but if you use the Confluent converter classes that do use schemas, then the data would have one, although maybe only a string or bytes schema.
If ksql is consuming the non-schema topic, then there's a consumer group associated with that process. You can monitor its lag to know how many messages have not yet been processed by ksql. If ksql is unable to parse a message because it's "bad", then I assume it's either skipped or the stream stops consuming completely; this is likely configurable
If you've set the output topic format to Avro, for example, then the schema will automatically be registered to the Registry. There will be no evolution until you modify the fields of the stream
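To make the lag-monitoring suggestion concrete, here is a rough sketch using the Kafka AdminClient. The group id shown is a placeholder; the actual consumer group of the ksql query has to be looked up first (for example via admin.listConsumerGroups()).

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.ListOffsetsResult;
    import org.apache.kafka.clients.admin.OffsetSpec;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    import java.util.Map;
    import java.util.Properties;
    import java.util.stream.Collectors;

    public class KsqlLagMonitor {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // Placeholder group id; look up the real one for your ksql query.
                String groupId = "_confluent-ksql-default_query_CSAS_EXAMPLE_0";

                // Offsets the ksql query has committed so far.
                Map<TopicPartition, OffsetAndMetadata> committed =
                        admin.listConsumerGroupOffsets(groupId)
                             .partitionsToOffsetAndMetadata().get();

                // Latest offsets of the same partitions.
                Map<TopicPartition, OffsetSpec> latestSpec = committed.keySet().stream()
                        .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
                Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                        admin.listOffsets(latestSpec).all().get();

                // Lag = latest offset - committed offset, per partition.
                committed.forEach((tp, meta) ->
                        System.out.printf("%s lag=%d%n", tp, latest.get(tp).offset() - meta.offset()));
            }
        }
    }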

Where to run the processing code in Kafka?

I am trying to set up a data pipeline using Kafka.
Data goes in (with producers), gets processed, enriched, and cleaned, and moves out to different databases or storage (with consumers or Kafka Connect).
But where do you run the actual pipeline processing code to enrich and clean the data? Should it be part of the producers or the consumers? I think I missed something.
In the use case of a data pipeline the Kafka clients could serve both as a consumer and producer.
For example, if you have raw data being streamed into ClientA where it is being cleaned before being passed to ClientB for enrichment then ClientA is serving as a consumer (listening to a topic for raw data) and a producer (publishing cleaned data to a topic).
Where you draw those boundaries is a separate question.
It can be part of either producer or consumer.
Or you could set up an environment dedicated to something like Kafka Streams processes or a KSQL cluster.
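For example, such a dedicated Kafka Streams process that consumes the raw topic, cleans it, and republishes it could be as small as this sketch (topic names and the cleaning step are placeholders):

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Produced;

    import java.util.Properties;

    public class CleaningPipeline {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "cleaning-pipeline"); // hypothetical app id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            StreamsBuilder builder = new StreamsBuilder();
            // Consume the raw topic, apply the cleaning step, publish to the cleaned topic.
            builder.stream("raw-events", Consumed.with(Serdes.String(), Serdes.String()))
                   .mapValues(CleaningPipeline::clean)
                   .to("clean-events", Produced.with(Serdes.String(), Serdes.String()));

            new KafkaStreams(builder.build(), props).start();
        }

        // Placeholder for whatever cleaning/enrichment logic the pipeline needs.
        private static String clean(String rawValue) {
            return rawValue.trim();
        }
    }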
It is possible either way. Consider all the options and choose the one which suits you best. Let's assume you have a source of raw data, in CSV or some DB (Oracle), and you want to do your ETL and load it into some other datastores.
1) Use Kafka Connect to produce your data to Kafka topics.
Have a consumer which would consume off of these topics (could be Kafka Streams, KSQL, Akka, or Spark).
Produce back to a Kafka topic for further use, or to some datastore, basically any sink.
This has the benefit of ingesting your data with little or no code, as it is easy to set up Kafka Connect source connectors.
2) Write custom producers and do your transformations in the producers before writing to a Kafka topic, or directly to a sink, unless you want to reuse the produced data for some further processing (see the producer sketch after this answer).
Read from the Kafka topic, do some further processing, and write it back to a persistent store.
It all boils down to your design choices, the throughput you need from the system, and how complicated your data structures are.
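For option 2, a rough sketch of such a custom producer that reads a CSV file, applies a transformation, and writes to a topic (the file name, topic name, and transformation are all made up for the example):

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Properties;

    public class CsvToKafkaProducer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                for (String line : Files.readAllLines(Paths.get("input.csv"))) {    // hypothetical source file
                    String transformed = line.toUpperCase();                        // stand-in for real ETL logic
                    producer.send(new ProducerRecord<>("etl-output", transformed)); // hypothetical topic
                }
            }
        }
    }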

What should I use: Kafka Streams, the Kafka consumer API, or Kafka Connect

I would like to know what would be best for me: Kafka Streams, the Kafka consumer API, or Kafka Connect?
I want to read data from a topic, then do some processing, and write it to a database. So I have written consumers, but I feel I could write a Kafka Streams application instead and use its stateful processors to perform any changes and write to the database, which would eliminate my consumer code and leave only the DB code to write.
The databases I want to insert my records into are:
HDFS - (insert raw JSON)
MSSQL - (processed JSON)
Another option is Kafka Connect, but I have found there is no JSON support as of now for the HDFS sink and JDBC sink connectors (I don't want to write Avro), and creating a schema is also a pain for complex nested messages.
Or should I write a custom Kafka Connect connector to do this?
So I need your opinion on whether I should write a Kafka consumer, a Kafka Streams application, or use Kafka Connect.
And which will be better in terms of performance and have less overhead?
You can use a combination of them all.
I have tried the HDFS sink for JSON but was not able to use org.apache.kafka.connect.json.JsonConverter.
It's not clear why not, but I would assume you forgot to set schemas.enable=false.
When I set org.apache.kafka.connect.storage.StringConverter it works, but it writes the JSON object in string-escaped format. For example, {"name":"hello"} is written into HDFS as "{\"name\":\"hello\"}".
Yes, it will string-escape the JSON.
The processing I want to do is basic validation and a few field value transformations.
Kafka Streams or the Consumer API is capable of validation. Connect is capable of Single Message Transforms (SMTs).
In some use cases, you need to "duplicate" data onto Kafka: process your "raw" topic by reading it with a consumer, then produce it back into a "cleaned" topic, from which you can use Kafka Connect to write to a database or filesystem.
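A rough sketch of that raw-to-cleaned pattern using the plain clients (topic names and the validation rule are made up for the example):

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.kafka.common.serialization.StringSerializer;

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class RawToCleaned {
        public static void main(String[] args) {
            Properties cProps = new Properties();
            cProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            cProps.put(ConsumerConfig.GROUP_ID_CONFIG, "raw-cleaner");
            cProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            cProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            Properties pProps = new Properties();
            pProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            pProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            pProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
                 KafkaProducer<String, String> producer = new KafkaProducer<>(pProps)) {
                consumer.subscribe(Collections.singletonList("raw"));            // hypothetical topic
                while (true) {
                    for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
                        // Stand-in validation rule: only keep values that look like JSON objects.
                        if (rec.value() != null && rec.value().startsWith("{")) {
                            producer.send(new ProducerRecord<>("cleaned", rec.key(), rec.value()));
                        }
                    }
                }
            }
        }
    }

Kafka Connect (HDFS or JDBC sink) can then consume the "cleaned" topic on its own.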
Welcome to Stack Overflow! Please take the tour: https://stackoverflow.com/tour
Please ask a precise question rather than asking for opinions; this keeps the site clear, and opinions are not answers (they are subject to each person's preferences). Asking "How to use Kafka Connect with JSON" or similar would fit this site.
Also, please show some research.
The least overhead would be the plain Kafka consumer. Kafka Streams and Kafka Connect use the Kafka consumer underneath, so you will always be able to achieve less overhead with it alone, but you will also lose all of their benefits (fault tolerance, ease of use, support, etc.).
First, it depends on what your processing is. Aggregation? Counting? Validation? You can use Kafka Streams to do the processing and write the result to a new topic, in the format you want.
Then you can use Kafka Connect to send the data to your database. You are not forced to use Avro; you can use another format for the key/value. See:
Kafka Connect HDFS Sink for JSON format using JsonConverter
Kafka Connect not outputting JSON

Kafka Consumer vs Apache Flink

I did a PoC in which I read data from Kafka using Spark Streaming. But our organization uses either Apache Flink or the Kafka consumer to read data from Apache Kafka as a standard process, so I need to replace Spark Streaming with the Kafka consumer or Apache Flink. In my use case, I need to read data from Kafka, filter the JSON data, and put fields into Cassandra, so the recommendation is to use the Kafka consumer rather than Flink or other streaming frameworks, as I don't really need to do any processing on the JSON data. I need your help to understand the questions below:
1) Using the Kafka consumer, can I achieve the same continuous data reading as with Spark Streaming or Flink?
2) Is the Kafka consumer sufficient for me, considering I need to read data from Kafka, deserialize it using an Avro schema, filter fields, and put them into Cassandra?
3) A Kafka consumer application can be created using the Kafka consumer API, right?
4) Are there any downsides in my case if I just use the Kafka consumer instead of Apache Flink?
First, let's take a look at Flink's Kafka connector and Spark Streaming's Kafka integration: both of them use the Kafka consumer API (either the simple or the high-level API) internally to consume messages from Apache Kafka for their jobs.
So, regarding your questions:
1) Yes.
2) Yes. However, if you use Spark, you can consider using the Spark Cassandra Connector, which helps save data into Cassandra efficiently.
3) Right.
4) As mentioned above, Flink also uses the Kafka consumer for its jobs. Moreover, it is a distributed stream and batch data processing framework, so it helps process data efficiently after consuming from Kafka. In your case, to save data into Cassandra, you can consider using the Flink Cassandra Connector rather than coding it yourself.
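If you do go with the plain consumer, the whole application is roughly a poll loop. A sketch, assuming the producers used the Confluent Schema Registry wire format (hence KafkaAvroDeserializer); if they did not, you would deserialize with Avro's GenericDatumReader instead. The topic, field name, and the Cassandra write are placeholders:

    import io.confluent.kafka.serializers.KafkaAvroDeserializer;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class FilterToCassandra {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "filter-to-cassandra");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, KafkaAvroDeserializer.class.getName());
            props.put("schema.registry.url", "http://localhost:8081");       // assumption: Schema Registry in use

            try (KafkaConsumer<String, GenericRecord> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("input-topic")); // hypothetical topic
                while (true) {                                                // continuous reading, like a streaming job
                    for (ConsumerRecord<String, GenericRecord> rec : consumer.poll(Duration.ofMillis(500))) {
                        GenericRecord value = rec.value();
                        Object field = value.get("some_field");               // hypothetical field to filter on
                        if (field != null) {
                            // Insert the extracted fields into Cassandra here,
                            // e.g. with the DataStax Java driver (omitted).
                        }
                    }
                }
            }
        }
    }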