kafka + reading from topic log file - apache-kafka

I have a topic log file and the corresponding .index file. I would like to read the messages in a streaming fashion and process them. How and where should I start?
Should I load these files into a Kafka producer and read from the topic?
Can I directly write a consumer to read data from the file and process it?
I have gone through the Kafka website, and everywhere the examples use pre-built Kafka producers and consumers, so I couldn't find enough guidance.
I want to read in a streaming fashion in Java.
The text looks encrypted, so I am not posting the input files.
Any help is really appreciated.

You can dump the log segments and use the deep iteration option to deserialize the data into something more readable.
If you want to "stream it", use a standard Unix pipe to send that output to some other tool.
If you want to do aggregate operations, use Kafka Streams to actually read from the topic across all partitions, rather than from the single partition stored on that single broker.
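Since the question asks for Java, the pipe approach could end in a small Java program that reads the dump output from standard input. This is a minimal sketch, not part of the original answer. It assumes the dump was run with options that print record payloads and that each record line contains the marker " payload: " followed by the value. The exact line layout differs between Kafka versions, so the marker and the class name DumpPayloadReader are assumptions to adapt.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

public class DumpPayloadReader {
    // Assumed marker that precedes the record value in each dumped line.
    public static final String MARKER = " payload: ";

    // Returns the text after the marker, or empty if the line carries no payload.
    public static Optional<String> extractPayload(String line) {
        int idx = line.indexOf(MARKER);
        if (idx < 0) {
            return Optional.empty();
        }
        return Optional.of(line.substring(idx + MARKER.length()));
    }

    public static void main(String[] args) throws IOException {
        // Read the dump tool's output line by line as it arrives on stdin.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Print only the payloads; header and metadata lines are skipped.
                extractPayload(line).ifPresent(System.out::println);
            }
        }
    }
}
```

To use it, pipe the dump tool's standard output into `java DumpPayloadReader`. This only covers the one partition whose segment file you have, which is why reading from the topic itself is still the better route for aggregation.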

Related

Where to run the processing code in Kafka?

I am trying to set up a data pipeline using Kafka.
Data goes in (with producers), gets processed, enriched and cleaned, and moves out to different databases or storage (with consumers or Kafka Connect).
But where do you run the actual pipeline processing code to enrich and clean the data? Should it be part of the producers or the consumers? I think I missed something.
In the use case of a data pipeline the Kafka clients could serve both as a consumer and producer.
For example, if you have raw data being streamed into ClientA where it is being cleaned before being passed to ClientB for enrichment then ClientA is serving as a consumer (listening to a topic for raw data) and a producer (publishing cleaned data to a topic).
Where you draw those boundaries is a separate question.
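The dual role described above can be illustrated with a toy model. This is a sketch under loud assumptions: the in-memory queues stand in for Kafka topics, no Kafka client is involved, and the class name PipelineStage and the two transformations are made up for the example. It only shows that one stage is a consumer of its input and a producer of its output.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

public class PipelineStage {
    // Moves every record currently in the input "topic" to the output "topic",
    // applying the transformation on the way. Returns the number of records moved.
    public static int drain(BlockingQueue<String> in, BlockingQueue<String> out,
                            Function<String, String> transform) {
        int moved = 0;
        String record;
        while ((record = in.poll()) != null) {   // consumer role
            out.add(transform.apply(record));    // producer role
            moved++;
        }
        return moved;
    }

    public static void main(String[] args) {
        // Queues used here as stand-ins for three topics.
        BlockingQueue<String> raw = new LinkedBlockingQueue<>();
        BlockingQueue<String> cleaned = new LinkedBlockingQueue<>();
        BlockingQueue<String> enriched = new LinkedBlockingQueue<>();
        raw.add("  Alice ");
        raw.add("BOB");

        // ClientA: consumes raw data, produces cleaned data.
        drain(raw, cleaned, s -> s.trim().toLowerCase());
        // ClientB: consumes cleaned data, produces enriched data.
        drain(cleaned, enriched, s -> s + ",source=demo");

        System.out.println(enriched);
    }
}
```

In a real deployment each stage would be a separate process with its own consumer and producer, and the intermediate topic is what lets the stages scale and fail independently.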
It can be part of either producer or consumer.
Or you could set up an environment dedicated to something like Kafka Streams processes or a KSQL cluster.
It is possible either way. Consider all the options and choose the one that suits you best. Let's assume you have a source of raw data in CSV or some database (Oracle), and you want to do your ETL work and load the result into different datastores.
1) Use Kafka Connect to produce your data to Kafka topics.
Have a consumer that consumes from these topics (it could be Kafka Streams, KSQL, Akka, or Spark).
Produce back to a Kafka topic for further use, or to some datastore; any sink, basically.
This has the benefit of ingesting your data with little or no code, since Kafka Connect source connectors are easy to set up.
2) Write custom producers and do your transformations in the producers before writing to a Kafka topic, or write directly to a sink unless you want to reuse the produced data for further processing.
Read from the Kafka topic, do some further processing, and write the result to a persistent store.
It all boils down to your design choices, the throughput you need from the system, and how complicated your data structure is.

send event to specific kafka topic partition and read it by flume hive sink

I am using Kafka 0.10 and Flume 1.8. I am trying to find information on the points below but could not. Can anybody please help me?
Is there any way to send events to a particular Kafka topic partition?
If so, can we read such events (arriving at a specific partition) with Flume using the Hive sink?
I'm not sure I understand your motive, but I'm pretty sure you can create a Kafka topic with a single partition if you wish.
By doing this, you would know which partition and topic you were reading from. It is also possible to have multiple sources in Flume, so if you want a single service to read from multiple topics, with each topic having only a single partition, you can easily do this.
Apologies, I would have written this as a comment as it should really be a comment but I don't yet have that privilege in stackoverflow. Anyway, I hope this helps.

What should I use: Kafka Stream or Kafka consumer api or Kafka connect

I would like to know what would be best for me: Kafka Streams, the Kafka consumer API, or Kafka Connect?
I want to read data from a topic, do some processing, and write to a database. I have written consumers, but I feel I could write a Kafka Streams application and use its stateful processor to perform any changes and write the result to the database, which would eliminate my consumer code so that I only have to write the database code.
The databases I want to insert my records into are:
HDFS - (insert raw JSON)
MSSQL - (processed json)
Another option is Kafka Connect, but I have found there is no JSON support as of now for the HDFS sink and JDBC sink connectors. (I don't want to write Avro.) Creating a schema is also a pain for complex nested messages.
Or should I write a custom Kafka Connect connector to do this?
So I need your opinion on whether I should write a Kafka consumer, write a Kafka Streams application, or use Kafka Connect.
And which will be better in terms of performance and have less overhead?
You can use a combination of them all
I have tried HDFS sink for JSON but not able to use org.apache.kafka.connect.json.JsonConverter
It's not clear why not, but I would assume you forgot to set schemas.enable=false on the converter (for example, value.converter.schemas.enable=false).
when I set org.apache.kafka.connect.storage.StringConverter it works but it writes the json object in string escaped format. For eg. {"name":"hello"} is written into hdfs as "{\"name\":\"hello\"}"
Yes, it will string-escape the JSON
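The escaping follows from treating the whole message as one string value and then encoding that string as JSON. The sketch below reproduces the effect in plain Java. It is an illustration only, not Kafka Connect's code, and the class and method names are hypothetical.

```java
public class JsonStringEscape {
    // Encodes arbitrary text as a JSON string literal:
    // wraps it in quotes and escapes quotes, backslashes and control characters.
    public static String toJsonString(String text) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : text.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the \u00XX form.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) {
        // The message from the question, passed in as an opaque string.
        System.out.println(toJsonString("{\"name\":\"hello\"}"));
        // prints "{\"name\":\"hello\"}"
    }
}
```

A converter that understands the payload as JSON writes the object itself instead of a quoted string, which is why the choice of converter matters here.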
Processing I want to do is basic validation and few field values transformation
Kafka Streams or the Consumer API is capable of validation. Connect is capable of Simple Message Transforms (SMT).
In some use cases you need to "duplicate data" in Kafka: process your "raw" topic by reading it with a consumer, then produce the result into a "cleaned" topic, from which you can use Kafka Connect to write to a database or filesystem.
Welcome to Stack Overflow! Please take the tour: https://stackoverflow.com/tour
Please post precise questions rather than asking for opinions; this keeps the site clearer, and opinions are not answers (they depend on each person's preferences). A question like "How to use Kafka Connect with JSON" would fit this site.
Also, please show some research.
The option with the least overhead would be the Kafka consumer: Kafka Streams and Kafka Connect both use the Kafka consumer, so you can always get lower overhead with it, but you will also lose all the benefits (fault tolerance, ease of use, support, etc.).
First, it depends on what your processing is. Aggregation? Counting? Validation? You can then use Kafka Streams to do the processing and write the result to a new topic, in the format you want.
Then you can use Kafka Connect to send the data to your database. You are not forced to use Avro; you can use another format for the key/value, see:
Kafka Connect HDFS Sink for JSON format using JsonConverter
Kafka Connect not outputting JSON

How to use Kafka consumer in spark

I am using Spark 2.1 and Kafka 0.10.1.
I want to process the data by reading the entire contents of specific Kafka topics on a daily basis.
For Spark Streaming, I know that createDirectStream only needs a list of topics and some configuration information as arguments.
However, I realized that createRDD requires all of the topic, partition, and offset information.
I want to make batch processing as convenient as streaming in Spark.
Is it possible?
I suggest you read this text from Cloudera.
The example shows how to read the data from Kafka only once. To do that, you persist the offsets in Postgres, relying on its ACID guarantees.
I hope that solves your problem.
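The offset bookkeeping behind that approach can be sketched without Spark or Kafka on the classpath. This is a minimal illustration, not the code from the linked article. The class BatchOffsetPlanner, the Range type, and the choice to start unseen partitions at offset 0 are all assumptions made for the example. Given the offsets saved by the previous run and the current end offsets, it computes the per-partition ranges the next batch would read, which is the topic, partition, and offset information the question mentions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class BatchOffsetPlanner {
    // Hypothetical value type holding what a batch read has to specify.
    public static final class Range {
        public final String topic;
        public final int partition;
        public final long fromOffset;   // inclusive
        public final long untilOffset;  // exclusive

        public Range(String topic, int partition, long fromOffset, long untilOffset) {
            this.topic = topic;
            this.partition = partition;
            this.fromOffset = fromOffset;
            this.untilOffset = untilOffset;
        }
    }

    // saved:  partition -> next offset to read, as stored by the previous run.
    // latest: partition -> current end offset of that partition.
    public static List<Range> plan(String topic, Map<Integer, Long> saved,
                                   Map<Integer, Long> latest) {
        List<Range> ranges = new ArrayList<>();
        // TreeMap gives a stable, partition-ordered result.
        for (Map.Entry<Integer, Long> e : new TreeMap<>(latest).entrySet()) {
            // Assumption: a partition with no saved offset is read from 0.
            long from = saved.getOrDefault(e.getKey(), 0L);
            long until = e.getValue();
            if (until > from) {
                // Only partitions with new data produce a range.
                ranges.add(new Range(topic, e.getKey(), from, until));
            }
        }
        return ranges;
    }
}
```

In a daily job, the saved offsets would be loaded from the offset store before the run and updated with each range's end offset only after the batch succeeds, so a failed run is simply repeated over the same ranges.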

can someone please suggest best way of doing log analysis using spark streaming

I am completely new to Big Data. For the last few weeks I have been trying to build a log analysis application.
I read many articles and found that Kafka + Spark Streaming is the most reliable configuration.
Now I am able to process data sent from my simple Kafka Java producer to Spark Streaming.
Can someone please suggest a few things, like:
1) How can I read server logs in real time and pass them to a Kafka broker?
2) Are there any frameworks available to push data from logs to Kafka?
3) Any other suggestions?
Thanks,
Chowdary
There are many ways to collect logs and send them to Kafka. If you want to send log files as a stream of events, I would recommend looking at Logstash/Filebeat: just set up your input as a file input and your output as Kafka.
You may also push data to Kafka using the log4j KafkaAppender, or pipe logs to Kafka using one of the many CLI tools already available.
In case you need to guarantee sequence, pay attention to partition configuration and partition selection logic. For example, log4j appender will distribute messages across all partitions. Since Kafka guarantees sequence per partition only, your Spark streaming jobs may start processing events out of sequence.
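One common way to keep related events in order is to give them the same message key, because records with the same key are routed to the same partition as long as the partition count does not change. The sketch below shows the idea with a deliberately simplified hash. It is not the hash the real Kafka client uses, and the class name and the host names used as keys are assumptions for illustration.

```java
public class KeyPartitioning {
    // Simplified stand-in for key-based partition selection.
    // floorMod keeps the result in [0, numPartitions) even for negative hashes.
    public static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        // Hypothetical log sources used as message keys.
        String[] hosts = {"web-1", "web-2", "web-1", "web-3", "web-1"};
        for (String host : hosts) {
            // Every event from the same host maps to the same partition.
            System.out.println(host + " -> partition " + partitionFor(host, 4));
        }
    }
}
```

With the source host as the key, all log lines from one server stay in sequence within their partition, while different servers still spread across partitions for parallelism.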