Real Time event processing - apache-kafka

I really want to get an architectural solution for my below scenario.
I have a source of events (Say sensors in oil wells , around 50000 ), that produces events to a server. At the server side I want to process all these events in such a way that , the information from the sensors about latest humidity, temperature,pressure ...etc will be stored/updated to a database.
I am confused with flume or kafka.
Can somebody please address my simple scenario in architectural terms.
I don't want to store the event somewhere, since I am already updating the database with latest values.
Should I really need spark , (flume/kafka) + spark , to meet the processing side?.
Can we do any kind of processing using flume without a sink?

Sounds like you need to use the Kafka producer API to publish the events to a topic then simply read those events either by using the Kafka consumer API to write to your database or use the Kafka JDBC sink connector.
Also if you need just the latest data inside Kafka take a look at log compaction.

One way would be to push all the messages to Kafka Topic. Using Spark Stream you can ingest and process from the kafka topic. Spark streaming can directly process from your Kafka Topic

Related

MQTT topics and kafka topics mapping

I have started to learn about MQTT as I have a use case in telematics in my current organisation. I would like to integrate MQTT broker ( mosquitto ) messages to my kafka.
Since every vehicle is sending the data in its own topic in MQTT broker within a single organisation, I would like to push all this data in kafka. Now I know it is not advisable to create so many topics in kafka ( more than a million ). Also I would like not like to save all the vehicles data in one kafka topic as I would like to later put all this data in S3, differentiated via vehicle id.
How can I achieve this without making so many topics in kafka. One way is the consumer of kafka segregate the events and put in s3 but I believe there will be a lot of small files in S3.
Generally, if you have the same logical entity you would use the same topic.
You can use the MQTT plugin for Kafka Connect to stream the data from MQTT into Kafka, and Kafka Connect's Single Message Transform RegexRouter to modify the topic name to which messages are written, and other SMT to modify the message key. That way you get all the messages in one topic, partitioned based on the vehicle id. That's probably the best way to store it.
From there, you can use the data however you want. When it comes to stream it to S3, you can use Kafka Connect S3 sink and as cricket_007 mentioned partition the data by time if it's just volume you're worried about. If you want to route the messages to different buckets or areas of the same bucket you could use a stream processing (e.g. Kafka Streams / ksqlDB) to pre-process the topic to populate others.
See here for an example of the MQTT connector.

Kafka streams vs Kafka connect for Kafka HBase ETL pipeline

I have straightforward scenario for the ETL job: take data from Kafka topic and put it to HBase table. In the future i'm going to add the support for some logic after reading data from a topic.
I consider two scenario:
use Kafka Streams for reading data from a topic and further writing via native HBased driver each record
Use Kafka -> HBase connector
I have the next concerns about my options:
Is is a goo idea to write data each time it arrives in a Kafka Stream's window? - suggest that it'll downgrade performance
Kafka Hbase connector is supported only by third-party developer, i'm not sure about code quality of this solution and about the option to add custom aggregation logic over data from a topic.
I myself have been trying to search for ETL options for KAFKA to HBase, however, so far my research tells me that it's a not a good idea to have an external system interaction within a KAFKA streams application (check the answer here and here). KAFKA streams are super powerful and great if you have KAFKA->Transform_message->KAFKA kind of use case, and eventually you can have KAFKA connect that will take your data from KAFKA topic and write it to a sink.
Since you do not want to use the third party KAFKA connect for HBase, one option is to write something yourself using the connect API, the other option is to use the KAFKA consumer producer API and write the app using the traditional way, poll the messages, write to sink, commit the batch and move on.

Apache Kafka Lineage Information

Is there any way to obtain lineage data for Kafka jobs.
Like for example we have a Job History URL for all the MapReduce jobs.
Is there anything similar for Kafka where I can get the metadata of the Producer producing information to a particular topic? (eg: IP address of the producer)
Out of the box, no, not for Producers.
You can list consumer client addressses, but not see producers into a topic.
One option would be to look into Apache Atlas for lineage information, which is one component of Hortonwork's Stream Messaging Manager for analyzing Kafka connection information.
Otherwise, you would be stuck trying to enfore producers to send this data along with the messages.

What should I use: Kafka Stream or Kafka consumer api or Kafka connect

I would like to know what would be best for me: Kafka stream or Kafka consumer api or Kafka connect?
I want to read data from topic then do some processing and write to database. So I have written consumers but I feel I can write Kafka stream application and use it's stateful processor to perform any changes and write it to database which can eliminate my consumer code and only have to write db code.
Databases I want to insert my records are:
HDFS - (insert raw JSON)
MSSQL - (processed json)
Another option is Kafka connect but I have found there is no json support as of now for hdfs sink and jdbc sink connector.(I don't want to write in avro) and creating schema is also pain for complex nested messages.
Or should I write custom Kafka connect to do this.
So need you opinion on whether I should write Kafka consumer or Kafka stream or Kafka connect?
And what will be better in terms of performance and have less overhead?
You can use a combination of them all
I have tried HDFS sink for JSON but not able to use org.apache.kafka.connect.json.JsonConverter
Not clear why not. But I would assume you forgot to set schemas.enabled=false.
when I set org.apache.kafka.connect.storage.StringConverter it works but it writes the json object in string escaped format. For eg. {"name":"hello"} is written into hdfs as "{\"name\":\"hello\"}"
Yes, it will string-escape the JSON
Processing I want to do is basic validation and few field values transformation
Kafka Streams or Consumer API is capable of validation. Connect is capable of Simple Message Transforms (SMT)
Some use cases, you need to "duplicate data" onto Kafka; process your "raw" topic, read it using a consumer, then produce it back into a "cleaned" topic, from which you can use Kafka Connect to write to a database or filesystem.
Welcome to stack overflow! Please take the tout https://stackoverflow.com/tour
Please make posts with precise question, not asking for opinions - this makes the site clearer, and opinions are not answers (and subject to every person preferences). Asking "How to use Kafka-connect with json" - or so would fit this site.
Also, please show some research.
Less overhead would be kafka consumer - kafka stream and kafka connect use kafka consumer, so you will always be able to make less overhead, but will also lose all benefits (tolerant to failures, easy of usage, support, etc)
First, it depends of what your processing is. Aggregation? Counting? Validation? Then, you can use kafka streams to do the processing and write the result to a new topic, on the format you want.
Then, you can use kafka connect to send the data to your database. You are not forced to use avro, you can use other format for key/value, see
Kafka Connect HDFS Sink for JSON format using JsonConverter
Kafka Connect not outputting JSON

can someone please suggest best way of doing log analysis using spark streaming

I am completely new to Big Data, from last few weeks i am try to build log analysis application.
I read many articles and i found Kafka + spark streaming is the most reliable configuration.
Now, I am able to process data sent from my simple kafka java producer to spark Streaming.
Can someone please suggest few things like
1) how can i read server logs real time and pass it to kafka broker.
2) any frameworks available to push data from logs to Kafka?
3) any other suggestions??
Thanks,
Chowdary
There are many ways to collect logs and send to Kafka. If you are looking to send log files as stream of events I would recommend to review Logstash/Filebeats - just setup you input as fileinput and output to Kafka.
You may also push data to Kafka using log4j KafkaAppender or pipe logs to Kafka using many CLI tools already available.
In case you need to guarantee sequence, pay attention to partition configuration and partition selection logic. For example, log4j appender will distribute messages across all partitions. Since Kafka guarantees sequence per partition only, your Spark streaming jobs may start processing events out of sequence.