Flume agent producing multiple .tmp files when data is sent in succession

I have a Flume agent running in CDH 5.8.3. It creates multiple .tmp files when writing to HDFS if more than 3 valid files are sent. There is an interceptor that routes valid XMLs to the appropriate topic before the HDFS sink. This agent is using Flafka. The interceptor and Kafka are working correctly.
agent.sinks.hdfs_valid.channel=valid_channel
agent.sinks.hdfs_valid.type=hdfs
agent.sinks.hdfs_valid.writeFormat=Text
agent.sinks.hdfs_valid.hdfs.fileType=DataStream
agent.sinks.hdfs_valid.hdfs.filePrefix=event
agent.sinks.hdfs_valid.hdfs.fileSuffix=.xml
agent.sinks.hdfs_valid.hdfs.path=locationoffile/%{time}
agent.sinks.hdfs_valid.hdfs.idleTimeout=900
agent.sinks.hdfs_valid.hdfs.rollInterval=3600
agent.sinks.hdfs_valid.hdfs.kerberosPrincipal=authentication#example.com
agent.sinks.hdfs_valid.hdfs.kerberosKeytab=locationofkeytab
agent.sinks.hdfs_valid.hdfs.rollSize=0
agent.sinks.hdfs_valid.hdfs.rollCount=0
agent.sinks.hdfs_valid.hdfs.callTimeout=100000

Interestingly enough, our Kafka topic was set to 20 partitions. When Flume consumes from it, the first 10 partitions are consumed from one IP and open one .tmp file, while the second 10 partitions are consumed from another IP and open a second .tmp file. This appears to be internal Flume behaviour. All data arrived correctly despite the two open .tmp files.
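If you want to confirm which host owns which partitions, one quick check (a sketch, assuming the Flume Kafka source uses a consumer group called flume and that broker1:9092 is one of your brokers; adjust both to your setup, and note that on the older Kafka shipped with CDH 5.8 you may need --zookeeper or --new-consumer instead) is to describe the consumer group:
kafka-consumer-groups.sh --bootstrap-server broker1:9092 --describe --group flume
The output shows, per partition, which consumer instance is currently assigned to it, so seeing two distinct consumer hosts there matches the two open .tmp files.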

Related

Kafka file stream connect and stream API

I am working on the FileStream connector. I have more than ten million records in the file (it's not a single file; it's partitioned by account #). I have to load these files into the topic and update my streams. I have gone through standalone Streams, and I have the following questions and need help to achieve this.
Looking at the data set, I have two account numbers, and each account has 5 rows. I would need to group them into two rows keyed by acctNbr.
How do I write my source connector to read the file and apply the grouping logic?
My brokers are running on Linux machines X, Y, Z. After developing the source connector, should my jar file be deployed on every broker (if I start running in distributed mode)?
I have only a 30-minute window to extract the file drop to the topic. What parameters can I tune to bring my processing window down? FYI, this topic would have more than 50 partitions and a 3-broker setup.
Data set:
{"acctNbr":"1234567","secNbr":"AAPL","date":"2010-01-01","currentPrice":"10","availQnty":"10"}
{"acctNbr":"1234567","secNbr":"AAPL","date":"2010-01-02","currentPrice":"10","availQnty":"10"}
{"acctNbr":"1234567","secNbr":"AAPL","date":"2010-01-03","currentPrice":"10","availQnty":"10"}
{"acctNbr":"1234567","secNbr":"AAPL","date":"2010-01-04","currentPrice":"10","availQnty":"10"}
{"acctNbr":"1234567","secNbr":"AAPL","date":"2010-01-05","currentPrice":"10","availQnty":"10"}
{"acctNbr":"abc3355","secNbr":"AAPL","date":"2010-01-01","currentPrice":"10","availQnty":"10"}
{"acctNbr":"abc3355","secNbr":"AAPL","date":"2010-01-02","currentPrice":"10","availQnty":"10"}
{"acctNbr":"abc3355","secNbr":"AAPL","date":"2010-01-03","currentPrice":"10","availQnty":"10"}
{"acctNbr":"abc3355","secNbr":"AAPL","date":"2010-01-04","currentPrice":"10","availQnty":"10"}
{"acctNbr":"abc3355","secNbr":"AAPL","date":"2010-01-05","currentPrice":"10","availQnty":"10"}
How do I write my source connector to read the file and apply the grouping logic?
The FileStream connector cannot do this, and it was not intended for such a purpose; it exists only as an example for writing your own connectors. In other words, do not use it in production.
That being said, you can use alternative solutions like Flume, Filebeat, Fluentd, NiFi, Streamsets, etc., to glob your filepaths and then send all records line-by-line into a Kafka topic.
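With Flume, for example, that can be as small as a spooling-directory source feeding a Kafka sink. A minimal sketch (the agent name, directory, topic and broker address are placeholders, and the kafka.* property names assume Flume 1.7+):
a1.sources=src1
a1.channels=ch1
a1.sinks=snk1
a1.sources.src1.type=spooldir
a1.sources.src1.spoolDir=/data/input
a1.sources.src1.channels=ch1
a1.channels.ch1.type=memory
a1.sinks.snk1.type=org.apache.flume.sink.kafka.KafkaSink
a1.sinks.snk1.kafka.bootstrap.servers=broker1:9092
a1.sinks.snk1.kafka.topic=records
a1.sinks.snk1.channel=ch1
With the default line deserializer, every line of every file dropped into spoolDir becomes one record on the topic; the grouping by acctNbr would then be done downstream, e.g. in your Streams application.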
After developing the source connector, should my jar file be deployed on every broker?
You should not run Connect on any broker. The Connect servers are called workers.
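In distributed mode the workers are started as their own processes, on whatever machines you choose (they do not have to be the broker hosts), for example:
bin/connect-distributed.sh config/connect-distributed.properties
Connectors are then submitted to the workers through their REST interface, and the connector jar only needs to be on the workers' classpath/plugin path, not on the brokers.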
I have only a 30-minute window to extract the file drop to the topic?
It is not clear where this number comes from. All of the methods listed above watch for new files continuously, without any defined window.

Why my Kafka connect sink cluster only has one worker processing messages?

I've recently set up a local Kafka on my computer for testing and development purposes:
3 brokers
One input topic
Kafka connect sink between the topic and elastic search
I managed to configure it in standalone mode, so everything is on localhost, and Kafka Connect was started using the ./connect-standalone.sh script.
What I'm trying to do now is to run my connectors in distributed mode, so the Kafka messages can be split between both workers.
I've started the two workers (still everything on the same machine), but when I send messages to my Kafka topic, only one worker (the last one started) is processing messages.
So my question is: why is only one worker processing Kafka messages instead of both?
When I kill one of the worker, the other one takes the message flow back, so I think the cluster is well setup.
What I think:
I don't put keys inside my Kafka messages; can it be related to this?
I'm running everything on localhost; can distributed mode work this way? (I've correctly configured the worker-specific fields such as rest.port.)
Resolved:
From Kafka documentation:
The division of work between tasks is shown by the partitions that each task is assigned
If you don't use partitions (i.e., you push all messages to the same partition), the workers won't be able to divide the messages between them.
You don't need to use message keys; you can just push your messages to different partitions in a round-robin (cyclic) way.
See: https://docs.confluent.io/current/connect/concepts.html#distributed-workers
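As an illustration, here is a minimal sketch of a keyless producer that spreads records over partitions explicitly in a cyclic way (the topic name, broker address and partition count below are placeholders; adjust them to your setup):
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class RoundRobinProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        int numPartitions = 2; // must match the partition count of the input topic
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 100; i++) {
                // explicit partition, null key: records are spread cyclically over the partitions
                producer.send(new ProducerRecord<>("input-topic", i % numPartitions, null, "message-" + i));
            }
        }
    }
}
Once records land in more than one partition, the sink tasks each get their own partitions to consume and both workers receive work.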

Flink kafka - Flink job not sending messages to different partitions

I have the below configuration:
One kafka topic with 2 partitions
One zookeeper instance
One kafka instance
Two consumers with same group id
Flink job snippet:
speStream.addSink(new FlinkKafkaProducer011(kafkaTopicName, new SimpleStringSchema(), props));
Scenario 1:
I have written a Flink job (producer) in Eclipse which reads a file from a folder and puts the messages on the Kafka topic.
When I run this code from Eclipse, it works fine.
For example: if I place a file with 100 records, Flink sends some messages to partition 1 and some to partition 2, and hence both consumers get some of the messages.
Scenario 2:
When I create a jar of the above code and run it on the Flink server, Flink sends all the messages to a single partition and hence only one consumer gets all the messages.
I want the behaviour of scenario 1 using the jar created in scenario 2.
For the Flink Kafka producer, add null as the last constructor parameter:
speStream.addSink(new FlinkKafkaProducer011(
        kafkaTopicName,
        new SimpleStringSchema(),
        props,
        (FlinkKafkaPartitioner) null));
The short explanation is that this stops Flink from using its default partitioner, FlinkFixedPartitioner. With the default partitioner disabled, Kafka distributes the data among its partitions as it sees fit. If it is NOT disabled, each parallel sink task (task slot) that uses the FlinkKafkaProducer writes to only one partition.
If you do not provide a FlinkKafkaPartitioner and do not explicitly say to use Kafka's partitioner, a FlinkFixedPartitioner is used, meaning that all events from one task end up in the same partition.
To use Kafka's partitioner, pass Optional.empty() as the partitioner argument of the constructor instead:
speStream.addSink(new FlinkKafkaProducer011(kafkaTopicName, new SimpleStringSchema(), props, Optional.empty()));
The difference between running from the IDE (Eclipse) and running the jar on the Flink server is probably due to a different parallelism or partitioning setup within Flink.

Load 1GB of file to Kafka producer directly from my local machine

I have experimented with the basic examples of publishing random messages from a producer to a consumer on the command line.
Now I want to publish the 1 GB of data present on my local machine. I am struggling to load that 1 GB of data into the producer.
Help me out, please.
You can simply dump a file to a Kafka topic with shell redirection. Assuming 1.xml is the 1 GB file, you can use the following command.
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test123 < ./1.xml
But please make sure that you set the following broker properties (in server.properties): socket.request.max.bytes, socket.receive.buffer.bytes, socket.send.buffer.bytes.
You also need to set max.message.bytes for the test123 topic if your message size is big.
Also increase the Xmx value in kafka-console-producer.sh to avoid an out-of-memory issue.
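For example, the topic-level limit mentioned above can be raised with kafka-configs.sh (the 10 MB value below is only an illustration; on newer brokers the tool takes --bootstrap-server instead of --zookeeper):
bin/kafka-configs.sh --zookeeper localhost:2181 --alter --entity-type topics --entity-name test123 --add-config max.message.bytes=10485760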
These are the general steps to load data into Kafka.
We will be able to help more if you provide the error.
A couple of approaches can help:
1) You can use big-data ingestion platforms like Flume, which are built for such use cases.
2) If you want to implement your own code, you can use the Apache Commons IO library, which lets you capture events when a new file arrives in a folder (i.e., monitor events happening inside a directory); once you have that, you can call the code which publishes the data to Kafka. A sketch follows after this list.
3) In our project we use Logstash to do the same: it fetches files from a folder, publishes the data to Kafka, and then processes it through Storm.
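For option 2, here is a minimal sketch using the org.apache.commons.io.monitor package together with a Kafka producer (the directory, topic and broker address are placeholders, and I am assuming Commons IO is the library meant above):
import java.io.File;
import java.nio.file.Files;
import java.util.Properties;
import org.apache.commons.io.monitor.FileAlterationListenerAdaptor;
import org.apache.commons.io.monitor.FileAlterationMonitor;
import org.apache.commons.io.monitor.FileAlterationObserver;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DirectoryToKafka {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        // watch a directory and publish every line of each newly created file
        FileAlterationObserver observer = new FileAlterationObserver(new File("/data/incoming"));
        observer.addListener(new FileAlterationListenerAdaptor() {
            @Override
            public void onFileCreate(File file) {
                try {
                    for (String line : Files.readAllLines(file.toPath())) {
                        producer.send(new ProducerRecord<>("mytopic", line));
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });

        FileAlterationMonitor monitor = new FileAlterationMonitor(5000, observer); // poll every 5 s
        monitor.start();
    }
}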

Kafka source vs Avro source for reading and writing data into kafka channel using flume

In Flume, I have a Kafka channel from which I can read and write data.
What is the difference in performance of reading and writing data into the Kafka channel if I replace the Kafka source and Kafka sink with an Avro source and Avro sink?
In my opinion, by replacing the Kafka source with an Avro source, I will be unable to read data in parallel from multiple partitions of the Kafka broker, as there is no consumer group specified for an Avro source. Please correct me if I am wrong.
In Flume, the Avro RPC source binds to a specified TCP port of a network interface, so only one Avro source of one of the Flume agents running on a single machine can ever receive events sent to this port.
Avro source is meant to connect two or more Flume agents together: one or more Avro sinks connect to a single Avro source.
As you point out, using Kafka as a source allows for events to be received by several consumer groups. However, my experience with Flume 1.6.0 is that it is faster to push events from one Flume agent to another on a remote host through Avro RPC rather than through Kafka.
So I ended up with the following setup for log data collection:
[Flume agent on remote collected node] =Avro RPC=> [Flume agent in central cluster] =Kafka=> [multiple consumer groups in central cluster]
This way, I got better log ingestion and processing throughput, and I could also encrypt and compress log data between the remote sites and the central cluster. This may change, however, once a future Flume version adds support for the new protocol introduced by Kafka 0.9.0, possibly making Kafka more usable as the front interface of the central cluster for remote data collection nodes.
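For reference, the Avro RPC leg of that setup is just an Avro sink on the remote agent pointing at an Avro source on the central agent. A minimal sketch (the agent names, channel names, host and port are placeholders; compression-type must match on both ends):
# remote agent
remote.sinks.avroOut.type=avro
remote.sinks.avroOut.hostname=central-host
remote.sinks.avroOut.port=4545
remote.sinks.avroOut.compression-type=deflate
remote.sinks.avroOut.channel=ch1
# central agent
central.sources.avroIn.type=avro
central.sources.avroIn.bind=0.0.0.0
central.sources.avroIn.port=4545
central.sources.avroIn.compression-type=deflate
central.sources.avroIn.channels=kafka_channel
SSL can additionally be enabled on the Avro sink and source if the link between sites needs to be encrypted.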