How to sink from one Kafka topic to another Kafka topic using Flume?

I am trying to transfer logs from one Kafka topic to another. I need to connect Kafka to Kafka using Flume. Take a look at my configuration below:
#
# Flume Conf
#
a1.sources = s1
a1.sinks = k1
a1.channels = c1
# Kafka Source
a1.sources.s1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.s1.kafka.bootstrap.servers = kafka:9092
a1.sources.s1.kafka.topics = apache
# Kafka Sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = kafka:9092
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000000
a1.channels.c1.transactionCapacity = 1000000
# Bind the source and sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
The sink is not being created.

If you want to replicate data from one Kafka cluster to another, there are better ways than Flume, including:
MirrorMaker, as #cricket_007 mentioned (open source, part of Apache Kafka)
Confluent's Replicator (commercial tool, 30 day free trial)
uReplicator (open sourced from Uber)
Mirus (open sourced from Salesforce)
Brucke (open source)
If you want a really dirty hack, you can also do something with kafkacat and nc.
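For example, a rough sketch of that hack (broker addresses and topic names here are placeholders; this drops keys, headers, and any delivery guarantees):
kafkacat -C -b source-kafka:9092 -t apache -e | kafkacat -P -b target-kafka:9092 -t apache-copy
The first kafkacat consumes the topic to stdout and exits at the end of the partitions (-e), and the second produces each line to the target cluster.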
Disclaimer: I work for Confluent.
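That said, if you do want to stay with Flume, note that the Kafka sink in your configuration never sets a target topic, so events would go to Flume's default topic rather than the one you expect. A minimal sketch of the missing line (the topic name is only an example):
a1.sinks.k1.kafka.topic = apache-copy
Make sure the sink topic is different from the source topic (apache); otherwise the agent can loop events back into itself, since the Kafka source also sets a topic header that the sink may honour depending on your Flume version.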

Related

How do Apache Flume and Kafka work together?

Regarding this configuration, my understanding is that Flume reads messages from the Kafka topic source-topic, pushes each message/event into the Kafka channel/topic test-topic, and then the sink consumes it and writes it to Elasticsearch.
To test this flow, I explicitly pushed one message/event to the Kafka topic source-topic and expected to see this event on the sink side, but it did not work for me.
Then I did some debugging and figured the message/event must still be in the Kafka channel. But when I ran the bin/kafka-topics.sh --list --zookeeper localhost:2181 command, it did not list test-topic on the console.
Now my question is: is this channel name not a Kafka topic?
If not, how can I query the events from the Kafka channel? Or maybe someone can help me understand this flow.
test.sources = ks
test.sinks = es
test.channels = kc
# SOURCES
test.sources.ks.type = org.apache.flume.source.kafka.KafkaSource
test.sources.ks.zookeeperConnect = 127.0.0.1:2181
test.sources.ks.topic = source-topic
test.sources.ks.groupId = cst
test.sources.ks.batchSize = 1000
test.sources.ks.batchDurationMillis = 1000
test.sources.ks.kafka.consumer.timeout.ms = 100
test.sources.ks.kafka.auto.offset.reset = smallest
# sink
test.sinks.es.type = org.es.TestElasticSearchSink
test.sinks.es.hostNames = 127.0.0.1:9200
test.sinks.es.indexName = test-idx
test.sinks.es.batchSize = 1000
test.sinks.es.iaCacheLifetime = 20
# Normal channel
test.channels.kc.type = org.kc.TestKafkaChannel
test.channels.kc.capacity = 10000
test.channels.kc.transactionCapacity = 1000
test.channels.kc.brokerList = 127.0.0.1:9092
test.channels.kc.topic = test-topic
test.channels.kc.zookeeperConnect = 127.0.0.1:2181
test.channels.kc.parseAsFlumeEvent = false
test.channels.kc.readSmallestOffset = true
test.channels.kc.groupId = test-flume
You will probably want to pre-create all necessary Kafka topics before starting Flume. However, it's not clear what org.kc.TestKafkaChannel or org.es.TestElasticSearchSink are. Flume provides classes for both of these (a Kafka channel and an Elasticsearch sink), I believe, so anything "not working" would begin in either of your "custom" classes here...
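If the topics do exist, you can also check what actually landed in the channel's topic with the stock Kafka CLI tools; a quick sketch using the old ZooKeeper-based flags to match the rest of this configuration (newer clusters use --bootstrap-server instead):
bin/kafka-topics.sh --list --zookeeper 127.0.0.1:2181
bin/kafka-console-consumer.sh --zookeeper 127.0.0.1:2181 --topic test-topic --from-beginning
If test-topic never appears in the list, the channel never created or wrote to it, which again points back at the custom channel class.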
Alternatively, Kafka Connect already has an Elasticsearch sink connector, so you don't need an intermediate Kafka topic just to send data between Kafka and Elasticsearch. Logstash would work as well.
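As a rough sketch of that route (the property names follow the Confluent Elasticsearch sink connector; the connector name and URL below are placeholders, not taken from your setup):
name=es-sink
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
topics=source-topic
connection.url=http://127.0.0.1:9200
type.name=_doc
key.ignore=true
Loaded into a Kafka Connect worker, this consumes source-topic directly and indexes the records into Elasticsearch, with no Flume agent or intermediate topic in between.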

__consumer_offsets is unable to sync

I am using mm2 with the properties below.
The source (A) and sink (B) clusters each have their own separate ZooKeeper.
I consumed some data from topic test in source A.
Then I stopped the consumer and started the mirroring process.
When I pointed a consumer with the same group id at the sink, it started consuming from the beginning. I expected it to start in the sink from where it left off in the source.
###############
A.bootstrap.servers = localhost:9092
B.bootstrap.servers = localhost:9093
A->B.enabled = true
A->B.topics = test
#B->A.enabled = true
#B->A.topics = .*
checkpoints.topic.replication.factor=1
heartbeats.topic.replication.factor=1
offset-syncs.topic.replication.factor=1
offset.storage.replication.factor=1
status.storage.replication.factor=1
config.storage.replication.factor=1
Since Kafka 2.7, MirrorMaker can automatically mirror consumer group offsets by setting sync.group.offsets.enabled=true.
In your example:
A->B.sync.group.offsets.enabled=true
Before 2.7, MirrorMaker does not automatically commit consumer group offsets, and you need to use RemoteClusterUtils to do the offset translation yourself.
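For completeness, a sketch of how the related settings could look in your properties file (the interval values here are only examples):
A->B.sync.group.offsets.enabled = true
A->B.sync.group.offsets.interval.seconds = 60
A->B.emit.checkpoints.enabled = true
A->B.emit.checkpoints.interval.seconds = 60
With these in place, a consumer restarted against cluster B with the same group id should resume roughly from where it left off on A; the offset translation is approximate, so a small amount of reprocessing is still possible.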

Kafka MirrorMaker 2.0 duplicates each message

I am trying to replicate a Kafka cluster with MirrorMaker 2.0. I am using the following mm2.properties:
name = mirror-site1-site2
topics = .*
connector.class = org.apache.kafka.connect.mirror.MirrorSourceConnector
tasks.max = 1
plugin.path=/usr/share/java/kafka/plugin
clusters = site1, site2
# for demo, source and target clusters are the same
source.cluster.alias = site1
target.cluster.alias = site2
site1.sasl.mechanism=SCRAM-SHA-256
site1.security.protocol=SASL_PLAINTEXT
site1.sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \
username="<someuser>" \
password="<somepass>";
site2.sasl.mechanism=SCRAM-SHA-256
site2.security.protocol=SASL_PLAINTEXT
site2.sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \
username="<someuser>" \
password="<somepass>";
site1.bootstrap.servers = <IP1>:9093, <IP2>:9093, <IP3>:9093, <IP4>:9093
site2.bootstrap.servers = <IP5>:9093, <IP6>:9093, <IP7>:9093, <IP8>:9093
site1->site2.enabled = true
site1->site2.topics = topic1
# use ByteArrayConverter to ensure that records are not re-encoded
key.converter = org.apache.kafka.connect.converters.ByteArrayConverter
value.converter = org.apache.kafka.connect.converters.ByteArrayConverter
So here's the issue: mm2 seems to always replicate each message three times:
# Manual message production:
kafkacat -P -b <IP1>:9093,<IP2>:9093,<IP3>:9093,<IP4>:9093 -t "topic1"
# Result in the source topic (site1 cluster):
% Reached end of topic topic1 [2] at offset 405
Message1
% Reached end of topic topic1 [2] at offset 406
Message2
% Reached end of topic topic1 [6] at offset 408
Message3
% Reached end of topic topic1 [2] at offset 407
kafkacat -C -b <IP5>:9093,<IP6>:9093,<IP7>:9093,<IP8>:9093 -t "site1.topic1"
# Result in the target topic (site2 cluster):
% Reached end of topic site1.topic1 [2] at offset 1216
Message1
Message1
Message1
% Reached end of topic site1.topic1 [2] at offset 1219
Message2
Message2
Message2
% Reached end of topic site1.topic1 [6] at offset 1229
Message3
Message3
Message3
I tried using Kafka from the Confluent package and kafka_2.13-2.4.0 directly from Apache, both on Debian 10.1.
I first encountered this behaviour with Confluent 5.4 and thought it could be a bug in their package, as they have Replicator and may not really care about mm2, but I reproduced exactly the same issue with kafka_2.13-2.4.0 straight from Apache without any change.
I'm aware that mm2 is not yet idempotent and can't guarantee exactly-once delivery. In my tests I tried many things, including producer tuning and bigger batches of a thousand messages. In all these tests mm2 always duplicated every message three times.
Did I miss something? Has anyone encountered the same thing? As a side note, with legacy mm1 and the same packages I don't have this issue.
Appreciate any help... Thanks !
Even though the changelog didn't make me very confident about an improvement, I tried again to run mm2, from Kafka 2.4.1 this time. => No change, always these strange duplications.
I installed this release on a new server to make sure the strange behaviour I was seeing wasn't something related to the server.
Since I use ACLs, do I need special rights? I granted "all", thinking it can't get more permissive... Even though mm2 isn't idempotent yet, I'll give the idempotence-related permissions a try.
What surprises me the most is that I can't find anything reporting an issue like this; surely I must be doing something wrong, but what? That is the question...
You need to remove connector.class = org.apache.kafka.connect.mirror.MirrorSourceConnector from your configuration. This setting tells MirrorMaker to use that class for the heartbeats and checkpoints connectors it generates alongside the source connector that replicates data, and it makes them behave exactly like a source connector. That's why each message is replicated 3 times: you have actually generated 3 source connectors.
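In other words, something closer to this minimal sketch should be enough (trimmed down from your own file; the SASL settings are omitted here only for brevity):
clusters = site1, site2
site1.bootstrap.servers = <IP1>:9093, <IP2>:9093, <IP3>:9093, <IP4>:9093
site2.bootstrap.servers = <IP5>:9093, <IP6>:9093, <IP7>:9093, <IP8>:9093
site1->site2.enabled = true
site1->site2.topics = topic1
tasks.max = 1
With connector.class left out, the MirrorSourceConnector, MirrorCheckpointConnector, and MirrorHeartbeatConnector are each created with their proper classes, and each record should be replicated only once.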
Enabling idempotence in the producer config will fix the issue. By default it is set to false. Add the lines below to the mm2.properties file:
source.cluster.producer.enable.idempotence = true
target.cluster.producer.enable.idempotence = true

Streaming Kafka Messages to MySQL Database

I want to write Kafka messages to a MySQL database. There is an example in this link. In that example, Apache Flume is used to consume messages and write them to MySQL. I'm using the same code, and when I run the flume-ng agent the event always comes out null.
And my flume.conf.properties file is:
agent.sources=kafkaSrc
agent.channels=channel1
agent.sinks=jdbcSink
agent.channels.channel1.type=org.apache.flume.channel.kafka.KafkaChannel
agent.channels.channel1.brokerList=localhost:9092
agent.channels.channel1.topic=kafkachannel
agent.channels.channel1.zookeeperConnect=localhost:2181
agent.channels.channel1.capacity=10000
agent.channels.channel1.transactionCapacity=1000
agent.channels.channel1.parseAsFlumeEvent=false
agent.sources.kafkaSrc.type = org.apache.flume.source.kafka.KafkaSource
agent.sources.kafkaSrc.channels = channel1
agent.sources.kafkaSrc.zookeeperConnect = localhost:2181
agent.sources.kafkaSrc.topic = kafka-mysql
agent.sinks.jdbcSink.type = com.stratio.ingestion.sink.jdbc.JDBCSink
agent.sinks.jdbcSink.connectionString = jdbc:mysql://127.0.0.1:3306/test?useSSL=false
agent.sinks.jdbcSink.username=root
agent.sinks.jdbcSink.password=pass
agent.sinks.jdbcSink.batchSize = 10
agent.sinks.jdbcSink.channel =channel1
agent.sinks.jdbcSink.sqlDialect=MYSQL
agent.sinks.jdbcSink.driver=com.mysql.jdbc.Driver
agent.sinks.jdbcSink.sql=INSERT INTO kafkamsg(msg) VALUES(${body:varchar})
Where am I going wrong?
Thanks.
In my reference example, Flume listens to Kafka on the kafka-mysql topic. But this configuration works with the kafkachannel topic, so we need to produce messages to the kafkachannel topic; I don't know why.
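For example, with the configuration above, events only reach the JDBC sink when they are published to the channel's topic (old-style flag shown to match this Kafka era; newer clusters use --bootstrap-server):
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic kafkachannel
I believe this is because the sink takes its events from the Kafka channel's topic (kafkachannel), not from the source's topic (kafka-mysql), and with parseAsFlumeEvent=false the channel accepts plain messages written directly to that topic.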

Flume Kafka sink not able to write complete messages to Kafka Broker

I have written a process where I generate messages through a custom Flume source and use the Flume Kafka sink provided by Hortonworks to write them to Kafka brokers.
During this process I have noticed that if the Kafka broker is already running and I then start my Flume agent, it delivers each and every message to the Kafka broker properly; but when I start the Kafka broker while the Flume agent is already running, the Kafka broker does not receive all the messages.
When I run the Kafka console consumer to check the count of messages received, I notice it is dropping a few records from the beginning and a few records from the end.
I have tried multiple mix-and-match changes in flume.conf but it is still not working as expected.
Below are the configuration parameters which I have provided in flume.conf:
agent.channels = firehose-channel
agent.sources = stress-source
agent.sinks = kafkasink
#################################
# Benchmark Source Configuration #
#################################
agent.sources.stress-source.type=com.kohls.flume.source.stress.BenchMarkTestScenriao
agent.sources.stress-source.size=5000
agent.sources.stress-source.maxTotalEvents=30000
agent.sources.stress-source.batchSize=200
agent.sources.stress-source.throughputThreshold=4000
agent.sources.stress-source.throughputControlSeconds=1
agent.sources.stress-source.channels=firehose-channel
#################################
# Firehose Channel Configuration #
#################################
agent.channels.firehose-channel.type = file
agent.channels.firehose-channel.checkpointDir = /data/flume/checkpoint
agent.channels.firehose-channel.dataDirs = /data/flume/data
agent.channels.firehose-channel.capacity = 10000
agent.channels.firehose-channel.transactionCapacity = 10000
agent.channels.firehose-channel.useDualCheckpoints=1
agent.channels.firehose-channel.backupCheckpointDir=/data/flume/backup
############################################
# Firehose Sink Configuration - Kafka Sink #
############################################
agent.sinks.kafkasink.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.kafkasink.topic = backoff_test_17
agent.sinks.kafkasink.channel=firehose-channel
agent.sinks.kafkasink.brokerList = sandbox.hortonworks.com:6667
agent.sinks.kafkasink.batchsize = 200
agent.sinks.kafkasink.requiredAcks = 1
agent.sinks.kafkasink.kafka.producer.type = async
agent.sinks.kafkasink.kafka.batch.num.messages = 200
I have also tried to analyse the Flume log and noticed that the Flume metrics are properly showing the PUT and TAKE counts.
Please let me know if anyone has any pointers to solve this issue. Appreciate your help in advance.