How do Apache Flume and Kafka work together? - apache-kafka

Regarding this configuration, my understanding is that Flume reads messages from the Kafka topic source-topic, pushes each message/event to the Kafka channel/topic test-topic, and then the sink consumes it and writes it to Elasticsearch.
To test this flow, I explicitly pushed one message/event to the Kafka topic source-topic and expected to see this event on the sink side. But it did not work for me.
Then I did some debugging and figured the message/event must be sitting in the Kafka channel. But when I ran the bin/kafka-topics.sh --list --zookeeper localhost:2181 command, it did not return test-topic on the console.
Now my question is: is this channel name not a Kafka topic?
If not, how can I query the event from the Kafka channel? Or maybe someone can help me understand this flow.
test.sources = ks
test.sinks = es
test.channels = kc
# SOURCES
test.sources.ks.type = org.apache.flume.source.kafka.KafkaSource
test.sources.ks.zookeeperConnect = 127.0.0.1:2181
test.sources.ks.topic = source-topic
test.sources.ks.groupId = cst
test.sources.ks.batchSize = 1000
test.sources.ks.batchDurationMillis = 1000
test.sources.ks.kafka.consumer.timeout.ms = 100
test.sources.ks.kafka.auto.offset.reset = smallest
# sink
test.sinks.es.type = org.es.TestElasticSearchSink
test.sinks.es.hostNames = 127.0.0.1:9200
test.sinks.es.indexName = test-idx
test.sinks.es.batchSize = 1000
test.sinks.es.iaCacheLifetime = 20
# Normal channel
test.channels.kc.type = org.kc.TestKafkaChannel
test.channels.kc.capacity = 10000
test.channels.kc.transactionCapacity = 1000
test.channels.kc.brokerList = 127.0.0.1:9092
test.channels.kc.topic = test-topic
test.channels.kc.zookeeperConnect = 127.0.0.1:2181
test.channels.kc.parseAsFlumeEvent = false
test.channels.kc.readSmallestOffset = true
test.channels.kc.groupId = test-flume

You will probably want to pre-create all necessary Kafka topics before starting Flume. However, it's not clear what org.kc.TestKafkaChannel or org.es.TestElasticSearchSink are. Flume provides classes for both of these (a Kafka channel and an Elasticsearch sink), I believe, so anything "not working" would likely begin in one of your "custom" classes here...
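For reference, you could pre-create the channel topic the same way you already listed topics:
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test-topic
And, as a sketch (assuming the stock classes meet your needs), the built-in equivalents of those two custom classes would be:
test.channels.kc.type = org.apache.flume.channel.kafka.KafkaChannel
test.sinks.es.type = org.apache.flume.sink.elasticsearch.ElasticSearchSink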
Alternatively, Kafka Connect already has an Elasticsearch sink connector, so you don't need an intermediate Kafka topic just to send data between Kafka and Elasticsearch. Logstash would work as well.
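As a rough sketch of that route (assuming the Confluent Elasticsearch sink connector is installed; the connector name and ignore settings below are illustrative), a standalone connector properties file could look like:
name=es-sink
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
tasks.max=1
topics=source-topic
connection.url=http://127.0.0.1:9200
type.name=_doc
key.ignore=true
schema.ignore=true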

Related

__consumer_offsets is unable to sync

I am using MM2 with the below properties.
The source (A) and sink (B) clusters each have their own separate ZooKeeper.
I consumed some data from topic test in source A,
then I stopped the consumer and started the mirror process.
When I pointed a consumer with the same group id at the sink, it started consuming from the beginning. I expected it to start in the sink from where it left off in the source.
###############
A.bootstrap.servers = localhost:9092
B.bootstrap.servers = localhost:9093
A->B.enabled = true
A->B.topics = test
#B->A.enabled = true
#B->A.topics = .*
checkpoints.topic.replication.factor=1
heartbeats.topic.replication.factor=1
offset-syncs.topic.replication.factor=1
offset.storage.replication.factor=1
status.storage.replication.factor=1
config.storage.replication.factor=1
Since Kafka 2.7, MirrorMaker can automatically mirror consumer group offsets by setting sync.group.offsets.enabled=true.
In your example:
A->B.sync.group.offsets.enabled=true
Before 2.7, MirrorMaker does not automatically commit consumer group offsets, and you need to use RemoteClusterUtils to do the offset translation yourself.
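As a sketch of that pre-2.7 approach (the group id "my-group" is illustrative; RemoteClusterUtils lives in the connect-mirror-client artifact, and committing through AdminClient.alterConsumerGroupOffsets needs clients on 2.5+, otherwise assign and commitSync with a consumer instead), you translate the group's offsets on the target cluster B and seed them there before restarting the consumer:

import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.mirror.RemoteClusterUtils;

public class TranslateGroupOffsets {
    public static void main(String[] args) throws Exception {
        // Client properties for the *target* cluster (B), where MM2 writes A.checkpoints.internal
        Map<String, Object> targetProps = new HashMap<>();
        targetProps.put("bootstrap.servers", "localhost:9093");

        // Translate the offsets committed by group "my-group" on cluster A into B's offset space
        Map<TopicPartition, OffsetAndMetadata> translated =
                RemoteClusterUtils.translateOffsets(targetProps, "A", "my-group", Duration.ofSeconds(30));

        // Seed the group on B so a consumer with the same group id resumes where it left off on A
        try (AdminClient admin = AdminClient.create(targetProps)) {
            admin.alterConsumerGroupOffsets("my-group", translated).all().get();
        }
    }
}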

Kafka Mirror Maker 2 Offset Replication Not Working

We are testing a DR scenario for Kafka. We have two Kafka clusters in separate regions and are using MirrorMaker 2 to replicate the topics and messages.
Topics and messages replicate fine, but we are observing that offsets are not replicating.
For example:
Produced 10 messages from a producer pointed at Kafka region 1.
Consumed 5 messages with a consumer pointed at Kafka region 1.
Stopped the consumer pointed at region 1.
Started a consumer pointed at region 2.
Consumed the messages.
The expectation here is that the region 2 consumer should consume from offset 6,
but it starts consuming from offset 0.
Below is the property file:
clusters = primary, secondary
# primary cluster information
primary.bootstrap.servers = test1-primary.com:9094,test2-primary.com.apttuscloud.io:9094,test3-primary.com:9094
primary.security.protocol= SASL_SSL
primary.ssl.truststore.password= dummypassword
primary.ssl.truststore.location= /opt/bitnami/kafka/config/certs/kafka.truststore.jks
primary.ssl.keystore.password= dummypassword
primary.ssl.keystore.location= /opt/bitnami/kafka/config/certs/kafka.keystore.jks
primary.ssl.endpoint.identification.algorithm=
primary.sasl.mechanism= PLAIN
primary.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="dummyuser" password="dummypassword";
# secondary cluster information
secondary.bootstrap.servers = test1-secondary.com:9094,test2-secondary.com.apttuscloud.io:9094,test3-secondary.com:9094
secondary.security.protocol= SASL_SSL
secondary.ssl.truststore.password= dummypassword
secondary.ssl.truststore.location= /opt/bitnami/kafka/config/certs/kafka.truststore.jks
secondary.ssl.keystore.password= dummypassword
secondary.ssl.keystore.location= /opt/bitnami/kafka/config/certs/kafka.keystore.jks
secondary.ssl.endpoint.identification.algorithm=
secondary.sasl.mechanism=PLAIN
secondary.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="dummyuser" password="dummypassword";
# Topic Configuration
primary->secondary.enabled = true
primary->secondary.topics = .*
secondary->primary.enabled = true
secondary->primary.topics = .*
############################# Internal Topic Settings #############################
# The replication factor for mm2 internal topics "heartbeats", "B.checkpoints.internal" and
# "mm2-offset-syncs.B.internal"
# For anything other than development testing, a value greater than 1 is recommended to ensure availability such as 3
checkpoints.topic.replication.factor= 3
heartbeats.topic.replication.factor= 3
offset-syncs.topic.replication.factor= 3
# The replication factor for connect internal topics "mm2-configs.B.internal", "mm2-offsets.B.internal" and
# "mm2-status.B.internal"
# For anything other than development testing, a value greater than 1 is recommended to ensure availability such as 3.
offset.storage.replication.factor=3
status.storage.replication.factor=3
config.storage.replication.factor=3
replication.factor = 3
refresh.topics.enabled = true
sync.topic.configs.enabled = true
refresh.topics.interval.seconds = 10
topics.blacklist = .*[\-\.]internal, .*\.replica, __consumer_offsets
groups.blacklist = console-consumer-.*, connect-.*, __.*
primary->secondary.emit.heartbeats.enabled = true
primary->secondary.emit.checkpoints.enabled = true
Please note some confidential values are replaced with dummy values.
Regards,
Narendra Jadhav
With MirrorMaker 2.5, when moving consumers between clusters, offsets are not automatically translated.
So upon starting consumers on another cluster, consumers need to use RemoteClusterUtils.translateOffsets() to find their offsets in this cluster.
In 2.7 (expected November 2020), you can have MirrorMaker 2 automatically translate offsets, see https://cwiki.apache.org/confluence/display/KAFKA/KIP-545%3A+support+automated+consumer+offset+sync+across+clusters+in+MM+2.0
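For this configuration, once brokers and MirrorMaker are on 2.7+, a sketch of that setting for your cluster aliases (it builds on the checkpoint flow you already enable with emit.checkpoints.enabled) would be:
primary->secondary.sync.group.offsets.enabled = true
secondary->primary.sync.group.offsets.enabled = true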

How to sink a topic, Kafka to Kafka, using Flume?

I am trying to transfer logs from one topic to another topic. I need to connect Kafka to Kafka using Flume. Take a look below:
#
# Flume Conf
#
a1.sources = s1
a1.sinks = k1
a1.channels = c1
# Kafka Source
a1.sources.s1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.s1.kafka.bootstrap.servers = kafka:9092
a1.sources.s1.kafka.topics = apache
# Kafka Sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = kafka:9092
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000000
a1.channels.c1.transactionCapacity = 1000000
# Bind the source and sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
The sink is not being created.
If you want to replicate data from one Kafka cluster to another, there are better ways than Flume, including:
MirrorMaker, as @cricket_007 mentioned (open source, part of Apache Kafka)
Confluent's Replicator (commercial tool, 30 day free trial)
uReplicator (open sourced from Uber)
Mirus (open sourced from Salesforce)
Brucke (open source)
If you want a really dirty hack, you can also do something with kafkacat and nc.
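For example, a throwaway one-liner (host names are illustrative, and this drops keys, headers and partitioning) could pipe a kafkacat consumer straight into a kafkacat producer:
kafkacat -b source-kafka:9092 -t apache -C -o beginning -e | kafkacat -b target-kafka:9092 -t apache -P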
Disclaimer: I work for Confluent.

Streaming Kafka Messages to MySQL Database

I want to write Kafka messages to a MySQL database. There is an example in this link. In that example, Apache Flume is used to consume messages and write them to MySQL. I'm using the same code, and when I run the flume-ng agent, the event always becomes null.
My flume.conf.properties file is:
agent.sources=kafkaSrc
agent.channels=channel1
agent.sinks=jdbcSink
agent.channels.channel1.type=org.apache.flume.channel.kafka.KafkaChannel
agent.channels.channel1.brokerList=localhost:9092
agent.channels.channel1.topic=kafkachannel
agent.channels.channel1.zookeeperConnect=localhost:2181
agent.channels.channel1.capacity=10000
agent.channels.channel1.transactionCapacity=1000
agent.channels.channel1.parseAsFlumeEvent=false
agent.sources.kafkaSrc.type = org.apache.flume.source.kafka.KafkaSource
agent.sources.kafkaSrc.channels = channel1
agent.sources.kafkaSrc.zookeeperConnect = localhost:2181
agent.sources.kafkaSrc.topic = kafka-mysql
agent.sinks.jdbcSink.type = com.stratio.ingestion.sink.jdbc.JDBCSink
agent.sinks.jdbcSink.connectionString = jdbc:mysql://127.0.0.1:3306/test?useSSL=false
agent.sinks.jdbcSink.username=root
agent.sinks.jdbcSink.password=pass
agent.sinks.jdbcSink.batchSize = 10
agent.sinks.jdbcSink.channel =channel1
agent.sinks.jdbcSink.sqlDialect=MYSQL
agent.sinks.jdbcSink.driver=com.mysql.jdbc.Driver
agent.sinks.jdbcSink.sql=INSERT INTO kafkamsg(msg) VALUES(${body:varchar})
Where am I wrong?
Thanks.
In my reference example, Flume listens to Kafka on the kafka-mysql topic, but this configuration works with the kafkachannel topic. So we need to produce messages to the kafkachannel topic; I don't know why.
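If it helps to confirm that, a quick test (assuming a local broker on the default port, as in the config above) is to produce directly to the channel's topic and watch whether the JDBC sink picks it up:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic kafkachannel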

Flume Kafka sink not able to write complete messages to Kafka Broker

I have written a process where I'm generating messages through a custom Flume source and using the Flume Kafka sink provided by Hortonworks to write into Kafka brokers.
During this process I noticed that if the Kafka broker is already running and I then start my Flume agent, it delivers each and every message to the Kafka broker properly. But when I start the Kafka broker while the Flume agent is already running, the Kafka broker does not receive all the messages.
When I run the Kafka console consumer to check the count of messages received, I noticed it drops a few records from the beginning and a few from the end.
I have tried multiple mix-and-match changes in Flume.conf, but it is still not working as expected.
Below are the configuration parameters I have provided in Flume.conf:
agent.channels = firehose-channel
agent.sources = stress-source
agent.sinks = kafkasink
#################################
# Benchmark Souce Configuration #
#################################
agent.sources.stress-source.type=com.kohls.flume.source.stress.BenchMarkTestScenriao
agent.sources.stress-source.size=5000
agent.sources.stress-source.maxTotalEvents=30000
agent.sources.stress-source.batchSize=200
agent.sources.stress-source.throughputThreshold=4000
agent.sources.stress-source.throughputControlSeconds=1
agent.sources.stress-source.channels=firehose-channel
#################################
# Firehose Channel Configuration #
#################################
agent.channels.firehose-channel.type = file
agent.channels.firehose-channel.checkpointDir = /data/flume/checkpoint
agent.channels.firehose-channel.dataDirs = /data/flume/data
agent.channels.firehose-channel.capacity = 10000
agent.channels.firehose-channel.transactionCapacity = 10000
agent.channels.firehose-channel.useDualCheckpoints=1
agent.channels.firehose-channel.backupCheckpointDir=/data/flume/backup
############################################
# Firehose Sink Configuration - Kafka Sink #
############################################
agent.sinks.kafkasink.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.kafkasink.topic = backoff_test_17
agent.sinks.kafkasink.channel=firehose-channel
agent.sinks.kafkasink.brokerList = sandbox.hortonworks.com:6667
agent.sinks.kafkasink.batchsize = 200
agent.sinks.kafkasink.requiredAcks = 1
agent.sinks.kafkasink.kafka.producer.type = async
agent.sinks.kafkasink.kafka.batch.num.messages = 200
I have also tried to analyze the Flume log and noticed that the Flume metrics properly show the PUT and TAKE counts.
Please let me know if anyone has any pointers to solve this issue. Appreciate your help in advance.