Kafka 0.72, minimum number of brokers

I'm trying to create a Kafka producer that sends messages directly to Kafka brokers (and not through ZooKeeper).
I know that the better practice is to work with ZooKeeper, but for the moment I would like to send messages directly to a broker.
To do that, I'm setting the "broker.list" property as described in the documentation. However, it appears to require a minimum of 3 brokers to work (otherwise I get an exception).
In the Kafka source code I can see:
if(brokerInfo.size < 3) throw new InvalidConfigException("broker.list has invalid value")
This is odd because my data center has only 2 Kafka nodes (and 3 ZooKeeper nodes). What can I do in this case?
Is there a way around this?

brokerInfo is obtained by splitting the info of one individual broker; its size is NOT the number of brokers. If you check the source code more carefully, you will see something like
// check if each individual broker info is valid => (brokerId: brokerHost: brokerPort)
and then each entry is split as below
brokerInfoList.foreach { bInfo =>
  val brokerInfo = bInfo.split(":")
  if(brokerInfo.size < 3) throw new InvalidConfigException("broker.list has invalid value")
}
So every single broker entry is expected to have an id, a host name and a port, separated by the : delimiter.
As for the number of brokers, it only does this
val brokerInfoList = config.brokerList.split(",")
if(brokerInfoList.size == 0) throw new InvalidConfigException("broker.list is empty")
So you should be fine, I guess: just pass a single broker and it should work. Let us know how it goes.
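To make the two checks easier to tell apart, here is a minimal Python sketch of the same validation logic. It is not Kafka's code: the names parse_broker_list and InvalidConfigError are hypothetical, and converting the id and port to integers is an assumption added for illustration.

```python
class InvalidConfigError(ValueError):
    """Hypothetical stand-in for Kafka's InvalidConfigException."""


def parse_broker_list(broker_list):
    """Parse 'id:host:port,id:host:port' into (broker_id, host, port) tuples."""
    # Corresponds to the "broker.list is empty" check quoted above.
    # (Python's "".split(",") yields [""], so the raw string is tested instead.)
    if not broker_list.strip():
        raise InvalidConfigError("broker.list is empty")
    brokers = []
    for entry in broker_list.split(","):
        fields = entry.strip().split(":")
        # The "< 3" check counts the fields of ONE entry, not the brokers.
        if len(fields) < 3:
            raise InvalidConfigError("broker.list has invalid value")
        broker_id, host, port = fields[0], fields[1], fields[2]
        # Assumption for this sketch: id and port are numeric.
        brokers.append((int(broker_id), host, int(port)))
    return brokers


# A single broker passes, as long as it carries all three fields.
print(parse_broker_list("0:localhost:9092"))  # [(0, 'localhost', 9092)]
```

A value such as "localhost:9092" has only two fields, so it fails the per-entry check regardless of how many brokers are listed.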

Apparently, when writing
props.put("broker.list", "0:" + <host:port>);
it works (I added the "0:" prefix to the original string).
I found this in section 9 of the quick start guide.
I'm not sure I understand it: maybe this zero is the partition number(?), or maybe something else (it would be nice if someone could shed some light here).

How to debug further this dropped record in apache beam?

I am seeing intermittently dropped records (only for error messages, not for success ones). We have a test case that intermittently fails/passes because of a lost record. We are using "org.apache.beam.sdk.testing.TestPipeline.java" in the test case. This is the relevant setup code, which is where I have tracked the dropped record to:
PCollectionTuple processed = records
    .apply("Process RosterRecord", ParDo.of(new ProcessRosterRecordFn(factory))
        .withOutputTags(TupleTags.OUTPUT_INTEGER, TupleTagList.of(TupleTags.FAILURE))
    );
errors = errors.and(processed.get(TupleTags.FAILURE));
PCollection<OrderlyBeamDto<Integer>> validCounts = processed.get(TupleTags.OUTPUT_INTEGER);
PCollection<OrderlyBeamDto<Integer>> errorCounts = errors
    .apply("Flatten Roster File Error Count", Flatten.pCollections())
    .apply("Publish Errors", ParDo.of(new ErrorPublisherFn(factory)));
The relevant code in ProcessRosterRecordFn.java is this:
if(dto.hasValidationErrors()) {
    RosterIngestError error = new RosterIngestError(record.getRowNumber(), record.toTitleValue());
    error.getValidationErrors().addAll(dto.getValidationErrors());
    error.getOldValidationErrors().addAll(dto.getOldValidationErrors());
    log.info("Tagging record row number="+record.getRowNumber());
    c.output(TupleTags.FAILURE, new OrderlyBeamDto<>(error));
    return;
}
I see the "Tagging record row number" log line for both of the 2 rows that fail, including the lost record. However, ErrorPublisherFn.java logs on its first line immediately after receiving each message, and there we SOMETIMES receive only 1 of the 2 rows. When we receive both, the test passes. The test is very flaky in this regard.
Apache Beam's thread naming is really annoying (all threads have the same name), so I added the thread hash code to the logback output to get more insight. I don't see anything useful, and ErrorPublisherFn could publish #4 on any thread anyway.
OK, so now the big question: what can I add to figure out why this record is being dropped INTERMITTENTLY?
Do I have to debug Apache Beam itself? Can I insert other functions or make changes to figure out why this error is 'sometimes' lost on some test runs and not on others?
EDIT: Thankfully, this set of tests does not test upstream errors, so the line "errors = errors.and(processed.get(TupleTags.FAILURE));" can be removed, which in turn forces me to remove ".apply("Flatten Roster File Error Count", Flatten.pCollections())". With those 2 lines removed, the issue goes away for 10 test runs in a row (i.e. with this flakiness I can't say for certain that it is gone). Are we doing something wrong in the join and flatten? I checked the error structure, and rowNumber is part of equals and hashCode, so there should be no duplicates; I am also not sure why duplicate objects would cause an intermittent failure.
What more can be done to debug this and figure out why the join is not working in the TestPipeline?
How can I get insight into the flatten and join so I can debug why we are losing an event, and why we lose it only 'sometimes'?
Is this a windowing issue, even though our job starts with a file to read in and we just want to process that file? We wanted a constantly available Dataflow stream because Google kept running into limits, but perhaps this was the wrong decision?

With Akka Stream, how to dynamically duplicate a flow?

I'm running a live video streaming server. There's an Array[Byte] video source. Note that I can't get 2 connections to my video source. I want every client connecting to my server to receive this same stream, with a buffer discarding the old frames.
I tried using a BroadcastHub like this:
val source =
  Source.fromIterator(() => myVideoStreamingSource.zipWithIndex)
val runnableGraph =
  source.toMat(BroadcastHub.sink(bufferSize = 2))(Keep.right)
runnableGraph.run().to(Sink.foreach { index =>
  println(s"client A reading frame #$index")
}).run()
runnableGraph.run().to(Sink.foreach { index =>
  println(s"client B reading frame #$index")
}).run()
I get:
client A reading frame #0
client B reading frame #1
client A reading frame #2
client B reading frame #3
We see that the main stream is partitioned between the two clients, whereas I'd expect both clients to see all of the source stream's frames.
Did I miss something, or is there another solution?
The issue is the combination of Iterator with BroadcastHub. I assume your myVideoStreamingSource is something like:
val myVideoStreamingSource = Iterator("A","B","C","D","E")
I'll now quote from BroadcastHub.sink:
Every new materialization of the [[Sink]] results in a new, independent hub, which materializes to its own [[Source]] for consuming the [[Sink]] of that materialization.
The issue in your case is that, at that point, nothing has consumed the data from the iterator yet.
The thing with an iterator is that once you have consumed its data, you can't go back to the beginning. Add to that the fact that both graphs run in parallel, and it looks like it "divides" the elements between the two. But actually that is completely random. For example, if you add a sleep of 1 second between starting client A and client B, the only client that prints will be A.
To make that work, you need to create the source from something that can be iterated more than once, for example a Seq or a List. The following will do:
val myVideoStreamingSource = Seq("A","B","C","D","E")
val source = Source.fromIterator(() => myVideoStreamingSource.zipWithIndex.iterator)
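The one-shot nature of iterators can be shown without Akka at all. The following Python sketch only illustrates that behaviour; it is not Akka Streams code, and the helper run_two_clients is a hypothetical name. The two clients here alternate strictly, whereas in the real stream the interleaving depends on timing.

```python
def run_two_clients(make_iterator, rounds):
    """Alternate two clients, each pulling from its own make_iterator() result."""
    client_a, client_b = make_iterator(), make_iterator()
    seen = {"A": [], "B": []}
    for _ in range(rounds):
        for name, it in (("A", client_a), ("B", client_b)):
            item = next(it, None)  # None once the iterator is exhausted
            if item is not None:
                seen[name].append(item)
    return seen


frames = ["f0", "f1", "f2", "f3"]

# One-shot iterator: the factory hands back the SAME object every time,
# so the two clients drain it together and each sees only part of it.
one_shot = iter(frames)
split = run_two_clients(lambda: one_shot, rounds=4)

# Re-iterable sequence: the factory builds a fresh iterator per client,
# so each client sees every frame.
full = run_two_clients(lambda: iter(frames), rounds=4)

print(split)  # {'A': ['f0', 'f2'], 'B': ['f1', 'f3']}
print(full)   # {'A': ['f0', 'f1', 'f2', 'f3'], 'B': ['f0', 'f1', 'f2', 'f3']}
```

Note that copying the data into a sequence only works for finite data that fits in memory, so it demonstrates the cause rather than a design for a live video source.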

Kafka Streams stops if I use persistentKeyValueStore but works fine with inMemoryKeyValueStore (running in Docker container)

I'm obviously a beginner with Kafka/Kafka Streams. I just need to read specific messages from a few topics, given their id. While our actual topology is fairly complex, this Streams app just needs to achieve this single simple goal.
This is how a store is created:
final StreamsBuilder streamsBuilder = new StreamsBuilder();
streamsBuilder.table(
    topic,
    Materialized.<String, String>as(persistentKeyValueStore(storeNameOf(topic)))
        .withKeySerde(Serdes.String()).withValueSerde(Serdes.String())
        .withCachingDisabled()
    // In-memory alternative (this one works):
    // Materialized.<String, String>as(inMemoryKeyValueStore(storeNameOf(topic)))
    //     .withKeySerde(Serdes.String()).withValueSerde(Serdes.String())
    //     .withCachingDisabled()
);
KafkaStreams kafkaStreams = new KafkaStreams(streamsBuilder.build(), new Properties() {{ /** config items go here**/ }});
kafkaStreams.start();
// logic for awaiting kafkaStreams to reach `RUNNING` state as well as InvalidStateStoreException handling (by retrying) is omitted for simplicity:
ReadOnlyKeyValueStore<String, String> replyStore = kafkaStreams.store(storeNameOf(topicName), QueryableStoreTypes.keyValueStore());
So, when using the commented-out inMemoryKeyValueStore materialization, replyStore is successfully created and I can query the values within it without a problem.
With persistentKeyValueStore, the last line fails with java.lang.IllegalStateException: KafkaStreams is not running. State is ERROR. Note that I do check that KafkaStreams is in state RUNNING before the store call; the ERROR state is somehow reached within the call.
Do you think I might have missed anything when setting up the persistent store? Debugging hints would also help greatly; I must confess I'm quite stuck here.
Thanks!
Edit: The execution happens inside a Docker container. This was quite relevant, but I initially omitted it.
As Matthias J. Sax pointed out in a comment, registering an uncaughtExceptionHandler helped greatly in debugging the problem.
The actual issue was an incompatibility between RocksDB and the Docker image I was using (so I changed from openjdk:8-jdk-alpine to anapsix/alpine-java:8).
Related:
https://issues.apache.org/jira/browse/KAFKA-4988
UnsatisfiedLinkError: /tmp/snappy-1.1.4-libsnappyjava.so Error loading shared library ld-linux-x86-64.so.2: No such file or directory

Alpakka Kafka stream never getting terminated

We are using Alpakka Kafka streams for consuming events from Kafka. Here is how the stream is defined:
ConsumerSettings<GenericKafkaKey, GenericKafkaMessage> consumerSettings =
    ConsumerSettings
        .create(actorSystem, new KafkaJacksonSerializer<>(GenericKafkaKey.class),
            new KafkaJacksonSerializer<>(GenericKafkaMessage.class))
        .withBootstrapServers(servers).withGroupId(groupId)
        .withClientId(clientId).withProperties(clientConfigs.defaultConsumerConfig());
CommitterSettings committerSettings = CommitterSettings.create(actorSystem)
    .withMaxBatch(20)
    .withMaxInterval(Duration.ofSeconds(30));
Consumer.DrainingControl<Done> control =
    Consumer.committableSource(consumerSettings, Subscriptions.topics(topics))
        .mapAsync(props.getMessageParallelism(), msg ->
            CompletableFuture.supplyAsync(() -> consumeMessage(msg), actorSystem.dispatcher())
                .thenCompose(param -> CompletableFuture.supplyAsync(() -> msg.committableOffset())))
        .toMat(Committer.sink(committerSettings), Keep.both())
        .mapMaterializedValue(Consumer::createDrainingControl)
        .run(materializer);
Here is the piece of code that is shutting down the stream:
CompletionStage<Done> completionStage = control.drainAndShutdown(actorSystem.dispatcher());
completionStage.toCompletableFuture().join();
I tried calling get on the CompletableFuture too, but neither join nor get ever returns. Has anyone else faced a similar problem? Is there something I am doing wrong here?
If you want to control stream termination from outside the stream, you need to use a KillSwitch: https://doc.akka.io/docs/akka/current/stream/stream-dynamic.html
Your usage looks correct and I can't identify anything that would hinder draining.
A common thing to miss with Alpakka Kafka consumers is the stop-timeout, which defaults to 30 seconds.
When using the DrainingControl, you can safely set it to 0 seconds.
See https://doc.akka.io/docs/alpakka-kafka/current/consumer.html#draining-control

Read from Kinesis is giving empty records when run using previous sequence number or timestamp

I am trying to read the messages pushed to a Kinesis stream with the help of the
get_records() and get_shard_iterator() APIs.
My producer keeps pushing records as they are processed at its end, and the consumer runs as a cron job every 30 minutes. So, I tried storing the sequence number of the last message read in my database and using the AFTER_SEQUENCE_NUMBER shard iterator with that sequence number. However, this does not work the second time (the first run successfully reads all messages in the stream) after new messages are pushed.
I also tried using AT_TIMESTAMP with the message timestamp that the producer pushes to the stream as part of the message, storing that timestamp for later use. Again, the first run processes all messages, and from the second run onward I get empty records.
I am really not sure where I am going wrong. I would appreciate it if someone could help me with this.
The code below uses the timestamp, but the same thing is done for the sequence number method too.
def listen_to_kinesis_stream():
    kinesis_client = boto3.client('kinesis', region_name=SETTINGS['region_name'])
    stream_response = kinesis_client.describe_stream(StreamName=SETTINGS['kinesis_stream'])
    for shard_info in stream_response['StreamDescription']['Shards']:
        # Resume from the timestamp stored for this shard, or from the epoch on the first run
        kinesis_stream_status = mongo_coll.find_one({'_id': "DOC_ID"})
        last_read_ts = kinesis_stream_status.get('state', {}).get(
            shard_info['ShardId'], datetime.datetime(1970, 1, 1).strftime("%Y-%m-%dT%H:%M:%S.%f"))
        shard_iterator = kinesis_client.get_shard_iterator(
            StreamName=SETTINGS['kinesis_stream'],
            ShardId=shard_info['ShardId'],
            ShardIteratorType='AT_TIMESTAMP',
            Timestamp=last_read_ts)
        get_response = kinesis_client.get_records(ShardIterator=shard_iterator['ShardIterator'], Limit=1)
        if len(get_response['Records']) == 0:
            continue
        message = json.loads(get_response['Records'][0]['Data'])
        process_resp = process_message(message)
        if process_resp['success'] is False:
            print(process_resp)
        generic_config_coll.update({'_id': "DOC_ID"}, {'$set': {'state.{0}'.format(shard_info['ShardId']): message['ts']}})
        print("Processed {0}".format(message))
        # Keep following NextShardIterator, one record at a time
        while 'NextShardIterator' in get_response:
            get_response = kinesis_client.get_records(ShardIterator=get_response['NextShardIterator'], Limit=1)
            if len(get_response['Records']) == 0:
                break
            message = json.loads(get_response['Records'][0]['Data'])
            process_resp = process_message(message)
            if process_resp['success'] is False:
                print(process_resp)
            mongo_coll.update({'_id': "DOC_ID"}, {'$set': {'state.{0}'.format(shard_info['ShardId']): message['ts']}})
            print("Processed {0}".format(message))
    logger.debug("Processed all messages from Kinesis stream")
    print("Processed all messages from Kinesis stream")
As per my discussion with an AWS technical support person, some get_records() responses can come back with empty records even though more messages follow, and hence it is not a good idea to break when len(get_response['Records']) == 0.
The better approach suggested was to keep a counter for the maximum number of messages to read in a run, and exit the loop after reading that many messages.
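A minimal sketch of that idea, under some assumptions: read_shard and FakeKinesis are hypothetical names, and the max_empty_polls cap is not part of the suggestion above. It is added here so a cron run still ends when fewer than max_records messages exist. The fake client only stands in for boto3 so the loop can be run locally; with a real client the same get_records(ShardIterator=..., Limit=...) call would be used.

```python
import time


def read_shard(client, shard_iterator, max_records=100, max_empty_polls=5, pause=0.0):
    """Collect up to max_records, tolerating a bounded run of empty responses."""
    records, empty_polls = [], 0
    while shard_iterator and len(records) < max_records and empty_polls < max_empty_polls:
        response = client.get_records(ShardIterator=shard_iterator,
                                      Limit=max_records - len(records))
        batch = response["Records"]
        if batch:
            records.extend(batch)
            empty_polls = 0          # only consecutive empty responses count
        else:
            empty_polls += 1         # empty response: keep going instead of breaking
            time.sleep(pause)
        shard_iterator = response.get("NextShardIterator")
    return records


class FakeKinesis:
    """Hypothetical stand-in for the boto3 Kinesis client, for illustration only."""

    def __init__(self, pages):
        self.pages = pages

    def get_records(self, ShardIterator, Limit):
        index = int(ShardIterator)
        following = str(index + 1) if index + 1 < len(self.pages) else None
        return {"Records": self.pages[index][:Limit], "NextShardIterator": following}


# Empty responses sit between the two real records, as described above.
pages = [[], [{"Data": b"1"}], [], [{"Data": b"2"}]]
records = read_shard(FakeKinesis(pages), "0")
print(len(records))  # 2
```

Breaking on the first empty response, as in the original loop, would have returned nothing for this sequence of pages.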