Storm-kafka-mongoDB integration - mongodb

I am continuously reading 500 MB of random tuples from a Kafka producer, and in a Storm topology I insert them into MongoDB using the Mongo Java Driver. The problem is that I am getting really low throughput: 4-5 tuples per second.
Without the DB insert, if I write a simple print statement, I get a throughput of 684 tuples per second. I am planning to run 1 million records from Kafka and check the throughput with the Mongo insert.
I tried to tune it using the setMaxSpoutPending and setMessageTimeoutSecs parameters in the Kafka config:
final SpoutConfig kafkaConf = new SpoutConfig(zkrHosts, kafkaTopic, zkRoot, clientId);
kafkaConf.ignoreZkOffsets=false;
kafkaConf.useStartOffsetTimeIfOffsetOutOfRange=true;
kafkaConf.startOffsetTime=kafka.api.OffsetRequest.LatestTime();
kafkaConf.stateUpdateIntervalMs=2000;
kafkaConf.scheme = new SchemeAsMultiScheme(new StringScheme());
final TopologyBuilder topologyBuilder = new TopologyBuilder();
topologyBuilder.setSpout("kafka-spout", new KafkaSpout(kafkaConf), 1);
topologyBuilder.setBolt("print-messages", new MyKafkaBolt()).shuffleGrouping("kafka-spout");
Config conf = new Config();
conf.setDebug(true);
conf.setMaxSpoutPending(1000);
conf.setMessageTimeoutSecs(30);
Execute method of the bolt:
JSONObject jObj = new JSONObject();
jObj.put("key", input.getString(0));
if (null !=jObj && jObj.size() > 0 ) {
final DBCollection quoteCollection = dbConnect.getConnection().getCollection("stormPoc");
if (quoteCollection != null) {
BasicDBObject dbObject = new BasicDBObject();
dbObject.putAll(jObj);
quoteCollection.insert(dbObject);
// logger.info("inserted in Collection !!!");
} else {
logger.info("Error while inserting data in DB!!!");
}
}
collector.ack(input);

There is a storm-mongodb module for integration with Mongo. Does it not do the job? https://github.com/apache/storm/tree/b07413670fa62fec077c92cb78fc711c3bda820c/external/storm-mongodb
You shouldn't use storm-kafka for Kafka integration, it is deprecated. Use storm-kafka-client instead.
Setting conf.setDebug(true) will impact your processing, as Storm will log a fairly huge amount of text per tuple.

Related

Is there a retention policy for custom state store (RocksDb) with Kafka streams?

I am setting up a new Kafka streams application, and want to use custom state store using RocksDb. This is working fine for putting data in state store and getting a queryable state store from it and iterating over the data, However, after about 72 hours I observe data to be missing from the store. Is there a default retention time on data for state store in Kafka streams or in RocksDb?
I am using a custom state store built on RocksDB so that we can utilize the column family feature, which we can't use with the embedded RocksDB implementation in Kafka Streams. I have implemented the custom store using the KeyValueStore interface, and I have my own StoreSupplier, StoreBuilder, StoreType and StoreWrapper as well.
A changelog topic is created for the application but no data is going to it yet (haven't looked into that problem yet).
Putting data into this custom state store and getting queryable state store from it is working fine. However, I am seeing that data is missing after about 72 hours from the store. I checked by getting the size of the state store directory as well as by exporting the data into files and checking the number of entries.
Using SNAPPY compression and UNIVERSAL compaction
Simple topology:
final StreamsBuilder builder = new StreamsBuilder();
String storeName = "store-name";
List<String> cfNames = new ArrayList<>();
// Hybrid custom store
final StoreBuilder customStore = new RocksDBColumnFamilyStoreBuilder(storeName, cfNames);
builder.addStateStore(customStore);
KStream<String, String> inputstream = builder.stream(
inputTopicName,
Consumed.with(Serdes.String(), Serdes.String()
));
inputstream
.transform(() -> new CurrentTransformer(storeName), storeName);
Topology tp = builder.build();
Snippet from custom store implementation:
RocksDBColumnFamilyStore(final String name, final String parentDir, List<String> columnFamilyNames) {
.....
......
final BlockBasedTableConfig tableConfig = new BlockBasedTableConfig()
.setBlockCache(cache)
.setBlockSize(BLOCK_SIZE)
.setCacheIndexAndFilterBlocks(true)
.setPinL0FilterAndIndexBlocksInCache(true)
.setFilterPolicy(filter)
.setCacheIndexAndFilterBlocksWithHighPriority(true)
.setPinTopLevelIndexAndFilter(true)
;
cfOptions = new ColumnFamilyOptions()
.setCompressionType(CompressionType.SNAPPY_COMPRESSION)
.setCompactionStyle(CompactionStyle.UNIVERSAL)
.setMaxWriteBufferNumber(MAX_WRITE_BUFFERS)
.setOptimizeFiltersForHits(true)
.setLevelCompactionDynamicLevelBytes(true)
.setTableFormatConfig(tableConfig);
columnFamilyDescriptors.add(new ColumnFamilyDescriptor(RocksDB.DEFAULT_COLUMN_FAMILY, cfOptions));
columnFamilyNames.stream().forEach((cfName) -> columnFamilyDescriptors.add(new ColumnFamilyDescriptor(cfName.getBytes(), cfOptions)));
}
@SuppressWarnings("unchecked")
public void openDB(final ProcessorContext context) {
Options opts = new Options()
.prepareForBulkLoad();
options = new DBOptions(opts)
.setCreateIfMissing(true)
.setErrorIfExists(false)
.setInfoLogLevel(InfoLogLevel.INFO_LEVEL)
.setMaxOpenFiles(-1)
.setWriteBufferManager(writeBufferManager)
.setIncreaseParallelism(Math.max(Runtime.getRuntime().availableProcessors(), 2))
.setCreateMissingColumnFamilies(true);
fOptions = new FlushOptions();
fOptions.setWaitForFlush(true);
dbDir = new File(new File(context.stateDir(), parentDir), name);
try {
Files.createDirectories(dbDir.getParentFile().toPath());
db = RocksDB.open(options, dbDir.getAbsolutePath(), columnFamilyDescriptors, columnFamilyHandles);
columnFamilyHandles.stream().forEach((handle) -> {
try {
columnFamilyMap.put(new String(handle.getName()), handle);
} catch (RocksDBException e) {
throw new ProcessorStateException("Error opening store " + name + " at location " + dbDir.toString(), e);
}
});
} catch (RocksDBException e) {
throw new ProcessorStateException("Error opening store " + name + " at location " + dbDir.toString(), e);
}
open = true;
}
The expectation is that the state store (RocksDB) will retain the data indefinitely until it is manually deleted or until the storage disk goes down. I am not aware of Kafka Streams having introduced a TTL for state stores yet.

Kafka consumer is very slow to consume data and only consuming first 500 records

I am trying to integrate MongoDB and Storm-Kafka. The Kafka producer produces data from MongoDB, but the consumer side fails to fetch all of the records: it consumes only 500-600 records out of 1 million.
There are no errors in the log file; the topology is still alive but is not processing further records.
Kafka version: 0.10.*, Storm version: 1.2.1
Do I need to add any configs in the consumer?
conf.put(Config.TOPOLOGY_BACKPRESSURE_ENABLE, false);
conf.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 2048);
conf.put(Config.TOPOLOGY_EXECUTOR_RECEIVE_BUFFER_SIZE, 16384);
conf.put(Config.TOPOLOGY_EXECUTOR_SEND_BUFFER_SIZE, 16384);
BrokerHosts hosts = new ZkHosts(zookeeperUrl);
SpoutConfig spoutConfig = new SpoutConfig(hosts, topic, zkRoot, consumerGroupId);
spoutConfig.scheme = new KeyValueSchemeAsMultiScheme(new StringKeyValueScheme());
spoutConfig.fetchSizeBytes = 25000000;
if (startFromBeginning) {
spoutConfig.startOffsetTime = OffsetRequest.EarliestTime();
} else {
spoutConfig.startOffsetTime = OffsetRequest.LatestTime();
}
return new KafkaSpout(spoutConfig);
}
I want the Kafka spout to read all records in the Kafka topic that were produced by the producer.

How to aggregate data hourly?

Whenever a user favorites some content on our site we collect the events and what we were planning to do is to hourly commit the aggregated favorites of a content and update the total favorite count in the DB.
We were evaluating Kafka Streams. Followed the word count example. Our topology is simple, produce to a topic A and read and commit aggregated data to another topic B. Then consume events from Topic B every hour and commit in the DB.
@Bean(name = KafkaStreamsDefaultConfiguration.DEFAULT_STREAMS_CONFIG_BEAN_NAME)
public StreamsConfig kStreamsConfigs() {
Map<String, Object> props = new HashMap<>();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "favorite-streams");
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, WallclockTimestampExtractor.class.getName());
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, brokerAddress);
return new StreamsConfig(props);
}
@Bean
public KStream<String, String> kStream(StreamsBuilder kStreamBuilder) {
StreamsBuilder builder = streamBuilder();
KStream<String, String> source = builder.stream(topic);
source.flatMapValues(value -> Arrays.asList(value.toLowerCase(Locale.getDefault()).split("\\W+")))
.groupBy((key, value) -> value)
.count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>> as("counts-store")).toStream()
.to(topic + "-grouped", Produced.with(Serdes.String(), Serdes.Long()));
Topology topology = builder.build();
KafkaStreams streams = new KafkaStreams(topology, kStreamsConfigs());
streams.start();
return source;
}
@Bean
public StreamsBuilder streamBuilder() {
return new StreamsBuilder();
}
However, when I consume Topic B, it gives me the aggregated data from the beginning. My question is: can we have some provision whereby I consume the previous hour's grouped data, commit it to the DB, and then Kafka forgets about the previous hour's data and gives new data each hour rather than a cumulative sum? Is the topology design correct, or can we do something better?
If you want to get one aggregation result per hour, you can use a windowed aggregation with a window size of 1 hour.
stream.groupBy(...)
.windowedBy(TimeWindows.of(1 * 3600 * 1000))
.count(...)
Check the docs for more details: https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#windowing
The output type is Windowed<String> for the key (not String). You need to provide a custom Windowed<String> Serde, or convert the key type. Consult SessionWindowsExample.
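To make concrete what a one-hour tumbling window does to the counts, here is a minimal plain-Java sketch. It does not use the Kafka Streams API; the class name HourlyFavoriteCounts, its methods, and the "contentId@windowStart" key format are hypothetical and only illustrate the idea that each event lands in exactly one epoch-aligned hourly bucket, so each hour gets its own count instead of a running total.

```java
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration of tumbling one-hour windows (hypothetical, not the Kafka Streams API). */
class HourlyFavoriteCounts {
    static final long WINDOW_MS = 60 * 60 * 1000L;

    // Key is "contentId@windowStartMs", value is the number of events in that window.
    private final Map<String, Long> counts = new TreeMap<>();

    /** Start of the epoch-aligned window that contains the timestamp (assumes timestampMs >= 0). */
    static long windowStart(long timestampMs) {
        return timestampMs - (timestampMs % WINDOW_MS);
    }

    /** Counts one favorite event in the window its timestamp falls into. */
    void record(String contentId, long timestampMs) {
        counts.merge(contentId + "@" + windowStart(timestampMs), 1L, Long::sum);
    }

    /** Count for one content id in the window starting at windowStartMs (0 if none). */
    long count(String contentId, long windowStartMs) {
        return counts.getOrDefault(contentId + "@" + windowStartMs, 0L);
    }

    public static void main(String[] args) {
        HourlyFavoriteCounts c = new HourlyFavoriteCounts();
        c.record("article-1", 10 * 60 * 1000L); // minute 10, first window
        c.record("article-1", 50 * 60 * 1000L); // minute 50, first window
        c.record("article-1", 70 * 60 * 1000L); // minute 70, second window
        System.out.println("first hour:  " + c.count("article-1", 0L));
        System.out.println("second hour: " + c.count("article-1", WINDOW_MS));
    }
}
```

A consumer of the output topic can then write the count for a finished window to the DB and add it to the stored total, because counts from different windows never overlap.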

Spark Streaming from Kafka topic throws offset out of range with no option to restart the stream

I have a streaming job running on Spark 2.1.1, polling Kafka 0.10. I am using the Spark KafkaUtils class to create a DStream, and everything works fine until data ages out of the topic because of the retention policy. The problem comes when I stop my job to make some changes: if any data has aged out of the topic in the meantime, I get an error saying that my offsets are out of range.
I have done a lot of research, including looking at the Spark source code, and I see lots of comments like those in SPARK-19680, basically saying that data should not be lost silently, so auto.offset.reset is ignored by Spark. My big question, though, is what can I do now? My topic will not poll in Spark; it dies on startup with the offsets exception, and I don't know how to reset the offsets so that my job can start again.
I have not enabled checkpoints, since I read that those are unreliable for this use. I used to have a lot of code to manage offsets, but it appears that Spark ignores requested offsets if there are any committed, so I am currently managing offsets like this:
val stream = KafkaUtils.createDirectStream[String, T](
ssc,
PreferConsistent,
Subscribe[String, T](topics, kafkaParams))
stream.foreachRDD { (rdd, batchTime) =>
val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
Log.debug("processing new batch...")
val values = rdd.map(x => x.value())
val incomingFrame: Dataset[T] = SparkUtils.sparkSession.createDataset(values)(consumer.encoder()).persist
consumer.processDataset(incomingFrame, batchTime)
stream.asInstanceOf[CanCommitOffsets].commitAsync(offsets)
}
ssc.start()
ssc.awaitTermination()
As a workaround I have been changing my group ids but that is really lame. I know this is expected behavior and should not happen, I just need to know how to get the stream running again. Any help would be appreciated.
Here is a block of code I wrote to get around this until a real solution is introduced to spark-streaming-kafka. It resets the offsets for the partitions that have aged out, based on the OffsetResetStrategy you set. Just give it the same Map of params, _params, that you provide to KafkaUtils. Call this before calling KafkaUtils.create****Stream() from your driver.
final OffsetResetStrategy offsetResetStrategy = OffsetResetStrategy.valueOf(_params.get(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG).toString().toUpperCase(Locale.ROOT));
if(OffsetResetStrategy.EARLIEST.equals(offsetResetStrategy) || OffsetResetStrategy.LATEST.equals(offsetResetStrategy)) {
LOG.info("Going to reset consumer offsets");
final KafkaConsumer<K,V> consumer = new KafkaConsumer<>(_params);
LOG.debug("Fetching current state");
final List<TopicPartition> parts = new LinkedList<>();
final Map<TopicPartition, OffsetAndMetadata> currentCommited = new HashMap<>();
for(String topic: this.topics()) {
List<PartitionInfo> info = consumer.partitionsFor(topic);
for(PartitionInfo i: info) {
final TopicPartition p = new TopicPartition(topic, i.partition());
final OffsetAndMetadata m = consumer.committed(p);
parts.add(p);
currentCommited.put(p, m);
}
}
final Map<TopicPartition, Long> begining = consumer.beginningOffsets(parts);
final Map<TopicPartition, Long> ending = consumer.endOffsets(parts);
LOG.debug("Finding what offsets need to be adjusted");
final Map<TopicPartition, OffsetAndMetadata> newCommit = new HashMap<>();
for(TopicPartition part: parts) {
final OffsetAndMetadata m = currentCommited.get(part);
final Long begin = begining.get(part);
final Long end = ending.get(part);
if(m == null || m.offset() < begin) {
LOG.info("Adjusting partition {}-{}; OffsetAndMeta={} Begining={} End={}", part.topic(), part.partition(), m, begin, end);
final OffsetAndMetadata newMeta;
if(OffsetResetStrategy.EARLIEST.equals(offsetResetStrategy)) {
newMeta = new OffsetAndMetadata(begin);
} else if(OffsetResetStrategy.LATEST.equals(offsetResetStrategy)) {
newMeta = new OffsetAndMetadata(end);
} else {
newMeta = null;
}
LOG.info("New offset to be {}", newMeta);
if(newMeta != null) {
newCommit.put(part, newMeta);
}
}
}
consumer.commitSync(newCommit);
consumer.close();
}
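The decision at the heart of the snippet above can be isolated from the Kafka client and checked on its own. The following is a minimal sketch under stated assumptions: partitions are represented by plain strings, and the class OffsetResetSketch, its Strategy enum, and the method offsetsToReset are hypothetical names, not part of kafka-clients. It returns a new offset only for partitions whose committed offset is missing or older than the earliest retained offset.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, client-free version of the "which offsets must be reset" decision. */
class OffsetResetSketch {
    enum Strategy { EARLIEST, LATEST, NONE }

    /**
     * Returns the offsets to re-commit, keyed by partition name.
     * Assumes 'end' has an entry for every partition present in 'beginning'.
     */
    static Map<String, Long> offsetsToReset(Map<String, Long> committed,
                                            Map<String, Long> beginning,
                                            Map<String, Long> end,
                                            Strategy strategy) {
        Map<String, Long> result = new HashMap<>();
        if (strategy == Strategy.NONE) {
            return result; // like the original snippet, only EARLIEST and LATEST trigger a reset
        }
        for (Map.Entry<String, Long> e : beginning.entrySet()) {
            String partition = e.getKey();
            long begin = e.getValue();
            Long current = committed.get(partition); // null means nothing was ever committed
            if (current == null || current < begin) {
                long target = (strategy == Strategy.EARLIEST) ? begin : end.get(partition);
                result.put(partition, target);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> committed = new HashMap<>();
        committed.put("events-0", 5L);   // aged out: below the earliest retained offset
        committed.put("events-1", 50L);  // still valid, left alone
        Map<String, Long> beginning = new HashMap<>();
        beginning.put("events-0", 10L);
        beginning.put("events-1", 10L);
        Map<String, Long> end = new HashMap<>();
        end.put("events-0", 100L);
        end.put("events-1", 100L);
        System.out.println(offsetsToReset(committed, beginning, end, Strategy.EARLIEST));
    }
}
```

Keeping this logic free of consumer calls makes it easy to unit test before wiring it to committed(), beginningOffsets() and endOffsets() as the snippet above does.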
auto.offset.reset=latest/earliest is applied only when the consumer starts for the first time.
There is a Spark JIRA to resolve this issue; until then we need to live with workarounds.
https://issues.apache.org/jira/browse/SPARK-19680
Try
auto.offset.reset=latest
Or
auto.offset.reset=earliest
earliest: automatically reset the offset to the earliest offset
latest: automatically reset the offset to the latest offset
none: throw exception to the consumer if no previous offset is found for the consumer's group
anything else: throw exception to the consumer.
One more thing that affects which offset values the smallest and largest settings correspond to is the log retention policy. Imagine you have a topic with retention configured to 1 hour. You produce 10 messages, and then an hour later you produce 10 more. The largest offset still points at the end of the log, but the smallest one can no longer be 0: Kafka will already have removed the first 10 messages, so the smallest available offset is 10.
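The arithmetic in this example can be replayed with a toy model. The sketch below is a deliberate simplification and an assumption on my part: real Kafka applies retention to whole log segments, not to individual messages, and the class RetentionSketch with its methods is hypothetical. It only shows why a committed offset below 10 would be out of range after the first hour.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy, per-message model of time-based retention (real Kafka deletes whole segments). */
class RetentionSketch {
    // Each entry is {offset, timestampMs}; the oldest message is at the head.
    private final Deque<long[]> log = new ArrayDeque<>();
    private final long retentionMs;
    private long nextOffset = 0;

    RetentionSketch(long retentionMs) {
        this.retentionMs = retentionMs;
    }

    /** Appends one message written at nowMs and assigns it the next offset. */
    void append(long nowMs) {
        log.addLast(new long[] {nextOffset++, nowMs});
    }

    /** Drops every message whose age has reached the retention time. */
    void expire(long nowMs) {
        while (!log.isEmpty() && nowMs - log.peekFirst()[1] >= retentionMs) {
            log.removeFirst();
        }
    }

    /** Smallest offset that can still be read. */
    long earliestOffset() {
        return log.isEmpty() ? nextOffset : log.peekFirst()[0];
    }

    /** Offset the next message will receive (the log end). */
    long latestOffset() {
        return nextOffset;
    }

    public static void main(String[] args) {
        long hour = 60 * 60 * 1000L;
        RetentionSketch topic = new RetentionSketch(hour);
        for (int i = 0; i < 10; i++) topic.append(0L);   // first 10 messages
        topic.expire(hour);                              // an hour later they have aged out
        for (int i = 0; i < 10; i++) topic.append(hour); // 10 more messages
        System.out.println("earliest=" + topic.earliestOffset() + " latest=" + topic.latestOffset());
    }
}
```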
This problem was solved in Structured Streaming by setting "failOnDataLoss" = "false". It is unclear why there is no such option in the Spark DStream framework.
This is a BIG question for Spark developers!
In our projects, we tried to solve this problem by resetting the offsets from earliest + 5 minutes ... it helps in most cases.

Multiple Streams in Trident Topology

I have multiple OpaqueTridentKafkaSpout instances reading from different Kafka topics. I want the data from all of these streams to go through the same set of Functions. What is the best way to achieve that?
Do I need to create separate streams and pass each tuple to the same set of functions again, like below?
BrokerHosts zk = new ZkHosts(getZooKeeperHosts());
TridentKafkaConfig spoutConf = new TridentKafkaConfig(zk, "Test");
spoutConf.scheme = new SchemeAsMultiScheme(new StringScheme());
OpaqueTridentKafkaSpout kafkaSpout = new OpaqueTridentKafkaSpout(spoutConf);
TridentKafkaConfig spoutConf1 = new TridentKafkaConfig(zk, "Test1");
spoutConf1.scheme = new SchemeAsMultiScheme(new StringScheme());
OpaqueTridentKafkaSpout kafkaSpout1 = new OpaqueTridentKafkaSpout(spoutConf1);
topology.newStream("event", kafkaSpout).each(new Fields("document"), new ExtractDocumentInfo(), new Fields("id", "index", "type"));
topology.newStream("event1", kafkaSpout1).each(new Fields("document"), new ExtractDocumentInfo(), new Fields("id", "index", "type"));
You can merge the streams together, but any failure will cause both spouts to replay the batch.