Multiple Streams in Trident Topology - apache-kafka

I have multiple OpaqueTridentKafkaSpouts reading from different Kafka topics. I want the data from all these streams to go through the same set of Functions. What is the best way to achieve that?
Do I need to create separate streams and pass each tuple through the same set of functions again, like below?
BrokerHosts zk = new ZkHosts(getZooKeeperHosts());
TridentKafkaConfig spoutConf = new TridentKafkaConfig(zk, "Test");
spoutConf.scheme = new SchemeAsMultiScheme(new StringScheme());
TridentKafkaConfig spoutConf1 = new TridentKafkaConfig(zk, "Test1");
spoutConf1.scheme = new SchemeAsMultiScheme(new StringScheme());
OpaqueTridentKafkaSpout kafkaSpout = new OpaqueTridentKafkaSpout(spoutConf);
OpaqueTridentKafkaSpout kafkaSpout1 = new OpaqueTridentKafkaSpout(spoutConf1);
topology.newStream("event", kafkaSpout).each(new Fields("document"), new ExtractDocumentInfo(), new Fields("id", "index", "type"));
topology.newStream("event1", kafkaSpout1).each(new Fields("document"), new ExtractDocumentInfo(), new Fields("id", "index", "type"));

You can merge the streams together, but any failure will cause both spouts to replay the batch.
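If you do merge, a minimal sketch would look like the following (it reuses kafkaSpout and kafkaSpout1 from the question and assumes both spouts emit the same output field, here called "document" as in the question; TridentTopology.merge requires identical output fields on all merged streams):
Stream stream1 = topology.newStream("event", kafkaSpout);
Stream stream2 = topology.newStream("event1", kafkaSpout1);
// merge first, then declare the shared functions only once
topology.merge(stream1, stream2)
        .each(new Fields("document"), new ExtractDocumentInfo(), new Fields("id", "index", "type"));
This keeps a single processing pipeline, with the trade-off mentioned above that a failure causes both spouts to replay their batch.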

Related

Storm-kafka-mongoDB integration

I am continuously reading 500 MB of random tuples from a Kafka producer, and in a Storm topology I am inserting them into MongoDB using the Mongo Java Driver. The problem is that I am getting really low throughput, around 4-5 tuples per second.
Without the DB insert, if I write a simple print statement instead, I get a throughput of 684 tuples per second. I am planning to run 1 million records from Kafka and check the throughput with the Mongo insert.
I tried to tune using the setMaxSpoutPending and setMessageTimeoutSecs params in the Kafka config.
final SpoutConfig kafkaConf = new SpoutConfig(zkrHosts, kafkaTopic, zkRoot, clientId);
kafkaConf.ignoreZkOffsets=false;
kafkaConf.useStartOffsetTimeIfOffsetOutOfRange=true;
kafkaConf.startOffsetTime=kafka.api.OffsetRequest.LatestTime();
kafkaConf.stateUpdateIntervalMs=2000;
kafkaConf.scheme = new SchemeAsMultiScheme(new StringScheme());
final TopologyBuilder topologyBuilder = new TopologyBuilder();
topologyBuilder.setSpout("kafka-spout", new KafkaSpout(kafkaConf), 1);
topologyBuilder.setBolt("print-messages", new MyKafkaBolt()).shuffleGrouping("kafka-spout");
Config conf = new Config();
conf.setDebug(true);
conf.setMaxSpoutPending(1000);
conf.setMessageTimeoutSecs(30);
The execute method of the bolt:
JSONObject jObj = new JSONObject();
jObj.put("key", input.getString(0));
if (null != jObj && jObj.size() > 0) {
    final DBCollection quoteCollection = dbConnect.getConnection().getCollection("stormPoc");
    if (quoteCollection != null) {
        BasicDBObject dbObject = new BasicDBObject();
        dbObject.putAll(jObj);
        quoteCollection.insert(dbObject);
        // logger.info("inserted in Collection !!!");
    } else {
        logger.info("Error while inserting data in DB!!!");
    }
}
collector.ack(input);
There is a storm-mongodb module for integration with Mongo. Does it not do the job? https://github.com/apache/storm/tree/b07413670fa62fec077c92cb78fc711c3bda820c/external/storm-mongodb
You shouldn't use storm-kafka for Kafka integration; it is deprecated. Use storm-kafka-client instead.
Setting conf.setDebug(true) will impact your processing, as Storm will log a fairly huge amount of text per tuple.
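To make the suggestions concrete, here is a minimal sketch of wiring the storm-kafka-client spout to the storm-mongodb insert bolt (classes from org.apache.storm.kafka.spout and org.apache.storm.mongodb; the broker address, topic, connection string and mapped field are placeholders, and the builder API varies slightly between Storm versions):
KafkaSpoutConfig<String, String> kafkaConf = KafkaSpoutConfig
        .builder("localhost:9092", "stormPocTopic")                 // placeholder broker and topic
        .setProp(ConsumerConfig.GROUP_ID_CONFIG, "storm-poc")
        .build();
MongoMapper mapper = new SimpleMongoMapper().withFields("key");     // map the tuple field "key" into the document
MongoInsertBolt mongoBolt = new MongoInsertBolt("mongodb://localhost:27017/stormPoc", "stormPoc", mapper);
final TopologyBuilder topologyBuilder = new TopologyBuilder();
topologyBuilder.setSpout("kafka-spout", new KafkaSpout<>(kafkaConf), 1);
topologyBuilder.setBolt("mongo-insert", mongoBolt, 1).shuffleGrouping("kafka-spout");
Config conf = new Config();
conf.setDebug(false);   // per-tuple debug logging would skew any throughput measurement
This removes both the deprecated storm-kafka dependency and the per-tuple debug logging mentioned above.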

KafkaSpout multithread or not

The Kafka 0.8.x docs show how to consume with multiple threads using the high-level consumer:
Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
topicCountMap.put(topic, new Integer(a_numThreads));
Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap = consumer.createMessageStreams(topicCountMap);
List<KafkaStream<byte[], byte[]>> streams = consumerMap.get(topic);
// now launch all the threads
//
executor = Executors.newFixedThreadPool(a_numThreads);
// now create an object to consume the messages
//
int threadNumber = 0;
for (final KafkaStream stream : streams) {
    executor.execute(new ConsumerTest(stream, threadNumber));
    threadNumber++;
}
But KafkaSpout in Storm does not seem to be multithreaded.
Maybe I should use multiple tasks instead of multiple threads with KafkaSpout:
builder.setSpout(SqlCollectorTopologyDef.KAFKA_SPOUT_NAME, new KafkaSpout(spoutConfig), nThread);
Which one is better? Thanks
Since you mentioned Kafka 0.8.x, I am assuming the KafkaSpout you use is from storm-kafka rather than storm-kafka-client.
The first code snippet uses the high-level consumer API, which can employ multiple threads to consume multiple partitions.
As for the Kafka spout, the effect is much the same, except that Storm uses the low-level consumer, namely SimpleConsumer, and one SimpleConsumer instance is created for each spout executor (task).
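So rather than spawning threads yourself, you scale the spout with its parallelism hint, exactly as in your second snippet; a sketch assuming a topic with three partitions:
// one spout executor, and hence one SimpleConsumer, per partition of the topic
builder.setSpout(SqlCollectorTopologyDef.KAFKA_SPOUT_NAME, new KafkaSpout(spoutConfig), 3);
Parallelism beyond the partition count buys nothing, since the extra executors have no partition to read.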

How to properly read from Ignite Cache

I have the following application (I'm quite new to this framework) and I'd like to see the cache size increasing as it reads messages from the queue, but it stays at 0 all the time.
KafkaStreamer<String, String, String> kafkaStreamer = new KafkaStreamer<>();
Ignition.setClientMode(true);
Ignite ignite = Ignition.start();
Properties settings = new Properties();
// Set a few key parameters
settings.put("bootstrap.servers", "localhost:9092");
settings.put("group.id", "test");
settings.put("zookeeper.connect", "localhost:2181");
settings.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
settings.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
settings.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
settings.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
// Create an instance of ConsumerConfig from the Properties instance
kafka.consumer.ConsumerConfig config = new ConsumerConfig(settings);
IgniteCache<String, String> cache = ignite.getOrCreateCache("myCache");
IgniteDataStreamer<String, String> stmr = ignite.dataStreamer("myCache");
// allow overwriting cache data
stmr.allowOverwrite(true);
kafkaStreamer.setIgnite(ignite);
kafkaStreamer.setStreamer(stmr);
// set the topic
kafkaStreamer.setTopic("test");
// set the number of threads to process Kafka streams
kafkaStreamer.setThreads(1);
// set Kafka consumer configurations
kafkaStreamer.setConsumerConfig(config);
// set decoders
StringDecoder keyDecoder = new StringDecoder(null);
StringDecoder valueDecoder = new StringDecoder(null);
kafkaStreamer.setKeyDecoder(keyDecoder);
kafkaStreamer.setValueDecoder(valueDecoder);
kafkaStreamer.start();
while (true) {
System.out.println(cache.metrics().getSize());
Thread.sleep(200);
}
Can anyone tell what is missing / wrong?
Thanks!
You probably don't consume enough entries to fill up the IgniteDataStreamer buffers. Try setting the auto-flush frequency:
stmr.autoFlushFrequency(1000);
Metrics are disabled by default for performance reasons. You can enable them using CacheConfiguration.setStatisticsEnabled(true) or the statisticsEnabled property in your configuration file.
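Put together, a minimal sketch of both suggestions applied to the code above (the one-second flush interval is an arbitrary example value):
CacheConfiguration<String, String> cacheCfg = new CacheConfiguration<>("myCache");
cacheCfg.setStatisticsEnabled(true);        // enable cache metrics so cache.metrics().getSize() reports real values
IgniteCache<String, String> cache = ignite.getOrCreateCache(cacheCfg);
IgniteDataStreamer<String, String> stmr = ignite.dataStreamer("myCache");
stmr.allowOverwrite(true);
stmr.autoFlushFrequency(1000);              // flush buffered entries at least once per second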

how to find zkroot and clientid for SpoutConfig

I'm trying to connect to a remote Kafka cluster in Storm. I'm using the following code:
Broker brokerForPartition0 = new Broker("208.113.164.114:9091");
Broker brokerForPartition1 = new Broker("208.113.164.115:9092");
Broker brokerForPartition2 = new Broker("208.113.164.117:9093");
GlobalPartitionInformation partitionInfo = new GlobalPartitionInformation();
partitionInfo.addPartition(0, brokerForPartition2); // mapping from partition 0 to brokerForPartition2
partitionInfo.addPartition(1, brokerForPartition0); // mapping from partition 1 to brokerForPartition0
partitionInfo.addPartition(2, brokerForPartition1); // mapping from partition 2 to brokerForPartition1
StaticHosts hosts = new StaticHosts(partitionInfo);
SpoutConfig spoutConfig = new SpoutConfig(hosts, "newImageTest","/brokers","console-consumer-61818");
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);
In the instantiation of spoutConfig, I have to pass the zkRoot and clientId as parameters.
public SpoutConfig(BrokerHosts hosts, String topic, String zkRoot, String id);
Where can I find these two information? Or should I create them?
Thank you!
From this documentation,
Spoutconfig is an extension of KafkaConfig that supports additional
fields with ZooKeeper connection info and for controlling behavior
specific to KafkaSpout. The Zkroot will be used as root to store your
consumer's offset. The id should uniquely identify your spout.
The zkRoot, therefore, should be some ZNode path like /some/path, which will be used to store your consumer's offsets as mentioned.
The id is some string (say, a UUID) that can be used to uniquely identify your spout, as mentioned.
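For example, a sketch reusing the hosts object from the question, with an arbitrary ZNode path and a made-up id (both are placeholders you can choose freely):
// "/kafka-offsets" is any ZNode path reserved for offset storage;
// the id just has to be unique per spout, and keeping it stable across
// redeployments lets the spout find its stored offsets again
SpoutConfig spoutConfig = new SpoutConfig(hosts, "newImageTest", "/kafka-offsets", "image-topology-kafka-spout");
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);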

How to change default kafka SpoutConfig class

I am getting messages of 3 MB from a Kafka topic, but the default limit is 1 MB. I have now raised the Kafka limits from 1 MB to 3 MB by adding the lines below to the consumer.properties and server.properties files.
fetch.message.max.bytes=2048576 ( consumer.properties )
message.max.bytes=2048576 ( server.properties )
replica.fetch.max.bytes=2048576 ( server.properties )
After adding the above lines, the 3 MB messages are going into the Kafka data logs. But Storm is unable to process that 3 MB data; it can only read the default size, i.e. 1 MB.
So how do I change the configuration in order to read/process the 3 MB data? Here is my topology class.
String argument = args[0];
Config conf = new Config();
conf.put(JDBC_CONF, map);
conf.setDebug(true);
conf.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 1);
//set the number of workers
conf.setNumWorkers(3);
TopologyBuilder builder = new TopologyBuilder();
//Setup Kafka spout
BrokerHosts hosts = new ZkHosts("localhost:2181");
String topic = "year1234";
String zkRoot = "";
String consumerGroupId = "group1";
SpoutConfig spoutConfig = new SpoutConfig(hosts, topic, zkRoot, consumerGroupId);
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);
builder.setSpout("KafkaSpout", kafkaSpout,1);
builder.setBolt("user_details", new Parserspout(),1).shuffleGrouping("KafkaSpout");
builder.setBolt("bolts_user", new bolts_user(cp),1).shuffleGrouping("user_details");
Add the following lines to your SpoutConfig:
SpoutConfig spoutConfig = new SpoutConfig(hosts, topic, zkRoot, consumerGroupId);
spoutConfig.fetchSizeBytes = 3048576;   // fetch request size, defaults to 1 MB
spoutConfig.bufferSizeBytes = 3048576;  // SimpleConsumer socket buffer size, defaults to 1 MB
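Wired into the topology from the question, it would look roughly like this; 4194304 (4 MB) is just an example value chosen to sit comfortably above the 3 MB messages, since 3048576 bytes is actually slightly under 3 MB (3 x 1024 x 1024 = 3145728 bytes):
SpoutConfig spoutConfig = new SpoutConfig(hosts, topic, zkRoot, consumerGroupId);
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
spoutConfig.fetchSizeBytes = 4194304;   // 4 MB
spoutConfig.bufferSizeBytes = 4194304;  // 4 MB
KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);
builder.setSpout("KafkaSpout", kafkaSpout, 1);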