I have just successfully set up my Storm topology with a single node on a single machine. I use KafkaSpout as below:
String zkHostPort = "localhost:2181";
String topic = "sentences";
String zkRoot = "/kafka-sentence-spout";
String zkSpoutId = "sentence-spout";
ZkHosts zkHosts = new ZkHosts(zkHostPort);
SpoutConfig spoutCfg = new SpoutConfig(zkHosts, topic, zkRoot, zkSpoutId);
KafkaSpout kafkaSpout = new KafkaSpout(spoutCfg);
return kafkaSpout;
Now I have set up a ZooKeeper cluster (three nodes: server1.com:2181, server2.com:2181, server3.com:2181) and a Kafka cluster (three nodes). How should I change the code of the Storm topology for this setup? Please help me!
Please use the configuration below:
String zkHostPort = "server1.com:2181,server2.com:2181,server3.com:2181";
String topic = "sentences";
String zkRoot = "/kafka-sentence-spout";
String zkSpoutId = "sentence-spout";
ZkHosts zkHosts = new ZkHosts(zkHostPort);
SpoutConfig spoutCfg = new SpoutConfig(zkHosts, topic, zkRoot, zkSpoutId);
KafkaSpout kafkaSpout = new KafkaSpout(spoutCfg);
return kafkaSpout;
Note: the most common mistake here is putting a space after the comma between hosts; there must be no spaces between hosts.
Correct:
server1.com:2181,server2.com:2181,server3.com:2181
Wrong:
server1.com:2181, server2.com:2181, server3.com:2181
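As a side note (not part of the original answer, just a plain-Java sketch assuming java.util.Arrays, java.util.List and java.util.stream.Collectors are imported), the connection string can also be built programmatically so stray spaces never slip in:
List<String> zkServers = Arrays.asList("server1.com:2181", "server2.com:2181", "server3.com:2181");
String zkHostPort = zkServers.stream()
        .map(String::trim)                     // drop any accidental whitespace
        .collect(Collectors.joining(","));     // join with a comma only, no space
ZkHosts zkHosts = new ZkHosts(zkHostPort);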
We have a Kafka Streams process that takes a topic as input and writes timed windows to an output topic; the following code is being used. I would like to print the TimeWindowedKStream (groupedStream) and the KTable (aggregatedTable) and see their output for debugging purposes.
String intopic = input_topic;
long window = 60;
String outtopic = output_topic;
final Serde<String> stringSerde = Serdes.String();
Properties property = new Properties();
property.put("bootstrap.servers", "127.0.0.1:9092");
property.put("group.id", "test-consumer-group");
property.put("application.id", "sliding-window-min-bar");
property.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, stringSerde.getClass().getName());
property.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, stringSerde.getClass().getName());
Duration windowSizeMs = Duration.ofMinutes(window);
StreamsBuilder builder = new StreamsBuilder();
System.out.println(intopic);
KStream<String, String> equitybar = builder.stream(intopic, Consumed.with(stringSerde, stringSerde));
System.out.println(equitybar);
equitybar.print(Printed.toSysOut());
// convert string of csv to a double on the mean value
KStream<String, String> transformedbar = equitybar
.map((key, value) -> KeyValue.pair(key, value.substring(1,value.length()-2).split(",")[2]));
System.out.println(transformedbar);
transformedbar.print(Printed.toSysOut());
// group by equity and sliding window
System.out.println(windowSizeMs);
System.out.println(TimeWindows.of(windowSizeMs).advanceBy(advanceMs));
TimeWindowedKStream<String, String> groupedStream = transformedbar.groupByKey().windowedBy(TimeWindows.of(windowSizeMs).advanceBy(advanceMs));
System.out.println(groupedStream);
KTable<Windowed<String>, String> aggregatedTable = groupedStream.aggregate(
() -> "|",
(aggKey, newValue, aggValue) -> aggValue + newValue.trim() + "|") ;
I tried to print it using the print call that Kafka Streams provides, groupedStream.print(Printed.toSysOut()), but it doesn't seem to work.
Thanks.
KGroupedStream and TimeWindowedKStream are "just" helper classes that allow the DSL to present a fluent API for chaining operators without too many overloads on a single class.
In the DSL there are only two main abstractions, KStream and KTable, that are actual first-class data containers. Thus, what you want to do is not possible directly.
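That said, if the goal is only to inspect the windowed aggregation result, the KTable returned by aggregate() can be converted back to a KStream and printed. A minimal sketch (not part of the original answer, reusing aggregatedTable from the question; the key formatting is just illustrative):
// Every update to the windowed aggregate flows downstream and is printed.
aggregatedTable
        .toStream()
        .map((windowedKey, value) -> KeyValue.pair(
                windowedKey.key() + "@" + windowedKey.window().start(), value))
        .print(Printed.toSysOut());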
I'm trying to read data from a Kafka topic into a DataStream, register the DataStream as a table, and then use TableEnvironment.sqlQuery("SQL") to query the data. When TableEnvironment.execute() runs there is no error but also no output.
public static void main(String[] args){
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.enableCheckpointing(5000);
StreamTableEnvironment tableEnvironment = StreamTableEnvironment.create(env);
FlinkKafkaConsumer<Person> consumer = new FlinkKafkaConsumer(
"topic",
new JSONDeserializer(),
Job.getKafkaProperties
);
consumer.setStartFromEarliest();
DataStream<Person> stream = env.addSource(consumer).filter(x -> x.status != -1).assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks<Person>(){
long current = 0L;
final long expire = 1000L;
@Override
public Watermark getCurrentWatermark(){
return new Watermark(current - expire);
}
@Override
public long extractTimestamp(Person person){
long timestamp = person.createTime;
current = Math.max(timestamp,current);
return timestamp;
}
});
//set createTime as rowtime
tableEnvironment.registerDataStream("Table_Person",stream,"name,age,sex,createTime.rowtime");
Table res = tableEnvironment.sqlQuery("select TUMBLE_END(createTime,INTERVAL '1' minute) as registTime,sex,count(1) as total from Table_Person group by sex,TUMBLE(createTime,INTERVAL '1' minute)");
tableEnvironment.toAppendStream(t,Types.Row(new TypeInformation[]{Types.SQL_TIMESTAMP,Types.STRING,Types.LONG})).print();
tableEnvironment.execute("person-query");
}
When I execute it, nothing is printed to the console and no exceptions are thrown;
but if I use fromCollection() as the source, the program does print something to the console.
Can you please guide me to fix this?
dependencies:
flink-streaming-java_2.11 version:1.9.0-csa1.0.0.0;
flink-streaming-scala_2.11 version:1.9.0-csa1.0.0.0;
flink-connector-kafka_2.11 version:1.9.0-csa1.0.0.0;
flink-table-api-java-bridge_2.11 version:1.9.0-csa1.0.0.0;
flink-table-planner_2.11 version:1.9.0-csa1.0.0.0;
In the code where you convert the SQL query's result back to a DataStream, you need to pass res rather than t to toAppendStream. (I can't see how the code you've posted would even compile; where is t declared?) And I think you should be able to do this
Table res = tableEnvironment.sqlQuery("select TUMBLE_END(createTime,INTERVAL '1' minute) as registTime,sex,count(1) as total from Table_Person group by sex,TUMBLE(createTime,INTERVAL '1' minute)");
tableEnvironment.toAppendStream(res,Row.class).print();
rather than bothering with the TypeInformation.
flume1.sources = kafka-source-1
flume1.channels = hdfs-channel-1
flume1.sinks = hdfs-sink-1
flume1.sources.kafka-source-1.type = org.apache.flume.source.kafka.KafkaSource
flume1.sources.kafka-source-1.kafka.bootstrap.servers= kafka bootstrap servers
flume1.sources.kafka-source-1.kafka.topics = topic
flume1.sources.kafka-source-1.batchSize = 1000
flume1.sources.kafka-source-1.channels = hdfs-channel-1
flume1.sources.kafka-source-1.kafka.consumer.group.id=group_id
flume1.channels.hdfs-channel-1.type = memory
flume1.sinks.hdfs-sink-1.channel = hdfs-channel-1
flume1.sinks.hdfs-sink-1.type = hdfs
flume1.sinks.hdfs-sink-1.hdfs.writeFormat = Text
flume1.sinks.hdfs-sink-1.hdfs.fileType = DataStream
flume1.sinks.hdfs-sink-1.hdfs.filePrefix = file_prefix
flume1.sinks.hdfs-sink-1.hdfs.fileSuffix = .avro
flume1.sinks.hdfs-sink-1.hdfs.inUsePrefix = tmp/
flume1.sinks.hdfs-sink-1.hdfs.useLocalTimeStamp = true
flume1.sinks.hdfs-sink-1.hdfs.path = /user/directory/ingest_date=%y-%m-%d
flume1.sinks.hdfs-sink-1.hdfs.rollCount=0
flume1.sinks.hdfs-sink-1.hdfs.rollSize=1000000
flume1.channels.hdfs-channel-1.capacity = 10000
flume1.channels.hdfs-channel-1.transactionCapacity = 10000
I am using Flume to consume Avro data from Kafka, and with the HDFS sink I am storing it in HDFS.
I am trying to create a Hive table over the Avro data in HDFS.
I can run msck repair table and see the partitions added to the Hive metastore.
When I run select * from table_name limit 1; it fetches a record,
but when I try fetching anything more than that I get:
Failed with exception java.io.IOException:org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40
I have also tried setting the properties below:
flume1.sinks.hdfs-sink-1.serializer = org.apache.flume.sink.hdfs.AvroEventSerializer$Builder
flume1.sinks.hdfs-sink-1.deserializer.schemaType = LITERAL
flume1.sinks.hdfs-sink-1.serializer.schemaURL = file:///schemadirectory
P.S. I use Kafka Connect to push data to the Kafka topic.
I am trying a windowed count with the word-count example. It works fine except that the output is partially unreadable.
Code:
StringSerializer stringSerializer = new StringSerializer();
StringDeserializer stringDeserializer = new StringDeserializer();
WindowedSerializer<String> windowedSerializer = new WindowedSerializer<>(stringSerializer);
WindowedDeserializer<String> windowedDeserializer = new WindowedDeserializer<>(stringDeserializer);
Serde<Windowed<String>> windowedSerde = Serdes.serdeFrom(windowedSerializer, windowedDeserializer);
TimeWindows window = TimeWindows.of(TimeUnit.MINUTES.toMillis(1)).advanceBy(TimeUnit.MINUTES.toMillis(1));
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> textLines = builder.stream("streams-plaintext-input");
KTable<Windowed<String>, Long> wordCounts = textLines
.flatMapValues(textLine -> Arrays.asList(textLine.toLowerCase().split("\\W+")))
.groupBy((key, word) -> word)
.windowedBy(window)
.count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("counts-store"));
wordCounts.toStream().to("streams-plaintext-output", Produced.with(windowedSerde, Serdes.Long()));
KafkaStreams streams = new KafkaStreams(builder.build(), config);
streams.start();
Output:
kafka c[?? 1
yaya c[?? 1
kafka c[?? 2
I guess the unreadable part might be the window duration.
What can I do to make it readable?
EDIT:
I tried to use windowedSerde to print the output:
KStream<Windowed<String>, Long> output = builder.stream("streams-plaintext-output");
output.print(windowedSerde, Serdes.Long());
It still doesn't work.
When reading from the topic you need to use a Deserializer appropriate for the Serializer that was used to produce to the topic. In this case, you need to use the windowedDeserializer, which you are already constructing like so:
WindowedDeserializer<String> windowedDeserializer = new WindowedDeserializer<>(stringDeserializer);
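To make the reading side concrete, here is a minimal sketch (assuming the windowedSerde built from that deserializer, plus a StreamsBuilder named builder for the consuming application) that tells builder.stream which serdes to use instead of the defaults:
KStream<Windowed<String>, Long> output = builder.stream(
        "streams-plaintext-output",
        Consumed.with(windowedSerde, Serdes.Long()));
// The key now deserializes into a Windowed<String>, whose toString shows the
// original word together with the window boundaries; the value is a Long count.
output.print(Printed.toSysOut());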
The application reads messages from one Kafka topic and, after storing them in MongoDB and doing some validations, writes them to another topic. The issue I am facing is that the application goes into an infinite loop.
The code I have is below:
BrokerHosts zkHosts = new ZkHosts("localhost:2181");
String zkRoot = "/brokers/topics" ;
String clientRequestID = "reqtest";
String clientPendingID = "pendtest";
SpoutConfig kafkaRequestConfig = new SpoutConfig(zkHosts,"reqtest",zkRoot,clientRequestID);
SpoutConfig kafkaPendingConfig = new SpoutConfig(zkHosts,"pendtest",zkRoot,clientPendingID);
kafkaRequestConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
kafkaPendingConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
KafkaSpout kafkaRequestSpout = new KafkaSpout(kafkaRequestConfig);
KafkaSpout kafkaPendingSpout = new KafkaSpout(kafkaPendingConfig);
MongoBolt mongoBolt = new MongoBolt() ;
DeviceFilterBolt deviceFilterBolt = new DeviceFilterBolt() ;
KafkaRequestBolt kafkaReqBolt = new KafkaRequestBolt() ;
abc1DeviceBolt abc1DevBolt = new abc1DeviceBolt() ;
DefaultTopicSelector defTopicSelector = new DefaultTopicSelector(xyzKafkaTopic.RESPONSE.name()) ;
KafkaBolt kafkaRespBolt = new KafkaBolt()
.withTopicSelector(defTopicSelector)
.withTupleToKafkaMapper(new FieldNameBasedTupleToKafkaMapper()) ;
TopologyBuilder topoBuilder = new TopologyBuilder();
topoBuilder.setSpout(xyzComponent.KAFKA_REQUEST_SPOUT.name(), kafkaRequestSpout);
topoBuilder.setSpout(xyzComponent.KAFKA_PENDING_SPOUT.name(), kafkaPendingSpout);
topoBuilder.setBolt(xyzComponent.KAFKA_PENDING_BOLT.name(),
deviceFilterBolt, 1)
.shuffleGrouping(xyzComponent.KAFKA_PENDING_SPOUT.name()) ;
topoBuilder.setBolt(xyzComponent.abc1_DEVICE_BOLT.name(),
abc1DevBolt, 1)
.shuffleGrouping(xyzComponent.KAFKA_PENDING_BOLT.name(),
xyzDevice.abc1.name()) ;
topoBuilder.setBolt(xyzComponent.MONGODB_BOLT.name(),
mongoBolt, 1)
.shuffleGrouping(xyzComponent.abc1_DEVICE_BOLT.name(),
xyzStreamID.KAFKARESP.name());
topoBuilder.setBolt(xyzComponent.KAFKA_RESPONSE_BOLT.name(),
kafkaRespBolt, 1)
.shuffleGrouping(xyzComponent.abc1_DEVICE_BOLT.name(),
xyzStreamID.KAFKARESP.name());
Config config = new Config() ;
config.setDebug(true);
config.setNumWorkers(1);
Properties props = new Properties();
props.put("metadata.broker.list", "localhost:9092");
props.put("serializer.class", "kafka.serializer.StringEncoder");
props.put("request.required.acks", "1");
config.put(KafkaBolt.KAFKA_BROKER_PROPERTIES, props);
LocalCluster cluster = new LocalCluster();
try{
cluster.submitTopology("demo", config, topoBuilder.createTopology());
}
In the above code, KAFKA_RESPONSE_BOLT writes the data to the topic.
abc1_DEVICE_BOLT feeds KAFKA_RESPONSE_BOLT by emitting data like this:
@Override
public void declareOutputFields(OutputFieldsDeclarer ofd) {
Fields respFields = IoTFields.getKafkaResponseFieldsRTEXY();
ofd.declareStream(IoTStreamID.KAFKARESP.name(), respFields);
}
@Override
public void execute(Tuple tuple, BasicOutputCollector collector) {
List<Object> newTuple = new ArrayList<Object>() ;
String params = tuple.getStringByField("params") ;
newTuple.add(3, params);
// ---- (remaining fields omitted)
collector.emit(IoTStreamID.KAFKARESP.name(), newTuple);
}
I was bothered by the same question for a long time; the answer is very simple... you will not believe it.
As far as I understand, the KafkaBolt implementation has to receive tuples that have a field named "message", no matter whether they come from a bolt or a spout. So you have to make some changes to your code, which I have not examined carefully. (But I believe this will help!)
The specific reason is explained at https://mail-archives.apache.org/mod_mbox/incubator-storm-user/201409.mbox/%3C6AF1CAC6-60EA-49D9-8333-0343777B48A7#andrashatvani.com%3E
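For illustration, a hedged sketch of the emitting bolt (based on the question's code, with placeholder values): the default FieldNameBasedTupleToKafkaMapper looks up tuple fields named "key" and "message", so the bolt feeding KAFKA_RESPONSE_BOLT should declare and emit exactly those fields:
@Override
public void declareOutputFields(OutputFieldsDeclarer ofd) {
    // Field names the default FieldNameBasedTupleToKafkaMapper expects
    ofd.declareStream(xyzStreamID.KAFKARESP.name(), new Fields("key", "message"));
}

@Override
public void execute(Tuple tuple, BasicOutputCollector collector) {
    String params = tuple.getStringByField("params");
    // "key" is used for partitioning and may be any value; "message" is the payload
    collector.emit(xyzStreamID.KAFKARESP.name(), new Values(params, params));
}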