Hi, I am a new developer to Apache Flink. I am trying to build a data-streaming application in Apache Flink integrated with Kafka.
I already have a topic named source-json created in Kafka, containing JSON which I am required to parse using Apache Flink (Scala).
I have already written the code to extract the JSON in Apache Flink, but I am facing issues parsing it with a deserializer; I tried the approach I knew from Spark, and apparently I could not find much support for that in the Flink docs.
The JSON I am trying to parse:
{
  "timestamp": "2020-01-09",
  "integrations": {},
  "context": {
    "page": {
      "path": "/",
      "title": "shop",
      "url": "www.example.com"
    },
    "library": {
      "version": "3.10.1"
    }
  },
  "properties": {
    "url": "https://www.example.com/",
    "page_type": "Home"
  },
  "Time": "2020-01-09",
  "_sorting": {
    "SortAs": ["SGML"],
    "Acronym": ["SGML"]
  }
}
I was able to write the code to print the JSON from Kafka using Flink:
import java.time.Duration
import org.apache.flink.api.common.eventtime.WatermarkStrategy
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.connector.kafka.source.KafkaSource
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer
import org.apache.flink.streaming.api.scala._

def main(args: Array[String]): Unit = {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  val source = KafkaSource.builder[String]
    .setBootstrapServers("localhost:9092")
    .setTopics("source-json")
    .setGroupId("my-group")
    .setStartingOffsets(OffsetsInitializer.earliest)
    .setValueOnlyDeserializer(new SimpleStringSchema())
    .build
  val streamEnv: DataStream[String] =
    env.fromSource(source, WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(20)), "Kafka Source")
  streamEnv.print()
  env.execute("running stream")
}
I need help parsing the JSON and storing the filtered JSON in a different Kafka topic named sink-json using Apache Flink.
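Roughly, what I am after is something like the sketch below. To be clear about my assumptions: it uses Jackson for parsing (any JSON library would do), a recent Flink (1.15+) with flink-connector-kafka on the classpath, and the filter on properties.page_type is only a placeholder condition, not my actual requirement:

import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.flink.api.common.eventtime.WatermarkStrategy
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.connector.kafka.sink.{KafkaRecordSerializationSchema, KafkaSink}
import org.apache.flink.connector.kafka.source.KafkaSource
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer
import org.apache.flink.streaming.api.scala._

object JsonFilterJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val source = KafkaSource.builder[String]
      .setBootstrapServers("localhost:9092")
      .setTopics("source-json")
      .setGroupId("my-group")
      .setStartingOffsets(OffsetsInitializer.earliest)
      .setValueOnlyDeserializer(new SimpleStringSchema())
      .build

    val raw: DataStream[String] =
      env.fromSource(source, WatermarkStrategy.noWatermarks[String](), "Kafka Source")

    // Parse each record and keep only the ones matching the (placeholder) condition.
    // A real job would reuse a single ObjectMapper inside a RichFilterFunction.
    val filtered: DataStream[String] = raw.filter { json =>
      val node = new ObjectMapper().readTree(json)
      node.at("/properties/page_type").asText() == "Home"
    }

    // Write the surviving JSON strings unchanged to the sink-json topic.
    val sink = KafkaSink.builder[String]
      .setBootstrapServers("localhost:9092")
      .setRecordSerializer(
        KafkaRecordSerializationSchema.builder[String]
          .setTopic("sink-json")
          .setValueSerializationSchema(new SimpleStringSchema())
          .build())
      .build

    filtered.sinkTo(sink)
    env.execute("filter source-json into sink-json")
  }
}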
Related
I am trying to use the Kafka HDFS 3 Sink Connector to write protobuf binary files to HDFS. However, the connector keeps writing Avro files.
I have set up my sink connector with the following config:
{
  "name": "hdfs3-connector-test",
  "config": {
    "connector.class": "io.confluent.connect.hdfs3.Hdfs3SinkConnector",
    "tasks.max": "1",
    "topics": "testproto",
    "hdfs.url": "hdfs://10.8.0.1:9000",
    "flush.size": "3",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "io.confluent.connect.protobuf.ProtobufConverter",
    "value.converter.schema.registry.url": "http://10.8.0.1:8081",
    "confluent.topic.bootstrap.servers": "10.8.0.1:9092",
    "confluent.topic.replication.factor": "1",
    "key.converter.schemas.enable": "true",
    "value.converter.schemas.enable": "true"
  }
}
As you can see, I am using the ProtobufConverter as the value converter, and the plugin is installed. (Is the ProtobufConverter converting to the Avro file format?)
I also registered my schema and sent data to the topic using the following Java file:
package app;

import java.util.Properties;

import org.apache.kafka.clients.producer.*;
import org.apache.kafka.clients.producer.KafkaProducer;

import test.Test.*;

public class App {
    public static void main(String[] args) {
        try {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "10.8.0.1:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                    io.confluent.kafka.serializers.protobuf.KafkaProtobufSerializer.class.getName());
            props.put("schema.registry.url", "http://10.8.0.1:8081");

            KafkaProducer<String, MyMsg> producer = new KafkaProducer<String, MyMsg>(props);
            String topic = "testproto";
            String key = "testkey";
            MyMsg m = MyMsg.newBuilder().setF1("testing").build();
            ProducerRecord<String, MyMsg> record = new ProducerRecord<String, MyMsg>(topic, key, m);

            producer.send(record).get();
            producer.close();
        } catch (Exception e) {
            System.out.println(e.toString());
        }
    }
}
Here is my proto file:
syntax = "proto3";
package test;
message MyMsg {
string f1 = 1;
}
So my question is: is this correct? Can I only write Avro files to HDFS using this connector? Or is my configuration incorrect, and should I be expecting protobuf files in HDFS?
You need to set the format.class config. From the docs, format.class is:
The format class to use when writing data to the store. Format classes implement the io.confluent.connect.storage.format.Format interface.
Type: class
Default: io.confluent.connect.hdfs3.avro.AvroFormat
Importance: high
These classes are available by default:
io.confluent.connect.hdfs3.avro.AvroFormat
io.confluent.connect.hdfs3.json.JsonFormat
io.confluent.connect.hdfs3.parquet.ParquetFormat
io.confluent.connect.hdfs3.string.StringFormat
https://docs.confluent.io/kafka-connect-hdfs3-sink/current/configuration_options.html#hdfs3-config-options
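For example, adding the line below to the "config" block of the connector above would switch the output files from the default Avro format to Parquet (purely illustrative; note that none of the format classes listed above writes raw protobuf files):
"format.class": "io.confluent.connect.hdfs3.parquet.ParquetFormat"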
I use Kafka to monitor changes to a local file, and Spark Streaming to analyse the data. But I can't extract the data from Kafka because the format of the data is JSON.
When I run the command bin/kafka-console-consumer.sh --bootstrap-server master:9092,slave1:9092,slave2:9092 --topic kafka-streaming --from-beginning,
the format of the data is:
{
"schema": {
"type": "string",
"optional": false
},
"payload": "{\"like_count\": 594, \"view_count\": 49613, \"user_name\": \" w\", \"play_url\": \"http://upic/2019/04/08/12/BMjAxOTA0MDgxMjQ4MTlfMjA3ODc2NTY2XzEyMDQzOTQ0MTc4XzJfMw==_b_Bfa330c5ca9009708aaff0167516a412d.mp4?tag=1-1555248600-h-0-gjcfcmzmef-954d5652f100c12e\", \"description\": \"ţ ų ඣ 9 9 9 9\", \"cover\": \"http://uhead/AB/2016/03/09/18/BMjAxNjAzMDkxODI1MzNfMjA3ODc2NTY2XzJfaGQ5OQ==.jpg\", \"video_id\": 5235997527237673952, \"comment_count\": 39, \"download_url\": \"http://2019/04/08/12/BMjAxOTA0MDgxMjQ4MTlfMjA3ODc2NTY2XzEyMDQzOTQ0MTc4XzJfMw==_b_Bfa330c5ca9009708aaff0167516a412d.mp4?tag=1-1555248600-h-1-zdpjkouqke-5862405191e4c1e4\", \"user_id\": 207876566, \"video_create_time\": \"2019-04-08 12:48:21\", \"user_sex\": \"F\"}"
}
The Spark version is 2.3.0, the Kafka version is 1.1.0, and the spark-streaming-kafka version is 0-10_2.11-2.3.0.
The JSON data in the payload field is what I want to process and analyse. How can I change my code to get at that JSON data?
Use org.apache.kafka.common.serialization.StringDeserializer and org.apache.kafka.common.serialization.StringSerializer for consuming from and sending data to the Kafka topic, respectively.
This way you will get a String on consumption, which can easily be converted to a JSON object using JSONParser.
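A minimal sketch of that approach with spark-streaming-kafka-0-10 (Scala) might look like the following; the group id, batch interval, and the use of json4s to pull the payload field out of the envelope are my own assumptions:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

object KafkaJsonStreaming {
  def main(args: Array[String]): Unit = {
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "master:9092,slave1:9092,slave2:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "spark-json-group", // placeholder group id
      "auto.offset.reset"  -> "earliest"
    )

    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-json"), Seconds(5))
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Set("kafka-streaming"), kafkaParams))

    // Each record value is the {"schema": ..., "payload": "..."} envelope shown above;
    // pull out the "payload" string, which itself holds the JSON document to analyse.
    val payloads = stream.map(_.value()).map { envelope =>
      implicit val formats: Formats = DefaultFormats
      (parse(envelope) \ "payload").extract[String]
    }
    payloads.print()

    ssc.start()
    ssc.awaitTermination()
  }
}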
I am getting the error message below when I try to write Avro records using the built-in AvroKeyValueSinkWriter in Flink 1.3.2 with Avro 1.8.2.
My schema looks like this:
{"namespace": "com.base.avro",
"type": "record",
"name": "Customer",
"doc": "v6",
"fields": [
{"name": "CustomerID", "type": "string"},
{"name": "platformAgent", "type": {
"type": "enum",
"name": "PlatformAgent",
"symbols": ["WEB", "MOBILE", "UNKNOWN"]
}, "default":"UNKNOWN"}
]
}
And I am calling the following Flink code to write data:
var properties = new util.HashMap[String, String]()
val stringSchema = Schema.create(Type.STRING)
val myTypeSchema = Customer.getClassSchema
val keySchema = stringSchema.toString
val valueSchema = myTypeSchema.toString
val compress = true
properties.put(AvroKeyValueSinkWriter.CONF_OUTPUT_KEY_SCHEMA, keySchema)
properties.put(AvroKeyValueSinkWriter.CONF_OUTPUT_VALUE_SCHEMA, valueSchema)
properties.put(AvroKeyValueSinkWriter.CONF_COMPRESS, compress.toString)
properties.put(AvroKeyValueSinkWriter.CONF_COMPRESS_CODEC, DataFileConstants.SNAPPY_CODEC)
val sink = new BucketingSink[org.apache.flink.api.java.tuple.Tuple2[String, Customer]]("s3://test/flink")
sink.setBucketer(new DateTimeBucketer("yyyy-MM-dd/HH/mm/"))
sink.setInactiveBucketThreshold(120000) // this is 2 minutes
sink.setBatchSize(1024 * 1024 * 64) // this is 64 MB,
sink.setPendingSuffix(".avro")
val writer = new AvroKeyValueSinkWriter[String, Customer](properties)
sink.setWriter(writer.duplicate())
However, it throws the following error:
Caused by: org.apache.avro.AvroTypeException: Not an enum: MOBILE
at org.apache.avro.generic.GenericDatumWriter.writeEnum(GenericDatumWriter.java:177)
at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:119)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62)
at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:302)
... 10 more
Please suggest!
UPDATE 1:
I found that this is a bug in Avro 1.8+, based on this ticket: https://issues-test.apache.org/jira/browse/AVRO-1810
It turns out this is an issue with Avro 1.8+. I had to override the version Flink uses with dependencyOverrides += "org.apache.avro" % "avro" % "1.7.3"; the bug can be found here: https://issues-test.apache.org/jira/browse/AVRO-1810
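For reference, with sbt the override goes into build.sbt like this:
// build.sbt: pin the Avro version Flink ends up using back to 1.7.x
dependencyOverrides += "org.apache.avro" % "avro" % "1.7.3"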
I am a beginner with both Logstash and Avro.
We are setting up a system with Logstash as a producer for a Kafka queue. However, we are running into the problem that the Avro-serialized events produced by Logstash cannot be decoded by the avro-tools jar (version 1.8.2) that Apache provides. Furthermore, we notice that the serialized output of Logstash and avro-tools differs.
We have the following setup:
logstash version 5.5
logstash avro codec version 3.2.1
kafka version 0.10.1
avro-tools jar version 1.8.2
As example, consider the following schema:
{
"name" : "avroTestSchema",
"type" : "record",
"fields" : [ {
"name" : "testfield1",
"type" : "string"
},
{
"name" : "testfield2",
"type" : "string"
}
]
}
and the following json string:
{"testfield1":"somestring","testfield2":"anotherstring"}
When serializing using Logstash:
Logstash config file:
input {
  stdin {
    codec => json
  }
}
filter {
  mutate {
    remove_field => ["@timestamp", "@version"]
  }
}
output {
  kafka {
    bootstrap_servers => "localhost:9092"
    codec => avro {
      schema_uri => "/path/to/TestSchema.avsc"
    }
    topic_id => "avrotestout"
  }
  stdout {
    codec => rubydebug
  }
}
output (using cat):
FHNvbWVzdHJpbmcaYW5vdGhlcnN0cmluZw==
When serializing using avro-tools:
command:
java -jar avro-tools-1.8.2.jar jsontofrag --schema-file TestSchema.avsc message.json
output:
somestringanotherstring
command:
java -jar avro-tools-1.8.2.jar fromjson --schema-file TestSchema.avsc message.json
output:
Objavro.codenullavro.schema▒{"type":"record","name":"avroTestSchema","fields":[{"name":"testfield1","type":"string"},{"name":"testfield2","type":"string"}]}▒▒▒▒&70▒▒Hs▒U2somestringanotherstring▒▒▒▒&70▒▒Hs▒U
So our question is:
How do we configure Logstash such that the output becomes compatible with the apache avro-tools jar?
UPDATE: We found out that the Logstash-produced Avro output is base64 encoded. However, we cannot find where this happens, or how to make it avro-tools compatible.
As mentioned in the update, we found out that the standard Logstash Avro codec adds a non-optional base64 encoding to the Avro output. We found this undesirable, so we forked the codec and made the encoding configurable. We tested this and it worked out of the box on several of our systems.
The fork is available on github: https://github.com/Rubyan/logstash-codec-avro
To set (or unset) the base64 encoding, add this to your logstash config file:
output {
  stdout {
    codec => avro {
      schema_uri => "schema.avsc"
      base64_encoding => false
    }
  }
}
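If you stay with the standard codec instead, the consumer side can undo the wrapping itself. Here is a minimal sketch (Scala, assuming the two-field schema above and Avro on the classpath); it relies on the fact, visible in the outputs above, that the decoded bytes are a bare binary-encoded datum rather than an Avro container file:

import java.util.Base64
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

object DecodeLogstashAvro {
  def main(args: Array[String]): Unit = {
    val schema = new Schema.Parser().parse(new java.io.File("TestSchema.avsc"))

    // Value exactly as produced by the standard Logstash avro codec above.
    val encoded = "FHNvbWVzdHJpbmcaYW5vdGhlcnN0cmluZw=="
    val bytes = Base64.getDecoder.decode(encoded)

    // The decoded bytes are a bare binary-encoded datum (no container-file header),
    // so use a binary decoder rather than DataFileReader.
    val reader  = new GenericDatumReader[GenericRecord](schema)
    val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
    val record  = reader.read(null, decoder)

    println(record.get("testfield1")) // somestring
    println(record.get("testfield2")) // anotherstring
  }
}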
There is an M2M application which wants to talk to temperature sensors in the field, i.e. send/receive messages using the MQTT pub/sub protocol.
I have set up both IOTDM as well as one with Eclipse oneM2M using Mosquitto. But I am looking for some sample APIs/commands through which an M2M application can send a message to the MQTT client and vice versa.
Or, if any of you could point me to the appropriate call flows, that would be helpful.
Any help would be highly appreciated.
Here is a GET MQTT message example:
topic: /oneM2M/req/{{origin}}/{{cse-id}}/json
message:
{
  "m2m:rqp": {
    "op": "2",
    "to": "{{resource_uri}}",
    "fr": "{{origin}}",
    "rqi": 12345,
    "pc": ""
  }
}
{{resource_uri}} is the relative path of a resource existing on the
oneM2M server (e.g. /my_cse_base/my_ae)
{{origin}} is the origin enabled (by ACP) to retrieve the resource
{{cse-id}} is the CSEbase ID
The message received could be similar to:
topic: /oneM2M/resp/{{origin}}/{{cse-id}}/json
message:
{
  "m2m:rsp": {
    "rsc": 2000,
    "rqi": 12345,
    "pc": {
      "m2m:ae": {
        "pi": "Sy2XMSpbb",
        "ty": 2,
        "ct": "20170706T085259",
        "ri": "r1NX_cOiVZ",
        "rn": "my_ae",
        "lt": "20170706T085259",
        "et": "20270706T085259",
        "acpi": ["/my_cse_base/acp_my_ae"],
        "aei": "my_ae_id",
        "rr": true
      }
    }
  }
}
A POST example:
topic: /oneM2M/req/{{origin}}/{{cse-id}}/json
message:
{
  "m2m:rqp": {
    "op": "1",
    "to": "{{resource_uri}}",
    "fr": "{{origin}}",
    "rqi": 12345,
    "ty": "4",
    "pc": {
      "m2m:cin": {
        "cnf": "text/plain:0",
        "con": "123",
        "lbl": ["test"]
      }
    }
  }
}
{{resource_uri}} is the relative path of a resource existing on the
oneM2M server (e.g. /my_cse_base/my_ae)
{{origin}} is the origin enabled (by ACP) to create a new resource
{{cse-id}} is the CSEbase ID
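If it helps, here is a minimal sketch of sending the RETRIEVE request above with the Eclipse Paho Java client (from Scala). The broker URL, origin, CSE id, and resource path are placeholders you would need to substitute:

import org.eclipse.paho.client.mqttv3.{IMqttMessageListener, MqttClient, MqttMessage}
import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence

object OneM2MRetrieveExample {
  def main(args: Array[String]): Unit = {
    // Placeholders: substitute your broker address, origin and CSE id.
    val broker = "tcp://localhost:1883"
    val origin = "my_origin"
    val cseId  = "my-cse-id"

    val client = new MqttClient(broker, MqttClient.generateClientId(), new MemoryPersistence())
    client.connect()

    // Subscribe for the response before sending the request.
    client.subscribe(s"/oneM2M/resp/$origin/$cseId/json", new IMqttMessageListener {
      override def messageArrived(topic: String, message: MqttMessage): Unit =
        println(s"response: ${new String(message.getPayload, "UTF-8")}")
    })

    // oneM2M RETRIEVE (op = 2) request, mirroring the GET example above.
    val request =
      s"""{
         |  "m2m:rqp": {
         |    "op": "2",
         |    "to": "/my_cse_base/my_ae",
         |    "fr": "$origin",
         |    "rqi": 12345,
         |    "pc": ""
         |  }
         |}""".stripMargin

    client.publish(s"/oneM2M/req/$origin/$cseId/json", new MqttMessage(request.getBytes("UTF-8")))

    Thread.sleep(5000) // crude wait for the asynchronous response
    client.disconnect()
  }
}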
For a JS talk I made an app to measure soil moisture. I used MQTT to send information from my Arduino to a server written in NodeJS. I don't know if you have some skills in JS; you can see the code on my GitHub repo. I hope this solution can help you.