Logstash Avro output cannot be decoded by Apache avro-tools

I am a beginner with both Logstash and Avro.
We are setting up a system with Logstash as a producer for a Kafka queue. However, we are running into the problem that the Avro-serialized events produced by Logstash cannot be decoded by the avro-tools jar (version 1.8.2) that Apache provides. Furthermore, we noticed that the serialized output from Logstash and from avro-tools differs.
We have the following setup:
Logstash version 5.5
Logstash Avro codec version 3.2.1
Kafka version 0.10.1
avro-tools jar version 1.8.2
As an example, consider the following schema:
{
  "name" : "avroTestSchema",
  "type" : "record",
  "fields" : [
    { "name" : "testfield1", "type" : "string" },
    { "name" : "testfield2", "type" : "string" }
  ]
}
and the following JSON string:
{"testfield1":"somestring","testfield2":"anotherstring"}
When serializing using Logstash:
Logstash config file:
input {
  stdin {
    codec => json
  }
}
filter {
  mutate {
    remove_field => ["@timestamp", "@version"]
  }
}
output {
  kafka {
    bootstrap_servers => "localhost:9092"
    codec => avro {
      schema_uri => "/path/to/TestSchema.avsc"
    }
    topic_id => "avrotestout"
  }
  stdout {
    codec => rubydebug
  }
}
output (using cat):
FHNvbWVzdHJpbmcaYW5vdGhlcnN0cmluZw==
When serializing using avro-tools:
command:
java -jar avro-tools-1.8.2.jar jsontofrag --schema-file TestSchema.avsc message.json
output:
somestringanotherstring
command:
java -jar avro-tools-1.8.2.jar fromjson --schema-file TestSchema.avsc message.json
output:
Objavro.codenullavro.schema▒{"type":"record","name":"avroTestSchema","fields":[{"name":"testfield1","type":"string"},{"name":"testfield2","type":"string"}]}▒▒▒▒&70▒▒Hs▒U2somestringanotherstring▒▒▒▒&70▒▒Hs▒U
So our question is:
How do we configure Logstash such that the output becomes compatible with the Apache avro-tools jar?
UPDATE: We found out that the Logstash-produced Avro output is base64 encoded. However, we cannot find where this happens, or how to make it avro-tools compatible.
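For reference, base64-decoding that string and reading the result as a plain binary Avro datum with the Avro 1.8.2 Java API reproduces the original record. A minimal sketch (the class name and the schema path are placeholders), fed with the cat output shown above:

import java.io.File;
import java.util.Base64;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

public class DecodeLogstashAvro {
    public static void main(String[] args) throws Exception {
        // Base64 string as produced by the Logstash avro codec (see cat output above)
        String logstashOutput = "FHNvbWVzdHJpbmcaYW5vdGhlcnN0cmluZw==";

        // Undo the base64 layer first, leaving a plain binary-encoded Avro datum
        byte[] avroBytes = Base64.getDecoder().decode(logstashOutput);

        // Read the datum with the same schema that Logstash used
        Schema schema = new Schema.Parser().parse(new File("TestSchema.avsc"));
        GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(avroBytes, null);
        GenericRecord record = reader.read(null, decoder);

        // Prints: {"testfield1": "somestring", "testfield2": "anotherstring"}
        System.out.println(record);
    }
}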

As mentioned in the update, we found out that the standard Logstash Avro codec adds a non-optional base64 encoding to the Avro output. We found this undesirable, so we forked the codec and made the encoding configurable. We tested this and it worked out of the box on several of our systems.
The fork is available on github: https://github.com/Rubyan/logstash-codec-avro
To set (or unset) the base64 encoding, add this to your Logstash config file:
output {
  stdout {
    codec => avro {
      schema_uri => "schema.avsc"
      base64_encoding => false
    }
  }
}

Related

How can I acquire the JSON data from Kafka using Spark Streaming

I use Kafka to monitor the alteration of a local file and Spark Streaming to analyse it. But I can't extract the data from Kafka because the format of the data is JSON.
When I run the command bin/kafka-console-consumer.sh --bootstrap-server master:9092,slave1:9092,slave2:9092 --topic kafka-streaming --from-beginning,
the format of the data is:
{
  "schema": {
    "type": "string",
    "optional": false
  },
  "payload": "{\"like_count\": 594, \"view_count\": 49613, \"user_name\": \" w\", \"play_url\": \"http://upic/2019/04/08/12/BMjAxOTA0MDgxMjQ4MTlfMjA3ODc2NTY2XzEyMDQzOTQ0MTc4XzJfMw==_b_Bfa330c5ca9009708aaff0167516a412d.mp4?tag=1-1555248600-h-0-gjcfcmzmef-954d5652f100c12e\", \"description\": \"ţ ų ඣ 9 9 9 9\", \"cover\": \"http://uhead/AB/2016/03/09/18/BMjAxNjAzMDkxODI1MzNfMjA3ODc2NTY2XzJfaGQ5OQ==.jpg\", \"video_id\": 5235997527237673952, \"comment_count\": 39, \"download_url\": \"http://2019/04/08/12/BMjAxOTA0MDgxMjQ4MTlfMjA3ODc2NTY2XzEyMDQzOTQ0MTc4XzJfMw==_b_Bfa330c5ca9009708aaff0167516a412d.mp4?tag=1-1555248600-h-1-zdpjkouqke-5862405191e4c1e4\", \"user_id\": 207876566, \"video_create_time\": \"2019-04-08 12:48:21\", \"user_sex\": \"F\"}"
}
The Spark version is 2.3.0 and the Kafka version is 1.1.0. The spark-streaming-kafka version is 0-10_2.11-2.3.0.
The JSON data in the payload column is what I want to deal with and analyse. How can I change the code to acquire the JSON data?
Use org.apache.kafka.common.serialization.StringDeserializer and org.apache.kafka.common.serialization.StringSerializer for consuming from and sending data to the Kafka topic, respectively.
This way you will get a String on consumption, which can easily be converted to a JSON object using JSONParser, as sketched below.
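A minimal sketch of how this could look in Java with spark-streaming-kafka-0-10, assuming json-simple's JSONParser and reusing the brokers and topic from the question (the app name, group id and batch interval are placeholders):

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;

public class KafkaJsonStream {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("kafka-json-stream");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Consume the raw message as a String; no schema handling at the Kafka layer
        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "master:9092,slave1:9092,slave2:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "kafka-streaming-group"); // placeholder group id
        kafkaParams.put("auto.offset.reset", "earliest");

        JavaInputDStream<ConsumerRecord<String, String>> stream =
            KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                    Arrays.asList("kafka-streaming"), kafkaParams));

        // The record value is the outer JSON document; its "payload" field is
        // itself a JSON string, so it has to be parsed a second time.
        stream.map(record -> {
            JSONObject outer = (JSONObject) new JSONParser().parse(record.value());
            String payload = (String) outer.get("payload");
            return (JSONObject) new JSONParser().parse(payload);
        }).print();

        jssc.start();
        jssc.awaitTermination();
    }
}

Once the inner payload document is parsed, fields such as like_count or user_name can be read from the resulting JSONObject.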

Logstash not forwarding logs to ES via Kafka

I'm using ELK 5.0.1 and Kafka 0.10.1.0. I'm not sure why my logs aren't being forwarded. I installed kafkacat and was successfully able to produce and consume logs from all 3 servers where the Kafka cluster is installed.
shipper.conf
input {
  file {
    start_position => "beginning"
    path => "/var/log/logstash/logstash-plain.log"
  }
}
output {
  kafka {
    topic_id => "stash"
    bootstrap_servers => "<i.p1>:9092,<i.p2>:9092,<i.p3>:9092"
  }
}
receiver.conf
input {
  kafka {
    topics => ["stash"]
    group_id => "stashlogs"
    bootstrap_servers => "<i.p1>:2181,<i.p2>:2181,<i.p3>:2181"
  }
}
output {
  elasticsearch {
    hosts => ["<eip>:9200","<eip>:9200","<eip>:9200"]
    manage_template => false
    index => "logstash-%{+YYYY.MM.dd}"
  }
}
Logs: Getting the below warnings in logstash-plain.log
[2017-04-17T16:34:28,238][WARN ][org.apache.kafka.common.protocol.Errors] Unexpected error code: 38.
[2017-04-17T16:34:28,238][WARN ][org.apache.kafka.clients.NetworkClient] Error while fetching metadata with correlation id 44 : {stash=UNKNOWN}
It looks like your bootstrap servers are using ZooKeeper ports. Try using the Kafka ports instead (default 9092).

Can’t get Logstash to read existing Kafka topic from start

I'm trying to consume a Kafka topic using Logstash, for indexing by Elasticsearch. The Kafka events are JSON documents.
We recently upgraded our Elastic Stack to 5.1.2.
I believe that I was able to consume the topic OK in 5.0, using the same settings, but that was a while ago so perhaps I'm doing something wrong now, but can't see it. This is my config (slightly sanitized):
input {
  kafka {
    bootstrap_servers => "host1:9092,host2:9092,host3:9092"
    client_id => "logstash-elastic-5-c5"
    group_id => "logstash-elastic-5-g5"
    topics => "trp_v1"
    auto_offset_reset => "earliest"
  }
}
filter {
  json {
    source => "message"
  }
  mutate {
    rename => { "@timestamp" => "indexedDatetime" }
    remove_field => [
      "@timestamp",
      "@version",
      "message"
    ]
  }
}
output {
  stdout { codec => rubydebug }
  elasticsearch {
    hosts => ["host10:9200", "host11:9200", "host12:9200", "host13:9200"]
    action => "index"
    index => "trp-i"
    document_type => "event"
  }
}
When I run this, no messages are consumed, no sign of activity appears in the log after "[org.apache.kafka.clients.consumer.internals.ConsumerCoordinator] Setting newly assigned partitions", and in Kafka Manager the consumer appears to immediately appear with "total lag = 0" for the topic.
This version of the Kafka plugin stores consumer offsets in Kafka itself, so each time I try to run Logstash against the same topic, I increment the group_id; in theory, it should then start from the earliest offset for the topic.
Any advice?
EDIT: It appears that despite setting auto_offset_reset to "earliest", it isn't working - it's as if it's being set to "latest". I left Logstash running, then had more entries loaded into the Kafka queue, and they were processed by Logstash.

Reading from multiple topics in Apache Kafka

I'm trying to read from multiple Kafka topics (say 'newtest-1' and 'newtest-2') using the 'white_list' configuration in the Logstash input plugin. My Logstash conf looks like:
input { kafka { white_list => "newtest-1|newtest-2" } } output { stdout {codec => rubydebug } }
With this configuration I can successfully read from two different topics. But I want to use regex for input topics as I'm expecting the topics to be of the form 'newtest-*'. According to the suggestion in this link, the following configuration should work:
input { kafka { white_list => "newtest-*" } } output { stdout {codec => rubydebug } }
But with this I'm not able to read from kafka. Any help is appreciated.
The white_list should be newtest-.*
This is relevant to older versions of the plugin. Now you can use topics.

Logstash MongoDB connection issue

I am unable to push data to MongoDB using Logstash.
My config file looks like:
input {
  file {
    type => "log"
    path => "d:\logs\*.txt"
  }
}
output {
  mongodb {
    database => "abhi1"
    collection => "plain"
    uri => "mongodb://127.0.0.1:27017"
  }
}
The command used for executing the configuration file is logstash -f ./conf/demo.conf
ERROR:
[2015-09-08T16:26:04.883000 #4528] DEBUG -- : MONGODB | COMMAND | namespace=admin.$cmd selector={:ismaster=>1} flags=[] limit=-1 skip=0 project=nil | runtime: 46.9999ms
Hoping to get a workaround soon. Thanks.