How to remove a field in Logstash output - apache-kafka

I have set up an ELK stack. My Logstash instance has two outputs: Kafka and Elasticsearch.
For the Elasticsearch output I want to keep the @timestamp field, but for the Kafka output I want to remove it. So I cannot simply drop @timestamp in the filter; I want it removed only for the Kafka output.
I have not found a solution for this.
Edit:
I tried to use the clone plugin:
clone {
  clones => ["kafka"]
  id => ["kafka"]
  remove_field => ["@timestamp"]
}
output {
  if [type] != "kafka" {
    # elasticsearch output
  }
  if [type] == "kafka" {
    # kafka output
  }
}
It's strange: the Elasticsearch output works, but nothing is written to Kafka. I also tried to distinguish the events by id, which still does not work.

Since you can only remove fields in the filter block, to have the same pipeline output two different versions of the same event you will need to clone your events, remove the field in the cloned event and use conditionals in the output.
To clone your event and remove the @timestamp field you will need something like this in your filter block:
filter {
  # your other filters
  clone {
    clones => ["kafka"]
  }
  if [type] == "kafka" {
    mutate {
      remove_field => ["@timestamp"]
    }
  }
}
This will clone the event; the cloned event will have the value kafka in the field type, and you then use this field in the conditionals in your output.
output {
  if [type] != "kafka" {
    # your elasticsearch output
  }
  if [type] == "kafka" {
    # your kafka output
  }
}
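For completeness, a minimal sketch of what those two outputs could look like; the hosts, index name, and topic name below are placeholders, not values from the question:
output {
  if [type] != "kafka" {
    elasticsearch {
      # placeholder connection details
      hosts => ["localhost:9200"]
      index => "logstash-%{+YYYY.MM.dd}"
    }
  }
  if [type] == "kafka" {
    kafka {
      # placeholder broker and topic
      bootstrap_servers => "localhost:9092"
      topic_id => "logstash-events"
      codec => json
    }
  }
}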

Related

logstash date filter add_field is not working

I"m connecting to postgres and writing a few rows to elastic via logstash.
same date read/write is working fine.
After I apply a date fileter, fetch a date field and assign it to newly created field, it's not working. Below is the filter
filter {
  date {
    locale => "en"
    match => ["old_date", "YYYY-MM-dd"]
    timezone => "Asia/Kolkata"
    add_field => { "newdate" => "2022-10-06" }
    target => "@newdate"
  }
}
I tried with mutate as well, but the new field is not created and there is no error.
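No answer is shown for this question, but for comparison, here is a minimal sketch of a date filter that writes the parsed value into a separate field (field names taken from the question; this is an illustration, not a verified fix):
filter {
  date {
    locale => "en"
    match => ["old_date", "YYYY-MM-dd"]
    timezone => "Asia/Kolkata"
    # the parsed date is written here instead of overwriting @timestamp
    target => "newdate"
  }
}
Note that add_field, like the other common options, is only applied when the date filter successfully parses the source field, so a parse failure (look for a _dateparsefailure tag on the event) would also explain a missing field.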

How to update data in Elasticsearch on a schedule?

I have a table in a PostgreSQL database and I want to insert data from that table into an Elasticsearch index. I need to update the index data on a schedule, in other words delete the old data and insert the new data. I have the Logstash configuration file below, but it doesn't update the data in the index: it inserts the new data, but at the same time I still see the old data, so duplicates occur. How do I correctly update data in Elasticsearch on a schedule?
input {
  jdbc {
    jdbc_connection_string => "jdbc:postgresql://host:port/postgres"
    jdbc_user => "postgres"
    jdbc_password => "postgres"
    jdbc_driver_library => "postgresql-42.2.9.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    statement => "SELECT * FROM layers;"
    schedule => "0 0 * * MON"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "layers"
  }
}
Your index name doesn't change, so every time you add new records they are added to the same index.
Add a datetime suffix to the index name:
index => "layers-%{+YYYY.MM.dd}"
So there will be a new index for each date.
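Applied to the configuration in the question (hosts as given there), the output block would become something like:
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "layers-%{+YYYY.MM.dd}"
  }
}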
Now, for searching, create an alias so you can always use the same name in your application (for example layers/_search), by adding the alias like below:
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "layers-2019.12.11",
        "alias": "layers"
      }
    }
  ]
}
The step above is done via Kibana, or you can use an HTTP POST. However, I'd recommend using Curator for alias operations. That way, once the Logstash run completes, you can run Curator to remove the current index from the alias and add the newly created one.
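For reference, that swap can also be done in a single atomic _aliases call; the new index name follows the example above, and the old index name here is hypothetical:
POST _aliases
{
  "actions": [
    { "remove": { "index": "layers-2019.12.04", "alias": "layers" } },
    { "add": { "index": "layers-2019.12.11", "alias": "layers" } }
  ]
}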

Two outputs in logstash. One for certain aggregations only

I'm trying to specify a second Logstash output in order to save only certain aggregated data. I have no clue how to achieve this at the moment; the documentation doesn't cover such a case.
At the moment I use a single input and a single output.
Input definition (logstash-udp.conf):
input {
  udp {
    port => 25000
    codec => json
    buffer_size => 5000
    workers => 2
  }
}
filter {
  grok {
    match => [ "message", "API call happened" ]
  }
  aggregate {
    task_id => "%{example_task}"
    code => "
      map['api_calls'] ||= 0
      map['api_calls'] += 1
      map['message'] ||= event.get('message')
      event.cancel()
    "
    timeout => 60
    push_previous_map_as_event => true
    timeout_code => "event.set('aggregated_calls', event.get('api_calls') > 0)"
    timeout_tags => ['_aggregation']
  }
}
Output definition (logstash-output.conf):
output {
  elasticsearch {
    hosts => ["localhost"]
    manage_template => false
    index => "%{[@metadata][udp]}-%{+YYYY.MM.dd}"
    document_type => "%{[@metadata][type]}"
  }
}
What do I want to achieve now? I need to add a second, different aggregation (with different data and conditions). All non-aggregated data should still be saved to Elasticsearch as it is now, but the data produced by this second aggregation should be saved to Postgres. I'm pretty much stuck at the moment, and searching the web for docs and examples doesn't help.
I'd suggest using multiple pipelines: https://www.elastic.co/guide/en/logstash/current/multiple-pipelines.html
This way you can have one pipeline for the aggregation and a second one for the plain data.
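A hedged sketch of what pipelines.yml could look like for that split (the pipeline ids and config paths are made up for illustration):
# pipelines.yml
- pipeline.id: plain-data
  path.config: "/etc/logstash/conf.d/plain.conf"
- pipeline.id: aggregated-data
  path.config: "/etc/logstash/conf.d/aggregated.conf"
Each pipeline then has its own input, filter, and output sections, so the aggregated events can be sent to Postgres (for example via the community logstash-output-jdbc plugin) without touching the pipeline that writes the plain events to Elasticsearch.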

How to get total number of commits using GitHub API

I am trying to collect some statistics about our project repositories on GitHub. I am able to get the total number of commits for each contributor, but only for the default branch.
curl https://api.github.com/repos/cms-sw/cmssw/stats/contributors
The problem is: how can I get the same info for non-default branches, where I can specify a branch name? Is such an operation possible using the GitHub API?
Thanks.
You should be able to use GitHub's GraphQL API to get at this data, although it won't be aggregated for you.
Try the following query in their GraphQL Explorer:
query($owner:String!, $name:String!) {
  repository(owner:$owner, name:$name) {
    refs(first:30, refPrefix:"refs/heads/") {
      edges {
        cursor
        node {
          name
          target {
            ... on Commit {
              history(first:30) {
                edges {
                  cursor
                  node {
                    author {
                      email
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
With these variables:
{
  "owner": "rails",
  "name": "rails"
}
That will list the author email for each commit on each branch in the given repository. It would be up to you to paginate over the data (adding something like after: "b7aa251234357f7ddddccabcbce332af39dd95f6" alongside the first:30 arguments). You'd also have to aggregate the counts on your end.
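For example, paginating the commit history would look roughly like this (the cursor value is just the example one from above):
history(first:30, after:"b7aa251234357f7ddddccabcbce332af39dd95f6") {
  edges {
    cursor
    node {
      author {
        email
      }
    }
  }
}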
Hope this helps.

Mongodb group by on internal element

I have data in the following format:
{
  _id: ObjectId("someid"),
  "masterKey": {
    "key1": "val1",
    "key2": "val2",
    "key3": "val3",
    "key4": "val1",
    "key5": "val2",
    "key6": "val3"
  }
}
I am expecting a result that groups the duplicate values of masterKey.
The result should be similar to this, or in any other format that finds duplicate values of keys along with the key names:
{
  _id: ObjectId("someid"),
  "masterKey": {
    "val1": ["key1", "key4"],
    "val2": ["key2", "key5"],
    "val3": ["key3", "key6"]
  }
}
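No answer is included for this question, but as a hedged sketch, one way to invert the keys and values is an aggregation pipeline built on $objectToArray and $arrayToObject (requires MongoDB 3.4.4+; the collection name coll is an assumption):
db.coll.aggregate([
  {
    $project: {
      masterKey: {
        $arrayToObject: {
          $map: {
            // one output entry per distinct value inside masterKey
            input: {
              $setUnion: [{
                $map: {
                  input: { $objectToArray: "$masterKey" },
                  in: "$$this.v"
                }
              }]
            },
            as: "val",
            in: {
              k: "$$val",
              // collect every key whose value equals the current value
              v: {
                $map: {
                  input: {
                    $filter: {
                      input: { $objectToArray: "$masterKey" },
                      cond: { $eq: ["$$this.v", "$$val"] }
                    }
                  },
                  in: "$$this.k"
                }
              }
            }
          }
        }
      }
    }
  }
])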