Two outputs in Logstash, one for certain aggregations only (PostgreSQL)

I'm trying to specify a second output in Logstash in order to save only certain aggregated data. I have no clue how to achieve it at the moment; the documentation doesn't cover such a case.
At the moment I use a single input and a single output.
Input definition (logstash-udp.conf):
input {
  udp {
    port => 25000
    codec => json
    buffer_size => 5000
    workers => 2
  }
}
filter {
  grok {
    match => [ "message", "API call happened" ]
  }
  aggregate {
    task_id => "%{example_task}"
    code => "
      map['api_calls'] ||= 0
      map['api_calls'] += 1
      map['message'] ||= event.get('message')
      event.cancel()
    "
    timeout => 60
    push_previous_map_as_event => true
    timeout_code => "event.set('aggregated_calls', event.get('api_calls') > 0)"
    timeout_tags => ['_aggregation']
  }
}
Output definition (logstash-output.conf):
output {
  elasticsearch {
    hosts => ["localhost"]
    manage_template => false
    index => "%{[@metadata][udp]}-%{+YYYY.MM.dd}"
    document_type => "%{[@metadata][type]}"
  }
}
What do I want to achieve now? I need to add a second, different aggregation (different data and conditions). All the non-aggregated data should still be saved to Elasticsearch as it is now, but the data aggregated by this second aggregation should be saved to PostgreSQL. I'm pretty much stuck at the moment, and searching the web for docs/examples doesn't help.

I'd suggest using multiple pipelines: https://www.elastic.co/guide/en/logstash/current/multiple-pipelines.html
This way you can have one pipeline for the aggregation and a second one for the pure data.
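A minimal sketch of what that could look like, assuming Logstash 6.x+ (the pipeline ids and paths are placeholders, and the jdbc output comes from the community logstash-output-jdbc plugin, which has to be installed separately):

# config/pipelines.yml
- pipeline.id: pure-data
  path.config: "/etc/logstash/conf.d/logstash-udp.conf"
- pipeline.id: aggregated-data
  path.config: "/etc/logstash/conf.d/logstash-aggregated.conf"

Inside the aggregation pipeline you can then route on the _aggregation tag that timeout_tags already sets:

# logstash-aggregated.conf, output section
output {
  if "_aggregation" in [tags] {
    # community plugin: logstash-output-jdbc (not bundled with Logstash)
    jdbc {
      connection_string => "jdbc:postgresql://localhost:5432/mydb?user=logstash&password=secret"
      statement => [ "INSERT INTO api_calls (task, calls) VALUES (?, ?)", "example_task", "api_calls" ]
    }
  } else {
    elasticsearch {
      hosts => ["localhost"]
    }
  }
}

Note that two pipelines cannot bind the same UDP port, so either give each pipeline its own input or, on newer Logstash versions, use pipeline-to-pipeline communication to fan events out from a single input.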

Related

How to remove field in Logstash output

I have set up an ELK stack. The Logstash instance has two outputs: Kafka and Elasticsearch.
For the Elasticsearch output I want to keep the field @timestamp; for the Kafka output I want to remove it. So I cannot just remove the field @timestamp in the filter; I want it removed only for the Kafka output.
I have not found a solution for this.
Edit:
I tried to use the clone plugin:
clone {
  clones => ["kafka"]
  id => ["kafka"]
  remove_field => ["@timestamp"]
}
output {
  if [type] != "kafka" {
    # elasticsearch output
  }
  if [type] == "kafka" {
    # kafka output
  }
}
It's strange: the Elasticsearch output works, but nothing is output to Kafka. I have also tried to match on the id, but that still does not work.
Since you can only remove fields in the filter block, to have the same pipeline output two different versions of the same event you will need to clone your events, remove the field in the cloned event, and use conditionals in the output.
To clone your event and remove the @timestamp field you will need something like this in your filter block:
filter {
  # your other filters
  clone {
    clones => ["kafka"]
  }
  if [type] == "kafka" {
    mutate {
      remove_field => ["@timestamp"]
    }
  }
}
This will clone the event, and the cloned event will have the value kafka in the field type; you will then use this field in the conditionals in your output.
output {
  if [type] != "kafka" {
    # your elasticsearch output
  }
  if [type] == "kafka" {
    # your kafka output
  }
}

How to update data in Elasticsearch on a schedule?

I have a table in a PostgreSQL database and I want to insert data from that table into an Elasticsearch index. I need to update the index data on a schedule, in other words delete the old data and insert the new. I have the following Logstash configuration file, but it doesn't update the data in the index: it inserts the new data, but at the same time I still see the old data, so duplicates occur. How do I correctly update data in Elasticsearch on a schedule?
input {
  jdbc {
    jdbc_connection_string => "jdbc:postgresql://host:port/postgres"
    jdbc_user => "postgres"
    jdbc_password => "postgres"
    jdbc_driver_library => "postgresql-42.2.9.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    statement => "SELECT * FROM layers;"
    schedule => "0 0 * * MON"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "layers"
  }
}
Your index name doesn't change, so every time the scheduled query runs, the new records are added to the same index.
Add a date postfix to the index name:
index => "layers-%{+YYYY.MM.dd}"
That way there will be a new index for each date.
Now, for searching, create an alias so you can always use the same name in your application (for example layers/_search), by adding an alias like below:
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "layers-2019.12.11",
        "alias": "layers"
      }
    }
  ]
}
The step above is done via Kibana, or you can use a plain HTTP POST. However, I'd recommend using Curator for alias operations. That way, once the Logstash run completes, you can run Curator to remove the current index from the alias and add the newly created one.
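For reference, a rough sketch of such a Curator action file (the filter values, in particular timestring and unit_count, are assumptions you would adjust to the weekly schedule):

# action.yml: point the 'layers' alias at recent indices only
actions:
  1:
    action: alias
    description: "Remove old layers-* indices from the alias and add the newest one"
    options:
      name: layers
    add:
      filters:
        - filtertype: pattern
          kind: prefix
          value: layers-
        - filtertype: age
          source: name
          direction: younger
          timestring: '%Y.%m.%d'
          unit: days
          unit_count: 7
    remove:
      filters:
        - filtertype: pattern
          kind: prefix
          value: layers-
        - filtertype: age
          source: name
          direction: older
          timestring: '%Y.%m.%d'
          unit: days
          unit_count: 7

You would then run curator --config config.yml action.yml after each scheduled Logstash run.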

Filter data on call to getHyperCubeData

When I run the following, I get all records from my table object (assuming I have 100 records in all). Is there a way to send a selection/filter? For example, I want to retrieve only the records where department='procuring'.
table.getHyperCubeData('/qHyperCubeDef', [{
  qWidth: 8,
  qHeight: 100
}]).then(data => console.log(data));
I figured out the answer. Before getting the hypercube data, I need to get the field from the Doc class, then do the following:
.then(doc => doc.getField('department'))
.then(field => field.clear().then(() => field.select({qMatch: filter['procuring']})))
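Put together, the whole call might look roughly like this (a sketch: the doc and table handles and the literal 'procuring' value are assumptions based on the question):

// apply the field selection first, then fetch the now-filtered hypercube data
doc.getField('department')
  .then(field => field.clear().then(() => field.select({ qMatch: 'procuring' })))
  .then(() => table.getHyperCubeData('/qHyperCubeDef', [{
    qWidth: 8,
    qHeight: 100
  }]))
  .then(data => console.log(data));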

Order Posts by Most Votes (Overall, Last Month, etc.) with Laravel MongoDB

I am trying to understand some more advanced functions of MongoDB and Laravel but am having trouble with this. Currently I have my schema set up with users, posts, and posts_votes collections. The posts_votes collection has user_id, post_id, and timestamp fields.
In a relational DB, I would just left-join the posts_votes table, count, and order by that count, excluding dates when need be and all that.
With MongoDB I am having difficulty because there is no left-join equivalent, so I'd like to learn how to accomplish my goal in a more document-y way.
On my Post model in Laravel, I reference the votes this way, so looking at an individual post I can get the vote count, see if the current user voted for a specific post, etc.:
public function votes()
{
    return $this->hasMany(PostVote::class, 'post_id');
}
And my current working query looks like this:
$posts = Post::forCategoryType($type)
    ->with('votes', 'author', 'businessType')
    ->where('approved', true)
    ->paginate(25);
The forCategoryType method is just an extended scope I added. Here it is on the Post model/document class:
public function scopeForCategoryType($builder, $catType)
{
    if ($catType->exists) {
        return $builder->where('cat_id', $catType->id);
    }
    return $builder;
}
So when I look at posts like this one, it's close to what I want to accomplish, but I am not applying it properly. For instance, I changed my main query to look like this:
$posts = Post::forBusinessType($type)
    ->with('votes', 'author', 'businessType')
    ->where('approved', true)
    ->sortByVotes()
    ->paginate(25);
And created this new method on the Post model:
public function scopeSortByVotes($builder, $dir = 'desc')
{
    return $builder->raw(function($collection) {
        return $collection->aggregate([
            ['$group' => [
                '_id' => ['post_id' => 'votes.$post_id', 'user_id' => 'votes.$user_id']
            ],
            'vote_count' => ['$sum' => 1]
            ],
            ['$sort' => ['vote_count' => -1]]
        ]);
    });
}
This returns the exception: A pipeline stage specification object must contain exactly one field.
Not sure how to fix that (still looking), so then I tried:
return $collection->aggregate([
    ['$unwind' => '$votes'],
    ['$group' => [
        '_id' => ['post_id' => ['$votes.post_id', 'user_id' => '$votes.user_id']],
        'count' => ['$sum' => 1]
    ]]
]);
This returns an empty ArrayIterator, so then I tried:
public function scopeSortByVotes($builder, $dir = 'desc')
{
    return $builder->raw(function($collection) {
        return $collection->aggregate([
            '$lookup' => [
                'from' => 'community_posts_votes',
                'localField' => 'post_id',
                'foreignField' => '_id',
                'as' => 'vote_count'
            ]
        ]);
    });
}
But with this setup, I just get the list of posts, unsorted. I am on version 3.2.8.
The default loads everything by most recent, but ultimately I want to be able to pull these posts based on how many votes they got lifetime, and also query for which posts got the most votes in the last week, month, etc.
The example I shared keeps the grand total on the Post model along with an array of all the user ids that voted on it. With the way I have things set up, using a separate collection holding the user_id, post_id, and timestamp of when each vote happened, can I still accomplish the same goal?
Note: using this laravel mongodb library.
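Not a verified answer, but a sketch of where the attempts above seem to go wrong: every pipeline stage must be an array with exactly one key (which is what the first error message complains about), and in the $lookup the localField/foreignField pair looks reversed, since the post's _id should be matched against the vote's post_id. Assuming the posts_votes collection described in the question, a corrected pipeline might look like:

public function scopeSortByVotes($builder, $dir = 'desc')
{
    return $builder->raw(function($collection) {
        return $collection->aggregate([
            // one single-key array per pipeline stage
            ['$lookup' => [
                'from' => 'posts_votes',     // collection name assumed from the question
                'localField' => '_id',       // the post's own id...
                'foreignField' => 'post_id', // ...matched against each vote's post_id
                'as' => 'votes'
            ]],
            // keep the post and add a computed vote count ($size counts the joined array)
            ['$project' => ['document' => '$$ROOT', 'vote_count' => ['$size' => '$votes']]],
            ['$sort' => ['vote_count' => -1]]
        ]);
    });
}

Limiting the count to the last week or month could then be done by filtering the joined votes array on its timestamp field (for example with $filter) before taking $size.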

How do I mimic a "SELECT _id" in mongoid and ruby?

Currently I am doing the following:
responses = Response.where(user_id: current_user.uid)
qids = []
responses.each { |r| qids << r._id }
return qids
Any better way of doing this?
Use .only() to retrieve less data:
qids = Response.only(:_id).where(user_id: current_user.uid).map(&:_id)
Response.where(user_id: current_user.uid).map { |r| r._id }
That's a bit more idiomatic.
As far as Mongoid is concerned, the only 'mapping' type functionality it offers is custom map-reduce filters. You can check out the documentation.
In this case, writing such a filter will not be to your advantage: you are loading the entire dataset (lazy loading doesn't help) and you aren't reducing anything.
A more direct, better solution for this problem:
If you want to get the id, or anything else that's unique in the result set, it is functionally equivalent to use the distinct method. That way you save the mapping operation, and it seems to be much faster (the tests, and why you should perhaps take them with a grain of salt, are explained at the bottom).
Response.where(user_id: current_user.uid).distinct(:_id)
So only use map if you want to get something non-unique and for some reason want duplicate results, e.g. if your responses could be liked and you wanted to get an array of all likes (say you wanted to calculate some statistics about likes):
Response.where(user_id: current_user.uid).map { |r| r.likes }
Testing...
Here are some quick tests. For more trustworthy results one should run the tests against a large database instead of repeating the same action, since for all I know there could be all sorts of optimizations for repeating the same query over and over (while the map obviously can't benefit from any such optimizations).
Benchmark.measure { 1000.times { Organization.where(:event_labels.ne => []).map(&:_id) } }
=> 6.320000 0.290000 6.610000 ( 6.871498)
Benchmark.measure { 1000.times { Organization.where(:event_labels.ne => []).only(:_id).map(&:_id) } }
=> 5.490000 0.140000 5.630000 ( 5.981122)
Benchmark.measure { 1000.times { Organization.where(:event_labels.ne => []).distinct(:_id) } }
=> 0.570000 0.020000 0.590000 ( 0.773239)
Benchmark.measure { 1000.times { Organization.where(:event_labels.ne => []).only(:_id) } }
=> 0.140000 0.000000 0.140000 ( 0.141278)
Benchmark.measure { 1000.times { Organization.where(:event_labels.ne => []) } }
=> 0.070000 0.000000 0.070000 ( 0.069482)
Doing map without only takes a bit longer, so using only is beneficial. Oddly, using only seems to slightly hurt performance when you don't map at all, yet having less data makes the map itself run a bit faster. Anyhow, according to this test, distinct is about 10 times faster on all metrics (user, system, total, real) than the only and map combo, though it is slower than only without map.