In OpenSearch, can I create rollup jobs which include a filter or search term?

I am aggregating logs which have metrics embedded in them, such as the time it took for a service to complete, etc. I want to perform rollup jobs which only include these metrics. Is it possible to create a rollup job which only includes a subset of the documents in an index?
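For reference, a minimal sketch of what an OpenSearch index rollup job definition looks like when created over HTTP with Python (the endpoint belongs to the Index Management plugin; index names, field names, credentials, and the 60-minute interval are hypothetical placeholders, and the exact job schema can vary by plugin version). The open question is whether a query or filter clause can be added to this body so the job only rolls up matching documents:

    # Sketch: create an OpenSearch index rollup job via the Index Management
    # plugin REST API. All names and credentials are placeholders.
    import requests

    rollup_job = {
        "rollup": {
            "enabled": True,
            "continuous": False,
            "schedule": {"interval": {"start_time": 1672531200, "period": 1, "unit": "Hours"}},
            "description": "Roll up embedded service-latency metrics",
            "source_index": "service-logs-*",       # hypothetical source index
            "target_index": "rollup-service-logs",  # hypothetical target index
            "page_size": 1000,
            "dimensions": [
                {"date_histogram": {"source_field": "@timestamp", "fixed_interval": "60m"}}
            ],
            "metrics": [
                {"source_field": "duration_ms",     # hypothetical metric field
                 "metrics": [{"avg": {}}, {"max": {}}]}
            ],
        }
    }

    resp = requests.put(
        "https://localhost:9200/_plugins/_rollup/jobs/latency-rollup",
        json=rollup_job,
        auth=("admin", "admin"),  # placeholder credentials
        verify=False,             # only for self-signed dev certificates
    )
    print(resp.status_code, resp.json())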

Related

Citus - How to query data from a single shard?

We are evaluating Citus for the large-scale data use cases in our organization. While analyzing it, I am trying to see if there is a way to achieve the following with Citus:
We want to create a distributed table (customers) with customer_id being the shard/distribution key (customer_id is a UUID generated at the application end).
While we can use regular SQL queries for all the CRUD operations on these entities, we also need to query the table periodically (a periodic task) to select multiple entries based on some filter criteria, fetch the result set to the application, update a few columns, and write back (a read-and-update operation).
Our application is a horizontally scalable microservice with multiple instances of the service running in parallel.
So we want to split the periodic task into multiple sub-tasks that run on multiple instances of the service, executing in parallel.
So I am looking for a way to query results from a specific shard within each sub-task, so that every sub-task is responsible for fetching and updating the data on one shard only. This will let us run the periodic task in parallel without worrying about conflicts, as each sub-task operates on one shard.
I am not able to find anything in the documentation on how to achieve this. Is this possible with Citus?
Citus (by default) distributes data across the shards using the hash value of the distribution column, which is customer_id in your case.
To achieve this, you might need to store a (customer_id - shard_id) mapping in your application, assign sub-tasks to shards, and route queries from the sub-tasks using this mapping.
One hacky solution you might consider: add a dummy column (I will name it shard_id) and make it the distribution column, so that your application knows which rows should be fetched/updated by which sub-task. In other words, each sub-task will fetch/update the rows with a particular value of the shard_id column, and all of those rows will be located on the same shard because they have the same distribution column value. In this case, you can control which customer_ids end up on the same shard, and which ones form a separate shard, by assigning them the shard_id you want.
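A sketch of that dummy-column idea, assuming psycopg2 against the Citus coordinator (the table layout, bucket count, and bucket-derivation rule are illustrative application choices, not Citus requirements):

    # Sketch: distribute on a dummy shard_id column so each sub-task owns
    # one bucket of rows. Names and connection details are placeholders.
    import uuid
    import psycopg2

    NUM_BUCKETS = 32  # how many shard_id values the application hands out

    conn = psycopg2.connect("dbname=app host=coordinator user=app")
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE customers (
            shard_id    int  NOT NULL,  -- dummy distribution column
            customer_id uuid NOT NULL,
            name        text
        );
    """)
    # Distribute on the dummy column instead of customer_id.
    cur.execute("SELECT create_distributed_table('customers', 'shard_id');")

    # The application derives the bucket deterministically from the UUID,
    # so it always knows which shard_id a customer's rows live under.
    customer_id = uuid.uuid4()
    bucket = customer_id.int % NUM_BUCKETS
    cur.execute(
        "INSERT INTO customers (shard_id, customer_id, name) VALUES (%s, %s, %s)",
        (bucket, str(customer_id), "Acme"),
    )

    # Each periodic sub-task then works on exactly one bucket:
    cur.execute("SELECT customer_id, name FROM customers WHERE shard_id = %s", (bucket,))
    conn.commit()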
I would also suggest taking a look at "tenant isolation", which is mentioned in the latest blog post: https://www.citusdata.com/blog/2022/09/19/citus-11-1-shards-postgres-tables-without-interruption/#isolate-tenant
It basically isolates a tenant (all data with the same customer_id in your case) into a single shard. Maybe it works for you at some point.
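A minimal sketch of that, assuming the table is distributed by customer_id and using the isolate_tenant_to_new_shard function the blog post describes (connection details and the UUID are placeholders):

    # Sketch: move one high-volume tenant into its own shard.
    import psycopg2

    conn = psycopg2.connect("dbname=app host=coordinator user=app")
    cur = conn.cursor()
    cur.execute(
        "SELECT isolate_tenant_to_new_shard('customers', %s::uuid, "
        "cascade_option => 'CASCADE');",
        ("2f9b6f06-5a3d-4d41-b0f1-6d54e4b8f3aa",),  # hypothetical customer_id
    )
    conn.commit()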

Filtering data from Elasticsearch based on MongoDB

I have a list of items in Elasticsearch. A user enters a query and I fetch the results from Elasticsearch. Now, I have some user preferences stored in MongoDB, based on which I want to filter the results from Elasticsearch.
Suppose I get a list of items (item_ids) from Elasticsearch.
MongoDB has the following schema:
id, user_id, item_id
I chose this MongoDB schema because a user could have a very big list of items (on the order of millions) which he doesn't want to see in the results.
How do I achieve this at scale? Do I need to change my schema?
You should use Elasticsearch filtering for this: you can include the filter criteria in your ES query, which reduces the number of results to return. Without it, you have to return a huge data set from ES and then do the filtering in MongoDB, which is a two-step process and costly on both the ES and Mongo sides.
With filters at the ES side, it returns less data, which avoids extra post-processing in MongoDB. Filters are executed first and are by default cached on the Elasticsearch side, so you don't need a further caching solution like Redis.
Refer to the filter and query context documentation; the same official doc has info about the filter cache:
Frequently used filters will be cached automatically by Elasticsearch, to speed up performance.
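A sketch of what that looks like in practice, assuming the hidden item_ids are first pulled from MongoDB and then pushed into the ES query (index, field, database, and collection names are hypothetical; the search call uses the elasticsearch-py 8.x style):

    # Sketch: exclude per-user hidden items inside the ES query itself.
    from pymongo import MongoClient
    from elasticsearch import Elasticsearch

    mongo = MongoClient("mongodb://localhost:27017")
    es = Elasticsearch("http://localhost:9200")

    user_id = "user-42"
    hidden_ids = [doc["item_id"] for doc in
                  mongo.app.preferences.find({"user_id": user_id}, {"item_id": 1})]

    query = {
        "bool": {
            "must": [{"match": {"title": "user search text"}}],
            # must_not runs in filter context: it does not affect scoring,
            # and frequently used filters are cached by Elasticsearch.
            "must_not": [{"terms": {"item_id": hidden_ids}}],
        }
    }
    results = es.search(index="items", query=query)

One caveat: a plain terms clause is capped by index.max_terms_count (65,536 by default), so for exclusion lists in the millions you may need a different approach, such as a terms lookup that references a stored document, or flagging hidden items per user at index time.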

Continuously run MongoDB aggregation pipeline

I have an ETL pipeline that is sinking timeseries records to MongoDB.
I need to compute timely aggregations for daily, weekly, and the like. I assumed the aggregation engine of MongoDB would be the way to go, so once I had the aggregation queries for each resolution, I wrapped them in MongoDB views like "daily_view", "weekly_view", etc.
There is a REST service to fetch from MongoDB. Depending on what period resolution is requested, it pulls from the different aforementioned views, sampling for start and end dates.
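For context, the setup described might look roughly like this in pymongo (all names are hypothetical; $dateTrunc assumes MongoDB 5.0+):

    # Sketch: an aggregation wrapped as a view, queried with a date range.
    from datetime import datetime
    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017").metrics

    db.command("create", "daily_view", viewOn="timeseries", pipeline=[
        {"$group": {
            "_id": {"$dateTrunc": {"date": "$ts", "unit": "day"}},
            "avg_value": {"$avg": "$value"},
        }},
    ])

    # The REST service samples the view for a start/end window; the
    # pipeline runs on demand at read time, which is where the latency is.
    start, end = datetime(2023, 1, 1), datetime(2023, 2, 1)
    rows = list(db.daily_view.find({"_id": {"$gte": start, "$lt": end}}))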
The response times with these views/aggregations are quite "poor": around 10-15 seconds. I take it this lapse might not be outrageous for batch-computing a report, but in my case the service needs to issue these requests in live mode to serve the frontend, so a 10-second wait is too much.
From the MongoDB reference I know that views are computed on demand during read operations, but I'm a bit disappointed with such response times, because the same aggregations took split seconds in Elasticsearch or InfluxDB, which unfortunately are not an option for me at the moment.
I have also exhausted the research on optimizing the queries, and there is no room for more improvement beyond the way they already are.
My intuition tells me that if the aggregations have to be done via the aggregation engine, I need the pipelines executing continuously on the fly (so the views already have records in them for the service), as opposed to being run ad hoc every time.
I've tried to drop the views and instead have an aggregation whose last stage is an $out to a real collection... but I still have the same problem: it needs to be run "on demand". I composed the pipelines using the Compass UI, and in the $out stage it presents a button to run the aggregation.
Would there be a way to schedule such pipelines/aggregation queries?
Something I can think of is copy-pasting the code of the aggregations and making it into JavaScript functions of the REST service... but still, something would have to invoke those functions on a regular interval. I know there are libraries I can bring into the service for scheduling, but this option makes me feel a bit uncomfortable in terms of architecture.
In the worst-case scenario, my backup plan is to implement the timely aggregations as part of the logic of the initial ETL and sink all the different resolutions to different collections, so the service will find records already waiting in the aggregated collections. But the intention was to delegate the time aggregations to the datastore engine.
I'm having a bit of last-minute architecture distress now.
Use the $out aggregation stage. From the documentation:
Takes the documents returned by the aggregation pipeline and writes them to a specified collection. The $out operator must be the last stage in the pipeline.
The mongo shell accepts a JavaScript file as an argument, so this is the easiest way to package your aggregation. Reference:

    mongo file.js --username username --password

Then, to execute it on a schedule, common tools like cron jobs come to the rescue.
You might need to account for the differences between the mongo shell and plain JavaScript, such as using db = db.getSiblingDB('<db>') instead of use <db>. See "Write Scripts for the mongo Shell".
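An equivalent sketch of the same idea in pymongo (database, collection, and field names are hypothetical; $dateTrunc assumes MongoDB 5.0+), saved as a script that cron invokes at whatever interval matches the resolution:

    # aggregate_daily.py - materialize the daily aggregation with $out.
    # crontab example: 5 0 * * * /usr/bin/python3 /opt/jobs/aggregate_daily.py
    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017").metrics

    db.timeseries.aggregate([
        {"$group": {
            "_id": {"$dateTrunc": {"date": "$ts", "unit": "day"}},
            "avg_value": {"$avg": "$value"},
            "count": {"$sum": 1},
        }},
        # $out must be the last stage; it rewrites the target collection,
        # so the REST service reads precomputed results instead of views.
        {"$out": "daily_aggregates"},
    ])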

Azure REST API - Get all VMs from a resource, ordered

I am trying to make a simple ranking (like a top 10) of the VMs that I have in a specific resource. For example, the top 10 VMs by the Percentage CPU metric. Right now what I am doing is collecting the metrics of each one individually and then comparing them with the others. I couldn't find anything in the API that would make this less rudimentary; do you know of any query or filter that approximates what I am requesting?
Thanks!
Today the REST API for metrics can't do any aggregation across multiple resources.
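So the per-VM loop the question describes seems to be the way to go for now. A rough sketch of that client-side ranking against the Azure Monitor metrics endpoint (subscription, resource group, VM names, timespan, and token acquisition are all placeholders):

    # Sketch: fetch Percentage CPU per VM, then rank locally.
    import requests

    TOKEN = "<bearer-token>"  # assumed to be acquired via Azure AD elsewhere
    SUB = "<subscription-id>"
    RG = "<resource-group>"
    VMS = ["vm-01", "vm-02", "vm-03"]  # hypothetical VM names

    averages = {}
    for vm in VMS:
        resource = (f"/subscriptions/{SUB}/resourceGroups/{RG}"
                    f"/providers/Microsoft.Compute/virtualMachines/{vm}")
        resp = requests.get(
            f"https://management.azure.com{resource}/providers/microsoft.insights/metrics",
            params={
                "api-version": "2018-01-01",
                "metricnames": "Percentage CPU",
                "aggregation": "Average",
                "timespan": "2023-01-01T00:00:00Z/2023-01-02T00:00:00Z",
            },
            headers={"Authorization": f"Bearer {TOKEN}"},
        )
        series = resp.json()["value"][0]["timeseries"]
        points = series[0]["data"] if series else []
        vals = [p["average"] for p in points if "average" in p]
        averages[vm] = sum(vals) / len(vals) if vals else 0.0

    top10 = sorted(averages.items(), key=lambda kv: kv[1], reverse=True)[:10]
    print(top10)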

Restrict querying MongoDB collection to inactive chunks only

I am building an application which will perform two phases.
Execute Phase - The first phase is very INSERT intensive (as many inserts as the hardware can possibly execute in a second). This is essentially a logging trail of work performed.
Validation Phase - The next phase will query the logs generated by phase 1, compare them to an external source, and perform an UPDATE on each record to store some statistics. This process is second priority to phase 1.
I'm trying to see if it's feasible to do them in parallel and keep write locking to a minimum for the execution phase. I thought one way to do this would be to restrict my validation phase to only query older records which are not in the chunk currently being inserted into by the execution phase. Is there something in MongoDB that restricts a find() to only query chunks that have not been accessed in some configurable amount of time?
You probably want to set up a replica set. Insert into the master and fetch from the secondaries. That way, your inserts won't be blocked at all.
You can use the mentioned replica set with slaveOk for reads, and update on the master.
You can use a timestamp field or an ObjectId (which already contains a timestamp) for filtering.
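A small sketch combining both suggestions, assuming pymongo and hypothetical database/collection names: reads go to secondaries, and the validation query is restricted to records older than a cutoff derived from the timestamp embedded in the _id ObjectId.

    # Sketch: validation-phase reads from secondaries, bounded by _id time.
    from datetime import datetime, timedelta
    from bson.objectid import ObjectId
    from pymongo import MongoClient, ReadPreference

    client = MongoClient("mongodb://host1,host2,host3/?replicaSet=rs0")
    logs = client.worklog.get_collection(
        "entries", read_preference=ReadPreference.SECONDARY_PREFERRED
    )

    # Only touch records older than five minutes, i.e. well behind the
    # range the execute phase is currently inserting into.
    cutoff = ObjectId.from_datetime(datetime.utcnow() - timedelta(minutes=5))
    for record in logs.find({"_id": {"$lt": cutoff}}):
        # validate against the external source here; the UPDATE that
        # stores statistics is routed to the primary automatically.
        pass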