MongoDB Timeseries unusably slow queries after deleting and re-creating timeseries with auto-deletion - mongodb

My measurements:
I have 3 measurement types, a lot of assets, and data with timestamps. The size of the timeseries grows by ~5GB / month after compression.
My timeseries collections:
I had a blank timeseries DB initially. It was slow, so I indexed it, and it became extremely fast and handy to use. After a few months it became too large; I couldn't delete old entries in time, so I had to drop it. It became unusable.
I have created a new one, with automatic deletion, and indexed it in two ways:
.createIndex({measurement:1, asset:1, timestamp:1})
.createIndex({asset:1, measurement:1, timestamp:1})
Every query for a single asset's measurement type, limited to 1000 results, takes upwards of 2 minutes, if it returns at all.
Example:
{asset:"my_asset", measurement:"price"}
{measurement:"price", asset:"my_asset"}
The database has been unusable from the moment it was created. It was never functional: with or without indexes, the queries were unbearably slow.
The fastest query that sometimes returns anything is one I found in MongoDB Compass: querying just {asset:"my_asset"}, across all measurement types and with no limit, while still using the indexes. That is somehow faster than querying into what should be the easiest index to navigate. I don't see how there could be a better index, or how any of these queries can take longer than a second: filtering by measurement type alone should immediately cut out roughly two thirds of the billions of entries, and filtering by asset then narrows it down to a single asset. How can this take longer than a second?
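(For reference, the winning plan for one of these queries can be inspected with an explain from the shell; this is only a sketch, with the collection name MyTS taken from the op dump below:)
db.MyTS.find({ measurement: "price", asset: "my_asset" })
    .limit(1000)
    .explain("executionStats")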
This is the server-side code snippet that makes the call:
let doc = doc! {
    "measurement": measurement, // measurement first, to match the { measurement, asset, timestamp } index
    "asset": asset
};
let find_options = FindOptions::builder()
    .projection(doc! {
        "timestamp": 1,
        "data": 1,
    })
    .limit(1000)
    .build();
let res = coll.find(doc, find_options).await.unwrap();
Example of one such op:
{
"type": "op",
"host": "---",
"desc": "---",
"connectionId": 80806,
"client": "---",
"clientMetadata": {
"driver": {
"name": "mongo-rust-driver",
"version": "2.3.1"
},
"os": {
"type": "linux",
"architecture": "x86_64",
"version": "18.04"
},
"platform": "rustc 1.67.0-nightly (1eb62b123 2022-11-27) with tokio"
},
"active": "true",
"currentOpTime": "2023-02-18T16:09:00.569+00:00",
"effectiveUsers": [
{
"user": "---",
"db": "---"
}
],
"threaded": true,
"opid": 4359443,
"lsid": {
"id": "---",
"uid": "---"
},
"secs_running": 1255,
"microsecs_running": 1255159150,
"op": "getmore",
"ns": "MyDB.MyTS",
"command": {
"getMore": {
"low": 1168443057,
"high": 1271883998,
"unsigned": false
},
"collection": "MyTS",
"$db": "MyDB",
"lsid": {
"id": "---"
},
"$clusterTime": {
"clusterTime": {
"$timestamp": "7201527765789573142"
},
"signature": {
"hash": "---",
"keyId": {
"low": 6,
"high": 1669480924,
"unsigned": false
}
}
},
"$readPreference": {
"mode": "primaryPreferred"
}
},
"planSummary": "IXSCAN { control.min.measurement: 1, control.max.measurement: 1, control.min.asset: 1, control.max.asset: 1, control.min.timestamp: 1, control.max.timestamp: 1 }",
"cursor": {
"cursorId": {
"low": 1168443057,
"high": 1271883998,
"unsigned": false
},
"createdDate": "2023-02-18T16:04:56.238Z",
"lastAccessDate": "2023-02-18T16:05:47.456Z",
"nDocsReturned": 0,
"nBatchesReturned": 1,
"noCursorTimeout": false,
"tailable": false,
"awaitData": false,
"originatingCommand": {
"aggregate": "system.buckets.MyTS",
"pipeline": [
{
"$_internalUnpackBucket": {
"timeField": "timestamp",
"bucketMaxSpanSeconds": 86400,
"exclude": [],
"assumeNoMixedSchemaData": true
}
},
{
"$match": {
"measurement": "price",
"asset": "myAsset"
}
},
{
"$limit": 1000
},
{
"$project": {
"timestamp": 1,
"data": 1
}
}
],
"cursor": {
"batchSize": 101
},
"collation": {}
},
"operationUsingCursorId": 4359443
},
"numYields": 4224,
"locks": {},
"waitingForLock": "false",
"lockStats": {
"FeatureCompatibilityVersion": {
"acquireCount": {
"r": 4480
}
},
"Global": {
"acquireCount": {
"r": 4480
}
},
"Mutex": {
"acquireCount": {
"r": 256
}
}
},
"waitingForFlowControl": false,
"flowControlStats": {}
}
It returned after 32 minutes. It used to take 3 seconds, maximum.
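One diagnostic worth trying (a sketch, not something from the original setup) is to force the query onto one of the created indexes with a hint, to see whether the planner is simply picking a bad plan; whether a hint is honoured on a time-series collection depends on the server version:
db.MyTS.find({ measurement: "price", asset: "my_asset" })
    .hint({ measurement: 1, asset: 1, timestamp: 1 })
    .limit(1000)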

Related

How to sum values using unwind in MongoDB with Spring Data

When using unwind("items") each item produces a duplicate line. This leads to revenue being counted as many times as there are items.
Example: Someone buys 1 item of A and 1 item of B, resulting in a cummulative value of 10. Unwind now inserts a row for each item, leading to a sum(cummulative) of 20.
I tried grouping the elements directly after unwind but have not managed to get it to work properly.
How can I sum each array element without duplication every other value?
I have this data structure
{
"_id": {
"$binary": "VEE6CsjHPvjzS2JYso7mnQ==",
"$type": "3"
},
"sequentialId": {
"$numberLong": "1"
},
"date": "2022-02-04",
"invoiceTotal": {
"$numberDecimal": "9.85"
},
"vatTotal": {
"$numberDecimal": "0"
},
"vatPercentage": {
"$numberDecimal": "19.00"
},
"invoiceNumber": "1111111",
"type": "ELEKTRONISCH",
"aktivkonto": 2800,
"passivkonto": 5200,
"buyerEmail": "",
"username": "",
"shop": "",
"externalId": "",
"shipped": false,
"actualShippingCost": {
"$numberDecimal": "1"
},
"filename": "",
"isReported": false,
"deliveryCostTotal": {
"$numberDecimal": "4.35"
},
"items": [
{
"lineItemId": "",
"amount": "1",
"sku": "A123123",
"title": "",
"priceTotal": {
"$numberDecimal": "4.50"
},
"vatTotal": {
"$numberDecimal": "0"
},
"hardwareCostPerPiece": {
"$numberDecimal": "0.22"
},
"hardwareCostTotal": {
"$numberDecimal": "0.22"
}
},
{
"lineItemId": "",
"amount": "1",
"sku": "B212312",
"title": "",
"priceTotal": {
"$numberDecimal": "1.00"
},
"vatTotal": {
"$numberDecimal": "0"
},
"hardwareCostPerPiece": {
"$numberDecimal": "0.22"
},
"hardwareCostTotal": {
"$numberDecimal": "0.22"
}
}
],
"packagingCost": {
"$numberDecimal": "0.15"
},
"hasInvoiceSent": false,
"tenant": "you!",
"createdAt": {
"$date": "2022-02-04T15:23:40.716Z"
},
"modifiedAt": {
"$date": "2022-02-04T15:23:40.716Z"
},
"_class": "_.RevenueEntity"
}
and this query
fun sumAllByWeeklyAndTenant(tenant: String): Flux<DashboardRevenue> {
    val aggregation = newAggregation(
        match(findByTenant(tenant)),
        unwind("items"),
        group("invoiceNumber")
            .sum("items.hardwareCostTotal").`as`("hardwareCostTotal"),
        project()
            .andExpression("year(createdAt)").`as`("year")
            .andExpression("week(createdAt)").`as`("week")
            .andInclude(bind("hardwareCostTotal", "items.hardwareCostTotal"))
            .andInclude(
                "invoiceTotal",
                "vatTotal",
                "actualShippingCost",
                "packagingCost",
                "marketplaceFeesTotal",
                "tenant"
            ),
        group("year", "week", "tenant")
            .sum("invoiceTotal").`as`("umsatz")
            .sum("actualShippingCost").`as`("portokosten")
            .sum("packagingCost").`as`("verpackung")
            .sum("marketplaceFeesTotal").`as`("marketplaceFees")
            .sum("hardwareCostTotal").`as`("hardwareCost")
            .sum("vatTotal").`as`("vatTotal")
            .count().`as`("numberOfInvoices"),
        sort(Sort.Direction.DESC, "year", "week"),
        limit(8),
        sort(Sort.Direction.ASC, "year", "week")
    )
    return reactiveMongoTemplate
        .aggregate(aggregation, "revenue", DashboardRevenue::class.java)
        .toFlux()
}
Using the data above, the query results in:
[
{
"_id": {
"year": 2022,
"month": 2,
"week": 0,
"tenant": "einsupershop"
},
"umsatz": 19.70,
"portokosten": 2,
"verpackung": 0.30,
"marketplaceFees": 3.42,
"hardwareCost": 0.22,
"vatTotal": 0,
"numberOfInvoices": 2
}
]
where the expected value is "invoiceTotal": { "$numberDecimal": "9.85" }, i.e. the invoice total counted once rather than once per item.
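A common way to avoid this kind of double counting is to group back to one row per invoice first, taking the invoice-level fields once with $first and summing only the item-level fields, and only then grouping by year/week/tenant. Here is a sketch of the idea in plain mongo shell syntax rather than the Spring Data builders, using the collection name and field names from the code and sample document above (marketplaceFeesTotal is omitted because it does not appear in the sample document):
db.revenue.aggregate([
  { $match: { tenant: "you!" } },
  { $unwind: "$items" },
  // First group: back to one document per invoice.
  // Item-level values are summed; invoice-level values are taken once with $first.
  { $group: {
      _id: "$invoiceNumber",
      createdAt:          { $first: "$createdAt" },
      tenant:             { $first: "$tenant" },
      invoiceTotal:       { $first: "$invoiceTotal" },
      vatTotal:           { $first: "$vatTotal" },
      actualShippingCost: { $first: "$actualShippingCost" },
      packagingCost:      { $first: "$packagingCost" },
      hardwareCostTotal:  { $sum: "$items.hardwareCostTotal" }
  }},
  // Second group: weekly totals per tenant; each invoice now contributes exactly once.
  { $group: {
      _id: { year: { $year: "$createdAt" }, week: { $week: "$createdAt" }, tenant: "$tenant" },
      umsatz:           { $sum: "$invoiceTotal" },
      portokosten:      { $sum: "$actualShippingCost" },
      verpackung:       { $sum: "$packagingCost" },
      hardwareCost:     { $sum: "$hardwareCostTotal" },
      vatTotal:         { $sum: "$vatTotal" },
      numberOfInvoices: { $sum: 1 }
  }}
])
In Spring Data terms this corresponds to adding $first accumulators (GroupOperation.first(...)) for the invoice-level fields to the first group stage instead of carrying them through the projection.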

Mongo Aggregation using $Max

I have a collection that stores history, i.e. a new document is created every time a change is made to the data. I need to extract fields based on the max value of a date field, but my query keeps returning either all of the dates, or requires me to push the fields into an array, which makes the data hard to analyze for an end user.
Expected output as CSV:
MAX(DATE), docID, url, type
1579719200216, 12371, www.foodnetwork.com, food
1579719200216, 12371, www.cnn.com, news
1579719200216, 12371, www.wikipedia.com, info
Sample Doc:
{
"document": {
"revenueGroup": "fn",
"metaDescription": "",
"metaData": {
"audit": {
"lastModified": 1312414124,
"clientId": ""
},
"entities": [],
"docId": 1313943,
"url": ""
},
"rootUrl": "",
"taggedImages": {
"totalSize": 1,
"list": [
{
"image": {
"objectId": "woman-reaching-for-basket",
"caption": "",
"url": "",
"height": 3840,
"width": 5760,
"owner": "Facebook",
"alt": "Woman reaching for basket"
},
"tags": {
"totalSize": 4,
"list": []
}
}
]
},
"title": "The 8 Best Food Items of 2020",
"socialTitle": "The 8 Best Food Items of 2020",
"primaryImage": {
"objectId": "woman-reaching-for-basket.jpg",
"caption": "",
"url": "",
"height": 3840,
"width": 5760,
"owner": "Hero Images / Getty Images",
"alt": "Woman reaching for basket in laundry room"
},
"subheading": "Reduce your footprint with these top-performing diets",
"citations": {
"list": []
},
"docId": 1313943,
"revisionId": "1313943_1579719200216",
"templateType": "LIST",
"documentState": {
"activeDate": 579719200166,
"state": "ACTIVE"
}
},
"url": "",
"items": {
"totalSize": "",
"list": [
{
"type": "recipe",
"data": {
"comInfo": {
"list": [
{
"type": "food",
"id": "https://www.foodnetwork.com"
}
]
},
"type": ""
},
"id": 4,
"uuid": "1313ida-qdad3-42c3-b41d-223q2eq2j"
},
{
"type": "recipe",
"data": {
"comInfo": {
"list": [
{
"type": "news",
"id": "https://www.cnn.com"
},
{
"type": "info",
"id": "https://www.wikipedia.com"
}
]
},
"type": "PRODUCT"
},
"id": 11,
"uuid": "318231jc-da12-4475-8994-283u130d32"
}
]
},
"vertical": "food"
}
My current query:
db.collection.aggregate([
{
$match: {
vertical: "food",
"document.documentState.state": "ACTIVE",
"document.templateType": "LIST"
}
},
{
$unwind: "$document.items"
},
{
$unwind: "$document.items.list"
},
{
$unwind: "$document.items.list.contents"
},
{
$unwind: "$document.items.list.contents.list"
},
{
$match: {
"document.items.list.contents.list.type": "recipe",
"document.revenueGroup": "fn"
}
},
{
$sort: {
"document.revisionId": -1
}
},
{
$group: {
_id: {
_id: {
docId: "$document.docId",
date: {$max: "$document.revisionId"}
},
url: "$document.items.list.contents.list.data.comInfo.list.id",
type: "$document.items.list.contents.list.data.comInfo.list.type"
}
}
},
{
$project: {
_id: 1
}
},
{
$sort: {
"document.items.list.contents.list.id": 1, "document.revisionId": -1
}
}
], {
allowDiskUse: true
})
First of all, you need to go through the documentation of the $group aggregation stage.
You should be doing this instead:
{
  $group: {
    "_id": "$document.docId",
    "date": { $max: "$document.revisionId" },
    "url": { $first: "$document.items.list.contents.list.data.comInfo.list.id" },
    "type": { $first: "$document.items.list.contents.list.data.comInfo.list.type" }
  }
}
This will give you the required output.
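For context, here is a sketch of how that corrected stage slots into the original pipeline; the stage order and field paths are copied from the question as-is, and the descending $sort on revisionId is what lets $first pick the values from the latest revision:
db.collection.aggregate([
  { $match: {
      vertical: "food",
      "document.documentState.state": "ACTIVE",
      "document.templateType": "LIST"
  }},
  { $unwind: "$document.items" },
  { $unwind: "$document.items.list" },
  { $unwind: "$document.items.list.contents" },
  { $unwind: "$document.items.list.contents.list" },
  { $match: {
      "document.items.list.contents.list.type": "recipe",
      "document.revenueGroup": "fn"
  }},
  // Sort newest revision first so that $first in the group picks its fields.
  { $sort: { "document.revisionId": -1 } },
  { $group: {
      _id: "$document.docId",
      date: { $max: "$document.revisionId" },
      url:  { $first: "$document.items.list.contents.list.data.comInfo.list.id" },
      type: { $first: "$document.items.list.contents.list.data.comInfo.list.type" }
  }}
], { allowDiskUse: true })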

Elasticsearch - query dates without a specified timezone

I have an index with the following mappings, using the standard format for a date. In the 2nd record below the time specified is actually a local time, but ES treats it as UTC.
ES internally converts all parsed datetimes to UTC, but it must obviously store the original string as well.
My question is whether (and how) it might be possible to query all records for which the scheduleDT value doesn't have the timezone explicitly specified.
{
"curator_v3": {
"mappings": {
"published": {
"analyzer": "classic",
"numeric_detection": true,
"properties": {
"Id": {
"type": "string",
"index": "not_analyzed",
"include_in_all": false
},
"createDT": {
"type": "date",
"format": "dateOptionalTime",
"include_in_all": false
},
"scheduleDT": {
"type": "date",
"format": "dateOptionalTime",
"include_in_all": false
},
"title": {
"type": "string",
"fields": {
"english": {
"type": "string",
"analyzer": "english"
},
"raw": {
"type": "string",
"index": "not_analyzed"
},
"shingle": {
"type": "string",
"analyzer": "shingle"
},
"spanish": {
"type": "string",
"analyzer": "spanish"
}
},
"include_in_all": false
}
}
}
}
}
}
We use .NET as our client to ElasticSearch and haven't been consistent in specifying a timezone for the scheduleDT field.
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 12,
"successful": 12,
"failed": 0
},
"hits": {
"total": 32,
"max_score": null,
"hits": [
{
"_index": "curator_v3",
"_type": "published",
"_id": "29651227",
"_score": null,
"fields": {
"Id": [
"29651227"
],
"scheduleDT": [
"2015-11-21T22:17:51.0946798-06:00"
],
"title": [
"97 Year-Old Woman Cries Tears Of Joy After Finally Getting Her High School Diploma"
],
"createDT": [
"2015-11-21T22:13:32.3597142-06:00"
]
},
"sort": [
1448165871094
]
},
{
"_index": "curator_v3",
"_type": "published",
"_id": "210466413",
"_score": null,
"fields": {
"Id": [
"210466413"
],
"scheduleDT": [
"2015-11-22T12:00:00"
],
"title": [
"6 KC treats to bring to Thanksgiving"
],
"createDT": [
"2015-11-20T15:08:25.4282-06:00"
]
},
"sort": [
1448193600000
]
}
]
},
"aggregations": {
"ScheduleDT": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 27,
"buckets": [
{
"key": 1448165871094,
"key_as_string": "2015-11-22T04:17:51.094Z",
"doc_count": 1
},
{
"key": 1448193600000,
"key_as_string": "2015-11-22T12:00:00.000Z",
"doc_count": 4
}
]
}
}
}
You can do this by querying for documents whose scheduleDT value is shorter than 20 characters (e.g. 2015-11-22T12:00:00); all the date values with a specified time zone would be longer.
Something like this should do:
{
"query": {
"filtered": {
"filter": {
"script": {
"script": "doc.scheduleDT.value.size() < 20"
}
}
}
}
}
Note, however, that in order to make your queries easier to create, you should always try to convert all your timestamps to UTC before indexing your documents.
Finally, also make sure that you have dynamic scripting enabled in order to run the above query.
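On the Elasticsearch 1.x line that this query syntax suggests, dynamic scripting was typically enabled in elasticsearch.yml with something like the following; the exact setting name varies by version, so treat this as an assumption to verify against the documentation for your release:
# elasticsearch.yml (ES 1.x era setting; verify for your version)
script.disable_dynamic: false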
UPDATE
Actually, if you use the _source directly in the script it will work, because _source returns the real value from the source exactly as it was when the document was indexed (whereas doc values hold the parsed date rather than the original string):
{
"query": {
"filtered": {
"filter": {
"script": {
"script": "_source.scheduleDT.size() < 20"
}
}
}
}
}

Elastic Search: Any way to make space-separated words in a comma-separated list regarded as one term?

I don't know if this is possible, but I'm trying to search by locations with an "exact search" option. There are a couple fields that get searched, with the most important one being the "location_raw" field:
"match": {
"location.location_raw": {
"type": "boolean",
"operator": "AND",
"query": "[location query]",
"analyzer": "standard"
}
}
The location_raw field is a location string with a comma between each place, such as "Sudbury, Middlesex, Massachusetts" or "Leamington, Warwickshire, England". If someone searches for "Sudbury, Middlesex" it gets passed in as
"query": "Sudbury Middlesex"
and both of those terms must exist in the location_raw field. This part works.
The problem is that when the location_raw field contains a multi-word location, like New York or Saint George, these documents get returned when someone searches for "York" or "George". If I do an exact search for "George", I do not want to get results for "Saint George". Is there any way to make Elastic consider "Saint George" one term in the string "Saint George, Stamford, Lincoln, England"?
Here's one way to do it, but you have to query in csv too, or use a terms filter.
I used a pattern analyzer with a simple pattern: ", ". I set up a simple index with a single document:
PUT /test_index
{
"settings": {
"number_of_shards": 1,
"analysis": {
"analyzer": {
"csv": {
"type": "pattern",
"pattern": ", ",
"lowercase": false
}
}
}
},
"mappings": {
"doc": {
"properties": {
"location": {
"type": "string",
"index_analyzer": "csv",
"search_analyzer": "standard",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}
POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"location":"Saint George, Stamford, Lincoln, England"}
I can see the terms generated with a simple terms aggregation:
POST /test_index/_search?search_type=count
{
"aggs": {
"location_terms": {
"terms": {
"field": "location"
}
}
}
}
...
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0,
"hits": []
},
"aggregations": {
"location_terms": {
"buckets": [
{
"key": "England",
"doc_count": 1
},
{
"key": "Lincoln",
"doc_count": 1
},
{
"key": "Saint George",
"doc_count": 1
},
{
"key": "Stamford",
"doc_count": 1
}
]
}
}
}
And then if I query with the same csv syntax, the document isn't returned for "George, England":
POST /test_index/_search
{
"query": {
"match": {
"location": {
"type": "boolean",
"operator": "AND",
"query": "George, England",
"analyzer": "csv"
}
}
}
}
...
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
but is for "Saint George, England":
POST /test_index/_search
{
"query": {
"match": {
"location": {
"type": "boolean",
"operator": "AND",
"query": "Saint George, England",
"analyzer": "csv"
}
}
}
}
...
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.2169777,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 0.2169777,
"_source": {
"location": "Saint George, Stamford, Lincoln, England"
}
}
]
}
}
This query is equivalent, and probably more performant:
POST /test_index/_search
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"terms": {
"location": [
"Saint George",
"England"
],
"execution": "and"
}
}
}
}
}
Here's the code I used to test it:
http://sense.qbox.io/gist/234ea93accb7b20ad8fd33e62fe92f1d450a51ab

ElasticSearch river from Mongo messing up field mappings

I'm using Mongo, Elastic Search and this river plugin: https://github.com/richardwilly98/elasticsearch-river-mongodb
I have successfully set everything up, in that the river keeps the ES data updated when Mongo is updated, but the river is copying all the properties from the Mongo documents into ES verbatim, and I only want a small sub-set of them. E.g. if a Mongo doc has 30 properties, all of them are getting put into ES instead of only the 5 that I want. I assume the issue is with the mappings; I've followed several docs and another Stack Overflow thread (curl -X POST -d #mapping.json + mapping not created), but it still is not working for me. Here is what I'm doing:
I'm creating my index with:
curl -XPOST "http://localhost:9200/mongoindex" -d #index.json
index.json:
{
"settings" : {
"number_of_shards" : 1
},
"analysis" : {
"analyzer" : {
"str_search_analyzer" : {
"tokenizer" : "keyword",
"filter" : ["lowercase"]
},
"str_index_analyzer" : {
"tokenizer" : "keyword",
"filter" : ["lowercase", "ngram"]
}
},
"filter" : {
"ngram" : {
"type" : "ngram",
"min_gram" : 2,
"max_gram" : 20
}
}
}
}
Then running:
curl -XPOST "http://localhost:9200/mongoindex/listing/_mapping" -d #mapping.json
With this data:
{
"listing":{
"properties":{
"_all": {
"enabled": false
},
"title": {
"type": "string",
"store": false,
"index": "not_analyzed"
},
"bathrooms": {
"type": "integer",
"store": true,
"index": "analyzed"
},
"bedrooms": {
"type": "integer",
"store": true,
"index": "analyzed"
},
"address": {
"type": "nested",
"include_in_parent": true,
"store": true,
"properties": {
"counrty": {
"type":"string"
},
"city": {
"type":"string"
},
"stateOrProvince": {
"type":"string"
},
"fullStreetAddress": {
"type":"string"
},
"postalCode": {
"type":"string"
}
}
},
"location": {
"type": "geo_point",
"full_name": "geometry.coordiantes",
"store": true
}
}
}
}
Then finally creating the river with:
curl -XPUT "http://localhost:9200/_river/mongoindex/_meta" -d #river.json
river.json:
{
"type": "mongodb",
"mongodb": {
"db": "blueprint",
"collection": "Listing",
"options": {
"secondary_read_preference": true,
"drop_collection": true
}
},
"index": {
"name": "mongoindex",
"type": "listing"
}
}
After all that, the river works in that ES is populated, but it's a verbatim copy of Mongo right now; I need to modify the mappings, but my mapping just is not taking effect. What am I missing?
This is what my mapping looks like after the river runs.... nothing like what I want it to look like.
I would set dynamic mapping to false:
The dynamic creation of mappings for unmapped types can be completely
disabled by setting index.mapper.dynamic to false.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-dynamic-mapping.html
Others have had similar issues to yours and it looks like the best solution so far has been to prevent the MongoDB River from dynamically mapping at all:
https://github.com/richardwilly98/elasticsearch-river-mongodb/issues/75
Turns out the issue was that the dynamic property was left out of the mappings config. It should be in two places: in the index.json as shown above, and in the mapping.json:
{
"listing":{
"_source": {
"enabled": false
},
"dynamic": false, // <--- Need to add this
"properties":{
"_all": {
"enabled": false
},
"title": {
"type": "string",
"store": false,
"index": "str_index_analyzer"
},
"bathrooms": {
"type": "integer",
"store": true,
"index": "analyzed"
},
"bedrooms": {
"type": "integer",
"store": true,
"index": "analyzed"
},
"address": {
"type": "nested",
"include_in_parent": true,
"store": true,
"properties": {
"counrty": {
"type":"string",
"index": "str_index_analyzer"
},
"city": {
"type":"string",
"index": "str_index_analyzer"
},
"stateOrProvince": {
"type":"string",
"index": "str_index_analyzer"
},
"fullStreetAddress": {
"type":"string",
"index": "str_index_analyzer"
},
"postalCode": {
"type":"string"
}
}
},
"location": {
"type": "geo_point",
"full_name": "geometry.coordiantes",
"store": true
}
}
}
}
As for the 902 docs vs. 451: I think that is a bug in the ElasticSearch Head plugin I'm using to browse documents. There are no duplicates, but a couple of spots show 902 docs as a summary of sorts.