Elasticsearch date range aggregation over multiple fields

My documents are structured like this:
{
"element": "A",
"date": "2014-01-01",
"valid_until": "2014-02-01"
},
{
"element": "A",
"date": "2014-02-01",
"valid_until": "9999-12-31"
}
The date "9999-12-31" means "not yet expired". Every document carries a range like this, so for a given element "A" the intervals between date and valid_until never overlap. I can therefore count how many elements I have at a given moment with pseudo-code like this: COUNT elements WHERE date < date_to_count AND valid_until >= date_to_count
Here "date_to_count" is the date at which I want to count the values. As I want to calculate this at several points in time, I could use either a date histogram or a date range aggregation. However, the date range aggregation seems to work with only one field. Ideally, I'd like to be able to do this:
"aggs": {
"foo": {
"date_range": {
"fields": ["date", "valid_until"],
"ranges": [
{"from": "2014-01-01", "to": "2014-02-01"},
{"from": "2014-02-01", "to": "2014-03-01"},
{"from": "2014-03-01", "to": "2014-04-01"}
]
}
}
}
Here "date" would be used for "from" and "valid_until" would be used for "to".
I've tried several other ideas with scripts, but can't find an efficient way to do it.
I think I could also work around this if a script had access to the current from/to values, but when I tried things like "ctx.to" and "context.to", those variables were undefined.
Thanks!

Since both the date_range and date_histogram aggregations work on a single field, I do not think you can achieve your goal with an aggregation. But if you don't have too many date ranges that you need to query for, you could call the count API with a query for each date range. That would look something like this:
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{ "range": { "date": { "gte": "2014-01-01" }}},
{ "range": { "valid_until": { "lt": "2014-02-01" }}}
]
}
}
}
}
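As a rough illustration of the one-query-per-date idea, the Python sketch below builds one count request body per date. It assumes the predicate from the question (date < date_to_count AND valid_until >= date_to_count) and expresses it as a bool query with filter clauses rather than the filtered query shown above. The helper name build_count_body and the sample dates are hypothetical, and actually sending the bodies to the count API is left out.

```python
import json


def build_count_body(date_to_count):
    """Build a count request body for one point in time (hypothetical helper)."""
    return {
        "query": {
            "bool": {
                "filter": [
                    # Assumed predicate from the question: started before the date...
                    {"range": {"date": {"lt": date_to_count}}},
                    # ...and still valid on that date.
                    {"range": {"valid_until": {"gte": date_to_count}}},
                ]
            }
        }
    }


# Sample dates chosen for illustration only.
dates_to_count = ["2014-01-15", "2014-02-15", "2014-03-15"]
bodies = {d: build_count_body(d) for d in dates_to_count}

for d, body in bodies.items():
    print(d, json.dumps(body))
```

Each body would then be sent as a separate count request, so the number of round trips grows with the number of dates.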

I was facing the same problem and wanted to solve it with a single query. Here is the solution that works for me in Elasticsearch 5.2:
"aggs": {
"range1": {
"date_range": {
"field": "date",
"ranges": [
{"from": "2014-01-01", "to": "2014-02-01"},
{"from": "2014-02-01", "to": "2014-03-01"},
{"from": "2014-03-01", "to": "2014-04-01"}
]
}
},
"range2": {
"date_range": {
"field": "valid_until",
"ranges": [
{"from": "2014-01-01", "to": "2014-02-01"},
{"from": "2014-02-01", "to": "2014-03-01"},
{"from": "2014-03-01", "to": "2014-04-01"}
]
}
}
}
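Note that two date_range aggregations bucket each field independently. If the goal is the count from the question (documents with date < d and valid_until >= d), one possible alternative is a filters aggregation with one named bucket per date. This is a different technique from the answer above, sketched here in Python under that assumed predicate; the helper name, the aggregation name valid_at and the sample dates are hypothetical, and the sketch only builds the request body.

```python
import json


def build_filters_agg(dates_to_count):
    """Build a search body with one named filter bucket per date (hypothetical helper)."""
    buckets = {}
    for d in dates_to_count:
        buckets[d] = {
            "bool": {
                "filter": [
                    {"range": {"date": {"lt": d}}},
                    {"range": {"valid_until": {"gte": d}}},
                ]
            }
        }
    # "size": 0 skips the hits; only the bucket counts are of interest.
    return {"size": 0, "aggs": {"valid_at": {"filters": {"filters": buckets}}}}


body = build_filters_agg(["2014-01-15", "2014-02-15"])
print(json.dumps(body, indent=2))
```

Each bucket's doc_count in the response would then be the number of documents matching both range conditions for that date.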

Related

What is the best way to query an array of subdocument in MongoDB?

let's say I have a collection like so:
{
"id": "2902-48239-42389-83294",
"data": {
"location": [
{
"country": "Italy",
"city": "Rome"
}
],
"time": [
{
"timestamp": "1626298659",
"data":"2020-12-24 09:42:30"
}
],
"details": [
{
"timestamp": "1626298659",
"data": {
"url": "https://example.com",
"name": "John Doe",
"email": "john#doe.com"
}
},
{
"timestamp": "1626298652",
"data": {
"url": "https://www.myexample.com",
"name": "John Doe",
"email": "doe#john.com"
}
},
{
"timestamp": "1626298652",
"data": {
"url": "http://example.com/sub/directory",
"name": "John Doe",
"email": "doe#johnson.com"
}
}
]
}
}
Now the main focus is on the array of subdocuments ("data.details"). I want the output to contain only relevant matches, e.g.:
db.info.find({"data.details.data.url": "example.com"})
How can I match every "data.details.data.url" that contains "example.com" without also matching "myexample.com"? When I use $regex I get too many results: a query for "example.com" also returns "myexample.com".
Even when I do get partial results (with $match), it is very slow. I tried these aggregation stages:
{ $unwind: "$data.details" },
{
$match: {
"data.details.data.url": /.*example.com.*/,
},
},
{
$project: {
id: 1,
"data.details.data.url": 1,
"data.details.data.email": 1,
},
},
I really don't understand the pattern: with $match, sometimes Mongo does recognize prefixes like "https://" or "https://www." and sometimes it does not.
More info:
My collection is dozens of GB in size. I created two indexes:
Compound like so:
"data.details.data.url": 1,
"data.details.data.email": 1
Text Index:
"data.details.data.url": "text",
"data.details.data.email": "text"
It did improve the query performance, but not enough, and I still have this issue with $match vs $regex. Thanks for any help!

Your mistake is in the regex. It matches all the URLs because the substring example.com appears in all of them. For example, in https://www.myexample.com the trailing example.com is the part that matches.
To avoid this you have to use a different regex, for example one that requires the domain to come right after the scheme or the www. prefix.
For example:
(http[s]?:\/\/|www\.)YOUR_SEARCH
will check that what you are searching for comes directly after http://, https:// or www.
https://regex101.com/r/M4OLw1/1
Here is the full query:
[
{
'$unwind': {
'path': '$data.details'
}
}, {
'$match': {
'data.details.data.url': /(http[s]?:\/\/|www\.)example\.com/
}
}
]
Note: you must escape special characters in the regex. An unescaped dot matches any character, and an unescaped slash closes the regex literal and causes an error.
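To check the pattern outside MongoDB, here is a small Python sketch that applies the same idea to the three URLs from the question. The helper name domain_pattern is hypothetical; re.escape handles the escaping of the dot mentioned in the note.

```python
import re


def domain_pattern(domain):
    """Match the domain only when it directly follows the scheme or 'www.'."""
    return re.compile(r"(https?://|www\.)" + re.escape(domain))


urls = [
    "https://example.com",
    "https://www.myexample.com",
    "http://example.com/sub/directory",
]

pattern = domain_pattern("example.com")
matches = [u for u in urls if pattern.search(u)]
print(matches)
```

With these sample URLs, only https://www.myexample.com is filtered out, because "myexample.com" is preceded by "my" rather than by the scheme or the www. prefix.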

How to filter OData collection where attribute does not exist?

I have an OData collection where the data looks like this:
{
"@odata.context": "http://localhost:5488/odata/$metadata#folders",
"value": [
{
"name": "samples",
"_id": "79a91bc9-9083-4442-ac8d-ad30777ac8c8",
"creationDate": "2019-08-05T04:39:00.670Z",
"modificationDate": "2019-08-05T04:39:00.670Z",
"shortid": "18xQnNv"
},
{
"name": "Population",
"folder": {
"shortid": "18xQnNv"
},
"_id": "7406269b-669c-41ce-92f3-f540792df07e",
"creationDate": "2019-08-05T04:39:00.750Z",
"modificationDate": "2019-08-05T04:39:00.750Z",
"shortid": "0ppeLV"
},
{
"name": "Invoice",
"folder": {
"shortid": "18xQnNv"
},
"_id": "525aff6a-6b10-4ad6-93ce-e9c753e8ade0",
"creationDate": "2019-08-05T04:39:00.790Z",
"modificationDate": "2019-08-05T04:39:00.790Z",
"shortid": "G3i2B3"
},
{
"name": "Default",
"_id": "58daf5aa-1f13-4ff9-be1f-8cb11a812485",
"creationDate": "2019-08-07T22:56:45.160Z",
"modificationDate": "2019-08-07T22:56:45.160Z",
"shortid": "Sm8LpmP"
}
]
}
I want to exclude the objects which have the attribute "folder". I've tried using a GET request: http://localhost:5488/odata/folders?$filter=folder eq null with no luck. Is this even possible and is there a way to filter my request like this?

You might be able to use the all lambda operator to accomplish this. The all operator always returns true on empty collections. So if you write a condition that no existing folder attribute can ever satisfy, the filter should return only those objects whose folder attribute is empty.
This is just a theory. You'll need to test it, but on your sample it might look something like this:
http://localhost:5488/odata/folders?$filter=folder/all(f:f/shortid eq 'xxxxxx')
You didn't mention the version of OData you're working with, but lambda expressions are available in V4 and later. Possibly earlier, not sure.
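If the server-side filter turns out not to work, a fallback is to filter the response on the client. The Python sketch below is not an OData feature; it assumes the whole collection has already been fetched and parsed, and uses a trimmed copy of the sample data from the question.

```python
# Trimmed copy of the sample response from the question.
response = {
    "value": [
        {"name": "samples", "shortid": "18xQnNv"},
        {"name": "Population", "folder": {"shortid": "18xQnNv"}, "shortid": "0ppeLV"},
        {"name": "Invoice", "folder": {"shortid": "18xQnNv"}, "shortid": "G3i2B3"},
        {"name": "Default", "shortid": "Sm8LpmP"},
    ]
}

# Keep only the objects that have no "folder" attribute at all.
without_folder = [item for item in response["value"] if "folder" not in item]
names = [item["name"] for item in without_folder]
print(names)
```

This transfers the full collection first, so it is only reasonable for small result sets.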

How to perform date arithmetic between nested and unnested dates in Elasticsearch?

Consider the following Elasticsearch (v5.4) object (an "award" doc type):
{
"name": "Gold 1000",
"date": "2017-06-01T16:43:00.000+00:00",
"recipient": {
"name": "James Conroy",
"date_of_birth": "1991-05-30"
}
}
The mapping type for both award.date and award.recipient.date_of_birth is "date".
I want to perform a range aggregation to get a list of the age ranges of the recipients of this award ("Under 18", "18-24", "24-30", "30+"), at the time of their award. I tried the following aggregation query:
{
"size": 0,
"query": {"match_all": {}},
"aggs": {
"recipients": {
"nested": {
"path": "recipient"
},
"aggs": {
"age_ranges": {
"range": {
"script": {
"inline": "doc['date'].date - doc['recipient.date_of_birth'].date"
},
"keyed": true,
"ranges": [{
"key": "Under 18",
"from": 0,
"to": 18
}, {
"key": "18-24",
"from": 18,
"to": 24
}, {
"key": "24-30",
"from": 24,
"to": 30
}, {
"key": "30+",
"from": 30,
"to": 100
}]
}
}
}
}
}
}
Problem 1
But I get the following error due to the comparison of dates in the script portion:
Cannot apply [-] operation to types [org.joda.time.DateTime] and [org.joda.time.MutableDateTime].
The DateTime object is the award.date field, and the MutableDateTime object is the award.recipient.date_of_birth field. I've tried doing something like doc['recipient.date_of_birth'].date.toDateTime() (which doesn't work despite the Joda docs claiming that MutableDateTime has this method inherited from a parent class). I've also tried doing something further like this:
"script": "ChronoUnit.YEARS.between(doc['date'].date, doc['recipient.date_of_birth'].date)"
Which sadly also doesn't work :(
Problem 2
I notice if I do this:
"aggs": {
"recipients": {
"nested": {
"path": "recipient"
},
"aggs": {
"award_years": {
"terms": {
"script": {
"inline": "doc['date'].date.year"
}
}
}
}
}
}
I get 1970 with a doc_count that happens to equal the total number of docs in ES. This leads me to believe that accessing a property outside of the nested object simply does not work and gives me back some default like the epoch datetime. And if I do the opposite (aggregating dates of birth without nesting), I get the exact same thing for all the dates of birth instead (1970, epoch datetime). So how can I compare those two dates?
I am racking my brain here, and I feel like there's some clever solution that is just beyond my current expertise with Elasticsearch. Help!
If you want to set up a quick environment for this to help me out, here is some curl goodness:
curl -XDELETE http://localhost:9200/joelinux
curl -XPUT http://localhost:9200/joelinux -d "{\"mappings\": {\"award\": {\"properties\": {\"name\": {\"type\": \"string\"}, \"date\": {\"type\": \"date\", \"format\": \"yyyy-MM-dd'T'HH:mm:ss.SSSSSSZ\"}, \"recipient\": {\"type\": \"nested\", \"properties\": {\"name\": {\"type\": \"string\"}, \"date_of_birth\": {\"type\": \"date\", \"format\": \"yyyy-MM-dd\"}}}}}}}"
curl -XPUT http://localhost:9200/joelinux/award/1 -d '{"name": "Gold 1000", "date": "2016-06-01T16:43:00.000000+00:00", "recipient": {"name": "James Conroy", "date_of_birth": "1991-05-30"}}'
curl -XPUT http://localhost:9200/joelinux/award/2 -d '{"name": "Gold 1000", "date": "2017-02-28T13:36:00.000000+00:00", "recipient": {"name": "Martin McNealy", "date_of_birth": "1983-01-20"}}'
That should give you a "joelinux" index with two "award" docs to test this out ("James Conroy" and "Martin McNealy"). Thanks in advance!

Unfortunately, you can't access nested and non-nested fields within the same context. As a workaround, you can change your mapping to automatically copy the date from the nested document to the root context using the copy_to option:
{
"mappings": {
"award": {
"properties": {
"name": {
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
},
"type": "text"
},
"date": {
"type": "date"
},
"date_of_birth": {
"type": "date" // will be automatically filled when indexing documents
},
"recipient": {
"properties": {
"name": {
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
},
"type": "text"
},
"date_of_birth": {
"type": "date",
"copy_to": "date_of_birth" // copy value to root document
}
},
"type": "nested"
}
}
}
}
}
After that you can access the date of birth using the path date_of_birth, though the calculation of the number of years between the two dates is slightly tricky:
Period.between(LocalDate.ofEpochDay(doc['date_of_birth'].date.getMillis() / 86400000L), LocalDate.ofEpochDay(doc['date'].date.getMillis() / 86400000L)).getYears()
Here I convert the original JodaTime date objects to java.time.LocalDate objects:
Get the number of milliseconds since 1970-01-01
Convert to the number of days since 1970-01-01 by dividing by 86400000L (the number of ms in one day)
Convert to a LocalDate object
Create a date-based Period object from the two dates
Get the number of years between the two dates.
So, the final aggregation query looks like this:
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"age_ranges": {
"range": {
"script": {
"inline": "Period.between(LocalDate.ofEpochDay(doc['date_of_birth'].date.getMillis() / 86400000L), LocalDate.ofEpochDay(doc['date'].date.getMillis() / 86400000L)).getYears()"
},
"keyed": true,
"ranges": [
{
"key": "Under 18",
"from": 0,
"to": 18
},
{
"key": "18-24",
"from": 18,
"to": 24
},
{
"key": "24-30",
"from": 24,
"to": 30
},
{
"key": "30+",
"from": 30,
"to": 100
}
]
}
}
}
}
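The conversion steps used in the script can be mirrored outside Elasticsearch to sanity-check the expected ages. The Python sketch below is only an approximation of the Painless expression: the helper names are hypothetical, and full_years_between is a hand-written stand-in for Period.getYears() that counts completed years. The dates come from the sample documents in the question.

```python
from datetime import date, datetime, timedelta, timezone

MS_PER_DAY = 86_400_000
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_millis(dt):
    """Milliseconds since 1970-01-01 for a timezone-aware datetime."""
    return (dt - EPOCH) // timedelta(milliseconds=1)


def millis_to_date(millis):
    """Epoch milliseconds -> whole days since the epoch -> calendar date."""
    return date(1970, 1, 1) + timedelta(days=millis // MS_PER_DAY)


def full_years_between(start, end):
    """Number of completed years from start to end."""
    years = end.year - start.year
    if (end.month, end.day) < (start.month, start.day):
        years -= 1
    return years


born = to_millis(datetime(1991, 5, 30, tzinfo=timezone.utc))
awarded = to_millis(datetime(2017, 6, 1, 16, 43, tzinfo=timezone.utc))
age = full_years_between(millis_to_date(born), millis_to_date(awarded))
print(age)
```

For this recipient the result falls in the "24-30" bucket of the aggregation.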

compare two collections in mongodb using java or an simple query

I have the following document (JSON) for a gallery:
{
"_id": "53698b6092x3875407fefe7c",
"status": "active",
"colors": [
"red",
"green"
],
"paintings": [
{
"name": "MonaLisa",
"by": "LeonardodaVinci"
},
{
"name": "JungleArc",
"by": "RayBurggraf"
}
]
}
I also have a collection of colors, say:
COLORS-COLLECTION: ["black","yellow","red","green","blue","pink"]
I want to fetch paintings whose name matches the provided text, say "MonaLisa" (the search query). I also want to compare the document's colors with COLORS-COLLECTION: if colors contains any color that is also in COLORS-COLLECTION, the painting should be returned.
I want something like this:
{
"paintings": [
{
"name": "MonaLisa",
"by": "LeonardodaVinci"
}
]
}
Please help! Thanks in advance.

If I understand you correctly, the aggregation framework will do the job:
db.gallery.aggregate([
{"$unwind": "$paintings"},
{"$match": {"paintings.name": 'MonaLisa', "colors": {"$in": ["black","yellow","red","green","blue","pink"]}}},
{"$project": {"paintings": 1, "_id": 0}}
]);
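To make the effect of the three stages easier to follow, here is a plain-Python emulation of the pipeline over the sample document. It does not talk to MongoDB; the function name find_paintings is hypothetical and the logic is a simplified stand-in for $unwind, $match and $project.

```python
gallery = [
    {
        "_id": "53698b6092x3875407fefe7c",
        "status": "active",
        "colors": ["red", "green"],
        "paintings": [
            {"name": "MonaLisa", "by": "LeonardodaVinci"},
            {"name": "JungleArc", "by": "RayBurggraf"},
        ],
    }
]
COLORS = ["black", "yellow", "red", "green", "blue", "pink"]


def find_paintings(docs, name, allowed_colors):
    results = []
    for doc in docs:
        # $in on an array field: at least one element must be in the list.
        if not set(doc["colors"]) & set(allowed_colors):
            continue
        # $unwind: one candidate per painting.
        for painting in doc["paintings"]:
            # $match on the painting name.
            if painting["name"] == name:
                # $project: keep only "paintings", drop "_id".
                results.append({"paintings": painting})
    return results


found = find_paintings(gallery, "MonaLisa", COLORS)
print(found)
```

Note that after $unwind the paintings field holds a single object rather than an array, so the shape differs slightly from the desired output in the question.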

Querying Multi Level Nested fields on Elastic Search

I'm new to Elastic Search and to the non-SQL paradigm.
I've been following the ES tutorial, but there is one thing I couldn't get to work.
In the following code (I'm using PyES to interact with ES) I create a single document with a nested field (subjects) that contains another nested field (concepts).
from pyes import *
conn = ES('127.0.0.1:9200') # Use HTTP
# Delete and Create a new index.
conn.indices.delete_index("documents-index")
conn.create_index("documents-index")
# Create a single document.
document = {
"docid": 123456789,
"title": "This is the doc title.",
"description": "This is the doc description.",
"datepublished": 2005,
"author": ["Joe", "John", "Charles"],
"subjects": [{
"subjectname": 'subject1',
"subjectid": [210, 311, 1012, 784, 568],
"subjectkey": 2,
"concepts": [
{"name": "concept1", "score": 75},
{"name": "concept2", "score": 55}
]
},
{
"subjectname": 'subject2',
"subjectid": [111, 300, 141, 457, 748],
"subjectkey": 0,
"concepts": [
{"name": "concept3", "score": 88},
{"name": "concept4", "score": 55},
{"name": "concept5", "score": 66}
]
}],
}
# Define the nested elements.
mapping1 = {
'subjects': {
'type': 'nested'
}
}
mapping2 = {
'concepts': {
'type': 'nested'
}
}
conn.put_mapping("document", {'properties': mapping1}, ["documents-index"])
conn.put_mapping("subjects", {'properties': mapping2}, ["documents-index"])
# Insert document in 'documents-index' index.
conn.index(document, "documents-index", "document", 1)
# Refresh connection to make queries.
conn.refresh()
I'm able to query subjects nested field:
query1 = {
"nested": {
"path": "subjects",
"score_mode": "avg",
"query": {
"bool": {
"must": [
{
"text": {"subjects.subjectname": "subject1"}
},
{
"range": {"subjects.subjectkey": {"gt": 1}}
}
]
}
}
}
}
results = conn.search(query=query1)
for r in results:
print r # as expected, it returns the entire document.
but I can't figure out how to query based on the concepts nested field.
The ES documentation states that
Multi level nesting is automatically supported, and detected,
resulting in an inner nested query to automatically match the relevant
nesting level (and not root) if it exists within another nested query.
So, I tried to build a query with the following format:
query2 = {
"nested": {
"path": "concepts",
"score_mode": "avg",
"query": {
"bool": {
"must": [
{
"text": {"concepts.name": "concept1"}
},
{
"range": {"concepts.score": {"gt": 0}}
}
]
}
}
}
}
which returned 0 results.
I can't figure out what is missing and I haven't found any example with queries based on two levels of nesting.

OK, after trying a ton of combinations, I finally got it using the following query:
query3 = {
"nested": {
"path": "subjects",
"score_mode": "avg",
"query": {
"bool": {
"must": [
{
"text": {"subjects.concepts.name": "concept1"}
}
]
}
}
}
}
So, the nested path attribute (subjects) is always the same, no matter the nesting level of the attribute, and in the query definition I used the attribute's full path (subjects.concepts.name).

Shot in the dark since I haven't tried this personally, but have you tried the fully qualified path to Concepts?
query2 = {
"nested": {
"path": "subjects.concepts",
"score_mode": "avg",
"query": {
"bool": {
"must": [
{
"text": {"subjects.concepts.name": "concept1"}
},
{
"range": {"subjects.concepts.score": {"gt": 0}}
}
]
}
}
}
}

I have a question about JCJS's answer. Why shouldn't your mapping look like this?
mapping = {
"subjects": {
"type": "nested",
"properties": {
"concepts": {
"type": "nested"
}
}
}
}
Defining two separate type mappings probably doesn't work as intended and leaves the data flattened; I think concepts should be declared as nested inside the properties of the nested subjects field.
Finally, with this mapping the nested query should look like this:
{
"query": {
"nested": {
"path": "subjects.concepts",
"query": {
"term": {
"name": {
"value": "concept1"
}
}
}
}
}
}
It's vital to use the full path for the path attribute, but the term key can be either the full path or a relative path.
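For reference, a query against a two-level nested mapping can also be assembled programmatically. The Python sketch below assumes the mapping shown above (concepts nested inside nested subjects), wraps each nesting level explicitly and uses full field paths throughout; the helper name nested_query is hypothetical, and the body is only built, not sent to a cluster.

```python
import json


def nested_query(path, inner_query):
    """Wrap a query in a nested clause for the given full path (hypothetical helper)."""
    return {"nested": {"path": path, "query": inner_query}}


# Conditions on the innermost level, written with full field paths.
concept_conditions = {
    "bool": {
        "must": [
            {"term": {"subjects.concepts.name": "concept1"}},
            {"range": {"subjects.concepts.score": {"gt": 0}}},
        ]
    }
}

# Outer level "subjects", inner level "subjects.concepts".
body = {
    "query": nested_query(
        "subjects",
        nested_query("subjects.concepts", concept_conditions),
    )
}
print(json.dumps(body, indent=2))
```

Using full paths everywhere avoids having to remember which keys are resolved relative to the nested path.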
