Fetching esJsonRDD from elasticsearch with complex filtering in Spark - scala

I currently fetch the Elasticsearch RDD in our Spark job by filtering with a one-line Elasticsearch query, for example:
val elasticRdds = sparkContext.esJsonRDD(esIndex, s"?default_operator=AND&q=director.name:DAVID + \n movie.name:SEVEN")
Now if our search query becomes complex like:
{
"query": {
"filtered": {
"query": {
"query_string": {
"default_operator": "AND",
"query": "director.name:DAVID + \n movie.name:SEVEN"
}
},
"filter": {
"nested": {
"path": "movieStatus.boxoffice.status",
"query": {
"bool": {
"must": [
{
"match": {
"movieStatus.boxoffice.status.rating": "A"
}
},
{
"match": {
"movieStatus.boxoffice.status.oscar": "false"
}
}
]
}
}
}
}
}
}
}
Can I still convert that query to an inline query for use with esJsonRDD? Or is there any way to use the above query as is with esJsonRDD?
If not, what is a better way to fetch such RDDs in Spark?
I ask because esJsonRDD seems to accept only inline (one-line) Elasticsearch queries.

Use a triple-quoted string, which lets you pass the multi-line query body to esJsonRDD as is:
val query = """{
"query": {
"filtered": {
"query": {
"query_string": {
"default_operator": "AND",
"query": "director.name:DAVID + \n movie.name:SEVEN"
}
},
"filter": {
"nested": {
"path": "movieStatus.boxoffice.status",
"query": {
"bool": {
"must": [
{
"match": {
"movieStatus.boxoffice.status.rating": "A"
}
},
{
"match": {
"movieStatus.boxoffice.status.oscar": "false"
}
}
]
}
}
}
}
}
}
}"""
val elasticRdds = sparkContext.esJsonRDD(esIndex, query)


Cannot find # in OpenSearch query

I have an index with a field whose values can contain a '#'. When the search input includes a '#', my query does not find it.
Field Data: "#3213939"
Query:
GET /invoices/_search
{
"query": {
"bool": {
"should": [
{
"match": {
"referenceNumber": {
"query": "#32"
}
}
},
{
"wildcard": {
"referenceNumber": {
"value": "*#32*"
}
}
}
]
}
}
}
The standard text analyzer drops the "#" character, which is why you can't find it. You can see this with the _analyze API:
POST _analyze
{
"text": ["#3213939"]
}
Response:
{
"tokens": [
{
"token": "3213939",
"start_offset": 1,
"end_offset": 8,
"type": "<NUM>",
"position": 0
}
]
}
You can customize the analyzer so that it keeps the character:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-standard-analyzer.html
Alternatively, you can run the wildcard query against the referenceNumber.keyword field:
GET test_invoices/_search
{
"query": {
"bool": {
"should": [
{
"match": {
"referenceNumber": {
"query": "#32"
}
}
},
{
"wildcard": {
"referenceNumber.keyword": {
"value": "*#32*"
}
}
}
]
}
}
}
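The "customize the analyzer" option could look roughly like the sketch below, which builds an index-creation body in Python. It assumes the built-in whitespace tokenizer and lowercase token filter, and the typeless mapping format of recent Elasticsearch/OpenSearch versions; the analyzer name keep_symbols and the helper build_index_body are hypothetical names chosen for illustration.

```python
import json


def build_index_body(field="referenceNumber", analyzer_name="keep_symbols"):
    """Return an index-creation body whose text field keeps '#' in its tokens.

    The analyzer name is a hypothetical label; 'whitespace' and 'lowercase'
    are built-in components that split on whitespace only, so '#3213939'
    stays a single token instead of becoming '3213939'.
    """
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    analyzer_name: {
                        "type": "custom",
                        "tokenizer": "whitespace",
                        "filter": ["lowercase"],
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                field: {"type": "text", "analyzer": analyzer_name}
            }
        },
    }


index_body = build_index_body()
# Print the JSON that would be sent as the body of the index-creation request.
print(json.dumps(index_body, indent=2))
```

With such an analyzer the stored token is "#3213939", so a wildcard such as *#32* can match on the analyzed field itself, while a match query for "#32" would still need the whole token to be equal.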

Get inserted document counts in specific date range using date histogram in elasticsearch

I have a list of documents in Elasticsearch that contain various fields.
The documents look like this:
[
{
"role": "api_user",
"apikey": "key1",
"data":{},
"#timestamp": "2021-10-06T16:47:13.555Z"
},
{
"role": "api_user",
"apikey": "key1",
"data":{},
"#timestamp": "2021-10-06T18:00:00.555Z"
},
{
"role": "api_user",
"apikey": "key1",
"data":{},
"#timestamp": "2021-10-07T13:47:13.555Z"
}
]
I want to find the number of documents in a specific date range at a 1-day interval, say
2021-10-05T00:47:13.555Z to 2021-10-08T00:13:13.555Z.
I am trying the aggregation below:
{
"size": 0,
"query": {
"filter": {
"bool": {
"must": [
{
"range": {
"#timestamp": {
"gte": "2021-10-05T00:47:13.555Z",
"lte": "2021-10-08T00:13:13.555Z",
"format": "strict_date_optional_time"
}
}
}
]
}
}
},
"aggs": {
"data": {
"date_histogram": {
"field": "#timestamp",
"calendar_interval": "day"
}
}
}
}
The expected output is:
For 2021-10-06 I should get 2 documents, for 2021-10-07 I should get 1 document, and for days with no documents I should get a count of 0.
The solution below works:
{
"size":0,
"query":{
"bool":{
"must":[
],
"filter":[
{
"match_all":{
}
},
{
"range":{
"#timestamp":{
"gte":"2021-10-05T00:47:13.555Z",
"lte":"2021-10-08T00:13:13.555Z",
"format":"strict_date_optional_time"
}
}
}
],
"should":[
],
"must_not":[
]
}
},
"aggs":{
"data":{
"date_histogram":{
"field":"#timestamp",
"fixed_interval":"12h",
"time_zone":"Asia/Calcutta",
"min_doc_count":1
}
}
}
}
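Note that the query above uses 12-hour buckets with min_doc_count set to 1, so empty buckets are left out. For the one-day buckets with zero counts described in the question, a variant could be built as in the Python sketch below. It relies on the date_histogram options calendar_interval, min_doc_count and extended_bounds; the helper name daily_count_query is hypothetical, and the body has not been run against a live cluster here.

```python
import json


def daily_count_query(field, start, end):
    """Build a search body that counts documents per day, including empty days.

    min_doc_count=0 keeps buckets with no documents, and extended_bounds
    asks for buckets across the whole requested range even where no
    document exists.
    """
    return {
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    {
                        "range": {
                            field: {
                                "gte": start,
                                "lte": end,
                                "format": "strict_date_optional_time",
                            }
                        }
                    }
                ]
            }
        },
        "aggs": {
            "data": {
                "date_histogram": {
                    "field": field,
                    "calendar_interval": "day",
                    "min_doc_count": 0,
                    "extended_bounds": {"min": start, "max": end},
                }
            }
        },
    }


body = daily_count_query(
    "#timestamp", "2021-10-05T00:47:13.555Z", "2021-10-08T00:13:13.555Z"
)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a _search request; days inside the bounds without documents should then come back with a doc_count of 0.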

JsonTransformation using Jolt

I'm using Jolt with Java (https://github.com/bazaarvoice/jolt) to transform an external JSON document into a format that I can understand.
My problem is that the structure keeps changing, which makes my spec more and more complex.
I want to extract all the fields called "path", no matter the structure.
Does anyone have an idea how I can do that?
Example of structure:
{
"groups": {
"rows": {
"fieldSets": {
"fields": [{
"path": "example"
}]
}
}
}
}
or
{
"groups": {
"rows": {
"rowsets": {
"fieldSets": {
"fields": [{
"path": "example"
}]
}
}
}
}
}
or
{
"groups": {
"fieldSets": {
"fields": [{
"path": "example"
}]
}
}
}
In the end, I just want an array with the plain "path" values.
I am also new to Jolt, but I tried based on what I understood from your question. Try the spec below:
[
{
"operation": "shift",
"spec": {
"groups": {
"rows": {
"fieldSets": {
"fields": {
"*": {
"path": "path"
}
}
}
}
}
}
}
]
If this is not what you are expecting, please provide the desired output JSON so that I can understand what you want.
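The spec above only covers the first of the three structures (groups, rows, fieldSets, fields). If the extraction does not have to stay inside Jolt, one alternative is a plain recursive walk over the parsed JSON that collects every value stored under the key "path", whatever the nesting. The Python sketch below illustrates that idea; it is not a Jolt spec, and the function name collect_paths is hypothetical.

```python
import json


def collect_paths(node, key="path"):
    """Return every value stored under `key`, at any depth, in document order."""
    found = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                # Keep the value as is; it is not searched any further.
                found.append(value)
            else:
                found.extend(collect_paths(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_paths(item, key))
    # Scalars (strings, numbers, booleans, None) contain nothing to collect.
    return found


# The second structure from the question, with the extra "rowsets" level.
document = json.loads("""
{
  "groups": {
    "rows": {
      "rowsets": {
        "fieldSets": {
          "fields": [{"path": "example"}]
        }
      }
    }
  }
}
""")

paths = collect_paths(document)
print(paths)
```

Because the walk ignores the names of the intermediate levels, the same function handles all three example structures without any change.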

Query if key exists in an ElasticSearch hash

How do I check if my query terms are the keys in one of my fields? For example, here's a stored document:
{
field1: "some value",
field2: "some other value",
field3: {
something: [1,2],
else: [2,3]
}
}
The query "something" should return that document. The query "some value" should also return that document. Here's what I have so far:
{
query: {
filtered: {
query: {
multi_match: {
query: query,
fields: ['field1', 'field2'],
operator: 'and'
}
},
filter: {
or: [
{
exists: { field: "field3"}
}
]
}
}
}
}
Assuming you want the words of "some value" to match next to each other (as a phrase), the following should work:
{
"query": {
"filtered": {
"filter": {
"exists": {
"field": "field3"
}
},
"query": {
"bool": {
"should": [
{
"bool": {
"must": [
{
"match_phrase": {
"field1": "some value"
}
},
{
"match_phrase": {
"field2": "some value"
}
}
]
}
},
{
"multi_match": {
"query": "something",
"fields": [
"field1",
"field2"
],
"operator": "and"
}
}
]
}
}
}
}
}
{
query: {
filtered: {
filter: {
or: [
{
query: {
multi_match: {
query: query,
fields: ['field1', 'field2'],
operator: "and"
}
}
},
{
exists: { field: "field3.query" }
}
]
}
}
}
}
The only caveat is that if query is a string with multiple terms (or an array), you'll have to create an exists filter for each term.
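The caveat about multiple terms can be handled by generating the exists filters programmatically. The Python sketch below splits the query on whitespace and emits one exists filter per term, reusing the legacy filtered/or syntax from the answer above (newer Elasticsearch versions no longer accept the filtered query). The function name and its parameters are hypothetical.

```python
import json


def build_key_or_value_query(query, text_fields=("field1", "field2"), hash_field="field3"):
    """Match documents whose text fields contain the query, or whose hash
    field has a key equal to any single term of the query."""
    terms = query.split()
    # One exists filter per term, e.g. field3.something, field3.else.
    exists_filters = [
        {"exists": {"field": f"{hash_field}.{term}"}} for term in terms
    ]
    text_match = {
        "query": {
            "multi_match": {
                "query": query,
                "fields": list(text_fields),
                "operator": "and",
            }
        }
    }
    return {
        "query": {
            "filtered": {
                "filter": {"or": [text_match] + exists_filters}
            }
        }
    }


search_body = build_key_or_value_query("something else")
print(json.dumps(search_body, indent=2))
```

Splitting on whitespace is a simplification; if the query arrives as an array, the terms list can be taken from it directly.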

Elasticsearch aggs: how to set the 'from' param?

Elasticsearch aggregation: How can I set the 'from' parameter, not just the size, for the result of an aggregation?
Do you mean like this:
{
"query": {
"filtered": {
"query": {
"bool": {
"must": [
{ "range": { "timestamp": { "from": "now-180d", "to": "now" } } }
]
}
}
}
}
}