Elasticsearch - tagging strength (nested/child document boosting) - NoSQL

Given the popular example of a post that has a collection of tags, let's say we want each tag to be more than a string: a tuple of a string and a double that signifies the strength of said tag.
How would one query posts and score them based on the sum of tag strengths (let's assume we are searching for exact terms in the tag names)?

It can be done by indexing tags as nested documents and then using the nested query in combination with the custom score query. In the example below, the terms query finds matching tags, the custom score query uses the value of the "weight" field of the "tags" documents as the score, and the nested query uses the sum of these scores as the final score for the top-level document.
curl -XDELETE 'http://localhost:9200/test-idx'
echo
curl -XPUT 'http://localhost:9200/test-idx' -d '{
    "mappings": {
        "doc": {
            "properties": {
                "title": { "type": "string" },
                "tags": {
                    "type": "nested",
                    "properties": {
                        "tag": { "type": "string", "index": "not_analyzed" },
                        "weight": { "type": "float" }
                    }
                }
            }
        }
    }
}'
echo
curl -XPUT 'http://localhost:9200/test-idx/doc/1' -d '{
    "title": "1",
    "tags": [
        { "tag": "A", "weight": 1 },
        { "tag": "B", "weight": 2 },
        { "tag": "C", "weight": 4 }
    ]
}'
echo
curl -XPUT 'http://localhost:9200/test-idx/doc/2' -d '{
    "title": "2",
    "tags": [
        { "tag": "B", "weight": 2 },
        { "tag": "C", "weight": 3 }
    ]
}'
echo
curl -XPUT 'http://localhost:9200/test-idx/doc/3' -d '{
    "title": "3",
    "tags": [
        { "tag": "B", "weight": 2 },
        { "tag": "D", "weight": 4 }
    ]
}'
echo
curl -XPOST 'http://localhost:9200/test-idx/_refresh'
echo
# Example with custom script (slower but more flexible)
curl -XGET 'http://localhost:9200/test-idx/doc/_search?pretty=true' -d '{
    "query": {
        "nested": {
            "path": "tags",
            "score_mode": "total",
            "query": {
                "custom_score": {
                    "query": {
                        "terms": {
                            "tag": ["A", "B", "D"],
                            "minimum_match": 1
                        }
                    },
                    "script": "doc['\''weight'\''].value"
                }
            }
        }
    },
    "fields": []
}'
echo
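On more recent Elasticsearch versions the custom_score query was removed in favor of function_score. A roughly equivalent query might look like the sketch below; this is an untested adaptation, the mapping above would also need current text/keyword types, and the exact script syntax depends on your version and scripting language.
curl -XGET 'http://localhost:9200/test-idx/_search?pretty=true' -H 'Content-Type: application/json' -d '{
    "query": {
        "nested": {
            "path": "tags",
            "score_mode": "sum",
            "query": {
                "function_score": {
                    "query": { "terms": { "tags.tag": ["A", "B", "D"] } },
                    "script_score": { "script": "doc['\''tags.weight'\''].value" }
                }
            }
        }
    }
}'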

Related

Merge 2 JSON objects from 2 files using jq

I have two json files
1.json
{
  "outputs": {
    "item1": {
      "name": "name1",
      "email": "email1"
    }
  }
}
2.json
{
  "outputs": {
    "item2": {
      "name": "name2",
      "email": "email2"
    }
  }
}
I'm trying to merge them using jq
jq -s '{
  "Items" :
    {
      "list" : .[] | .outputs ,
    },
}' 1.json 2.json
and I get just two Items objects, but I want to have one Items object and all item* merged like this
{
  "Items": {
    "objects": {
      "item1": {
        "name": "name1",
        "email": "email1"
      },
      "item2": {
        "name": "name2",
        "email": "email2"
      }
    }
  }
}
I've tried the .[0] * .[1] trick, but I cannot put it into the object construction.
How can I do this with jq?
Try slurping the inputs (-s) and merging the extracted outputs objects with add:
jq -s '{Items: {objects: map(.outputs) | add}}' 1.json 2.json
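Here -s (--slurp) reads both files into a single array, map(.outputs) pulls out each outputs object, and add merges that array of objects into one object. The intermediate step looks roughly like this:
jq -s 'map(.outputs)' 1.json 2.json
# => [ { "item1": { ... } }, { "item2": { ... } } ]
# add then folds the array into a single object:
# => { "item1": { ... }, "item2": { ... } }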
Another approach could be using reduce with inputs and the -n flag:
jq -n 'reduce inputs.outputs as $i ({}; .Items.objects += $i)' 1.json 2.json
Output:
{
  "Items": {
    "objects": {
      "item1": {
        "name": "name1",
        "email": "email1"
      },
      "item2": {
        "name": "name2",
        "email": "email2"
      }
    }
  }
}

The result of mongoexport is not valid JSON

I use this command to export data from the db, but the output is not valid JSON: there are no commas between the items. How do I fix this?
https://docs.mongodb.com/database-tools/mongoexport/#syntax
{
"id" : "1",
"name" : "a"
}
{
"id" : "2",
"name" : "b"
}
{
"id": "3",
"name": "c"
}
{
"id": "4",
"name": "d"
}
Use the --jsonArray option (see https://docs.mongodb.com/database-tools/mongoexport/#std-option-mongoexport.--jsonArray )
mongoexport --quiet -d test -c test --pretty --jsonArray
[{
"_id": {
"$oid": "611aca090848cb8cab2943f7"
},
"id": "1",
"name": "a"
},
{
"_id": {
"$oid": "611aca110848cb8cab2943f8"
},
"id": "2",
"name": "b"
},
{
"_id": {
"$oid": "611aca180848cb8cab2943f9"
},
"id": "3",
"name": "c"
},
{
"_id": {
"$oid": "611aca1d0848cb8cab2943fa"
},
"id": "4",
"name": "d"
}]
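If you are stuck with the default newline-delimited output (one JSON document per line), you can also wrap it into an array after the fact with jq's slurp mode. A sketch, assuming jq is installed and out.json is the exported file:
mongoexport --quiet -d test -c test -o out.json
jq -s '.' out.json > out-array.json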

MongoDB - filter collection by string array containing ""

For the documents below, I want to write a MongoDB query to get the result.
[{
"id": "1",
"class": "class1",
"value": "xyz"
}, {
"id": "2",
"class": "class2",
"value": "abc"
}, {
"id": "3",
"class": "class3",
"value": "123"
}, {
"id": "4",
"class": "class4"
}, {
"id": "5",
"class": "class5",
"value": ""
}
]
The search parameter is an array of values - ["abc", "xyz", ""] - which is matched against the attribute "value".
The output should be as below; in this case, the third item in the search array ("") should also match the documents with "id" 4 and 5:
[{
"id": "1",
"class": "class1",
"value": "xyz"
}, {
"id": "2",
"class": "class2",
"value": "abc"
}, {
"id": "4",
"class": "class4"
}, {
"id": "5",
"class": "class5",
"value": ""
}
]
Please assist with the MongoDB query to get a result like this.
Whenever you have a blank string in the search array, add null to it as well; $in with null matches documents where the field is either null or missing (covering "id" 4), while "" matches the empty string (covering "id" 5):
db.collection.find({
value: {
$in: ["abc", "xyz", "", null]
}
})
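If you'd rather be explicit about documents where the field is missing, instead of relying on null also matching absent fields, an equivalent but more verbose form uses $or with $exists (a sketch against the same collection):
db.collection.find({
  $or: [
    { value: { $in: ["abc", "xyz", ""] } },
    { value: { $exists: false } }
  ]
})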

What is the proper way to create a vertex with a set property in Bluemix Graph DB?

I am trying to create a new vertex in the Bluemix Graph DB service. The schema of my DB is as follows.
{"propertyKeys":[{"name":"name","dataType":"String","cardinality":"SINGLE"},{"name":"languages","dataType":"String","cardinality":"SET"},{"name":"picture","dataType":"String","cardinality":"SINGLE"},{"name":"preferred_language","dataType":"String","cardinality":"SINGLE"},{"name":"bytes","dataType":"Integer","cardinality":"SINGLE"},{"name":"github_id","dataType":"String","cardinality":"SINGLE"},{"name":"twitter_id","dataType":"String","cardinality":"SINGLE"},{"name":"language_percentage","dataType":"Float","cardinality":"SINGLE"}],"vertexLabels":[{"name":"person"},{"name":"language"}],"edgeLabels":[{"name":"codes_in","multiplicity":"MULTI"},{"name":"used_by","multiplicity":"MULTI"}],"vertexIndexes":[{"name":"vByName","propertyKeys":["name"],"composite":true,"unique":false},{"name":"vByPreferredLang","propertyKeys":["preferred_language"],"composite":true,"unique":false},{"name":"vByLanguages","propertyKeys":["languages"],"composite":false,"unique":false}],"edgeIndexes":[{"name":"eByName","propertyKeys":["name"],"composite":true,"unique":false},{"name":"eByLanguagePercentage","propertyKeys":["language_percentage"],"composite":true,"unique":false}]}
I am trying to create the vertex with the following POST body
{"name":"Bob","languages":["Node","Python"],"picture":"https://en.gravatar.com/userimage/12148147/46ccae88e5aae747d53e0b1863f72a4e.jpg?size=200","preferred_language":"Node","github_id":"Bob","twitter_id":"Bob"}
However this results in the following error
{"code":"BadRequestError","message":"Property 'languages' with meta properties need to have a 'val'"}
The languages property has a cardinality of SET; what is the right way to create a property with SET cardinality? I would have assumed it was a JSON array.
Ryan, SET isn't a data type. You could also make languages a string with delimited values.
The only data types supported in the Beta release are: String, Integer, Boolean, Float.
The issue is that you're attempting to create a single vertex property with a data type of List<String>, which is not supported in IBM Graph (only JSON-primitive types are supported). To take advantage of a property with SET cardinality you'll need to create multiple vertex properties.
It turns out that the distinction between cardinalities and data types in TinkerPop can be a bit confusing. Here's an example that should clarify things:
$ curl https://ibmgraph/11/g/schema -XPOST -Hcontent-type:application/json -d '{"propertyKeys":[{"name":"languages","dataType":"String","cardinality":"SET"}]}' | jq .
{
  "requestId": "9e0ea947-f9a1-407b-ab1a-cd9b7fd5d561",
  "status": {
    "message": "",
    "code": 200,
    "attributes": {}
  },
  "result": {
    "data": [
      {
        "propertyKeys": [
          {
            "name": "languages",
            "dataType": "String",
            "cardinality": "SET"
          }
        ],
        "vertexLabels": [],
        "edgeLabels": [],
        "vertexIndexes": [],
        "edgeIndexes": []
      }
    ],
    "meta": {}
  }
}
$ curl https://ibmgraph/11/g/vertices -XPOST | jq .
{
  "requestId": "2ce85907-2aca-4630-876f-31775e74e1de",
  "status": {
    "message": "",
    "code": 200,
    "attributes": {}
  },
  "result": {
    "data": [
      {
        "id": 4112,
        "label": "vertex",
        "type": "vertex",
        "properties": {}
      }
    ],
    "meta": {}
  }
}
$ curl https://ibmgraph/11/g/vertices/4112 -XPOST -Hcontent-type:application/json -d '{"languages":"Node"}' | jq .
{
  "requestId": "52ad6d49-46c9-41aa-9928-5a567099d773",
  "status": {
    "message": "",
    "code": 200,
    "attributes": {}
  },
  "result": {
    "data": [
      {
        "id": 4112,
        "label": "vertex",
        "type": "vertex",
        "properties": {
          "languages": [
            {
              "id": "si-368-sl",
              "value": "Node"
            }
          ]
        }
      }
    ],
    "meta": {}
  }
}
$ curl https://ibmgraph/11/g/vertices/4112 -XPOST -Hcontent-type:application/json -d '{"languages":"Python"}' | jq .
{
  "requestId": "19886949-6328-4e19-8cac-8fdab37ef2a5",
  "status": {
    "message": "",
    "code": 200,
    "attributes": {}
  },
  "result": {
    "data": [
      {
        "id": 4112,
        "label": "vertex",
        "type": "vertex",
        "properties": {
          "languages": [
            {
              "id": "si-368-sl",
              "value": "Node"
            },
            {
              "id": "16q-368-sl",
              "value": "Python"
            }
          ]
        }
      }
    ],
    "meta": {}
  }
}
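To verify that both values ended up in the languages SET, the vertex can be read back. Assuming the service also accepts a GET on the same vertices path (which matches the pattern above, but is an assumption here), something like:
$ curl https://ibmgraph/11/g/vertices/4112 | jq .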

elasticsearch: special behaviour of _id field?

I have some Twitter data I want to work with. I want to be able to search for a name. When trying to generate ngrams of the 'name' and '_id' fields I run into some trouble.
First, I created the analyzers:
curl -XPUT 'localhost:9200/twitter_users' -d '
{
  "settings": {
    "analysis": {
      "analyzer": {
        "str_search_analyzer": {
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        },
        "str_index_analyzer": {
          "tokenizer": "keyword",
          "filter": ["lowercase", "ngram"]
        }
      },
      "filter": {
        "ngram": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 20
        }
      }
    }
  }
}'
Then I defined my mappings:
curl -XPUT 'http://localhost:9200/twitter_users/users/_mapping' -d '
{
  "users": {
    "type": "object",
    "properties": {
      "_id": {
        "type": "string",
        "copy_to": "id"
      },
      "id": {
        "type": "string",
        "search_analyzer": "str_search_analyzer",
        "index_analyzer": "str_index_analyzer",
        "index": "analyzed"
      },
      "name": {
        "type": "multi_field",
        "fields": {
          "name": {
            "type": "string",
            "index": "not_analyzed"
          },
          "ngrams": {
            "type": "string",
            "search_analyzer": "str_search_analyzer",
            "index_analyzer": "str_index_analyzer",
            "index": "analyzed"
          }
        }
      }
    }
  }
}'
and inserted some test data:
curl -XPUT "localhost:9200/twitter_users/users/johndoe" -d '{
"_id" : "johndoe",
"name" : "John Doe"
}'
curl -XPUT "localhost:9200/twitter_users/users/janedoe" -d '{
"_id" : "janedoe",
"name" : "Jane Doe"
}'
querying by name gets me the expected results:
curl -XPOST "http://localhost:9200/twitter_users/users/_search" -d '{
"query": {
"match": {
"name.ngrams": "doe"
}
}
}'
but querying on the id gives me no results:
curl -XPOST "http://localhost:9200/twitter_users/users/_search" -d '{
"query": {
"match": {
"id": "doe"
}
}
}'
I also tried making _id a multi_field like I did with name, but that didn't work either.
Is _id behaving differently than other fields? Or am I doing something wrong here?
Edit: using Elasticsearch v1.1.2 and pulling the data from MongoDB with a river plugin.
Thanks for your help
Mirko
Looks like the copy_to is the issue: _id is a special metadata field and is not processed like a regular document field, so copying from it into id does not work as expected. Why not insert the id values into the id field directly?
curl -XPUT "localhost:9200/twitter_users/users/johndoe" -d '{
"id" : "johndoe",
"name" : "John Doe"
}'
curl -XPUT "localhost:9200/twitter_users/users/janedoe" -d '{
"id" : "janedoe",
"name" : "Jane Doe"
}'
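As a side note, if you want to sanity-check which tokens the ngram analyzer actually produces (and hence what a match on id or name.ngrams can find), the _analyze API can be pointed at the index analyzer; a quick check along these lines (1.x-era syntax):
curl -XGET 'localhost:9200/twitter_users/_analyze?analyzer=str_index_analyzer&text=janedoe&pretty=true'
The returned tokens should include "doe", which is what makes the partial-match query work once id is indexed as a regular field.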