elasticsearch: special behaviour of _id field?

I have some Twitter data I want to work with, and I want to be able to search for a name. When trying to generate ngrams of the 'name' and '_id' fields, I run into some trouble.
First, I created the analyzers:
curl -XPUT 'localhost:9200/twitter_users' -d '
{
"settings": {
"analysis": {
"analyzer": {
"str_search_analyzer": {
"tokenizer": "keyword",
"filter": [
"lowercase"
]
},
"str_index_analyzer": {
"tokenizer": "keyword",
"filter": [
"lowercase",
"ngram"
]
}
},
"filter": {
"ngram": {
"type": "ngram",
"min_gram": 3,
"max_gram": 20
}
}
}
}
}'
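As a quick sanity check, the _analyze API shows what str_index_analyzer emits (a sketch against the index just created; the exact token list depends on the input):
curl -XGET 'localhost:9200/twitter_users/_analyze?analyzer=str_index_analyzer&pretty' -d 'John Doe'
# keyword tokenizer + lowercase + ngram(3,20) should yield every
# substring of "john doe" between 3 and 20 characters, including "doe"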
Then I defined my mappings:
curl -XPUT 'http://localhost:9200/twitter_users/users/_mapping' -d '
{
"users": {
"type" : "object",
"properties": {
"_id": {
"type": "string",
"copy_to": "id"
},
"id": {
"type": "string",
"search_analyzer": "str_search_analyzer",
"index_analyzer": "str_index_analyzer",
"index": "analyzed"
},
"name": {
"type": "multi_field",
"fields": {
"name": {
"type": "string",
"index": "not_analyzed"
},
"ngrams": {
"type": "string",
"search_analyzer": "str_search_analyzer",
"index_analyzer": "str_index_analyzer",
"index": "analyzed"
}
}
}
}
}
}'
Then I inserted some test data:
curl -XPUT "localhost:9200/twitter_users/users/johndoe" -d '{
"_id" : "johndoe",
"name" : "John Doe"
}'
curl -XPUT "localhost:9200/twitter_users/users/janedoe" -d '{
"_id" : "janedoe",
"name" : "Jane Doe"
}'
Querying by name gets me the expected results:
curl -XPOST "http://localhost:9200/twitter_users/users/_search" -d '{
"query": {
"match": {
"name.ngrams": "doe"
}
}
}'
But querying on the id gives me no results:
curl -XPOST "http://localhost:9200/twitter_users/users/_search" -d '{
"query": {
"match": {
"id": "doe"
}
}
}'
I also tried making _id a multi_field like I did with name, but that didn't work either.
Is _id behaving differently than other fields? Or am I doing something wrong here?
Edit: I'm using Elasticsearch v1.1.2 and pulling the data from MongoDB with a river plugin.
Thanks for your help,
Mirko

Looks like the copy_to is the issue: in Elasticsearch 1.x the _id metadata field is not indexed like a regular field by default, so there is nothing for copy_to to copy at index time. Why not insert the id values into the id field directly?
curl -XPUT "localhost:9200/twitter_users/users/johndoe" -d '{
"id" : "johndoe",
"name" : "John Doe"
}'
curl -XPUT "localhost:9200/twitter_users/users/janedoe" -d '{
"id" : "janedoe",
"name" : "Jane Doe"
}'
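With the id supplied directly in the source like this, the original match query on id should return both users (an untested sketch; the ngram index analyzer breaks "johndoe" and "janedoe" into substrings that include "doe"):
curl -XPOST "localhost:9200/twitter_users/users/_search" -d '{
"query": {
"match": {
"id": "doe"
}
}
}'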

Related

Merge 2 JSON objects from 2 files using jq

I have two json files
1.json
{
"outputs": {
"item1": {
"name": "name1",
"email": "email1"
}
}
}
2.json
{
"outputs": {
"item2": {
"name": "name2",
"email": "email2"
}
}
}
I'm trying to merge them using jq
jq -s '{
"Items" :
{
"list" : .[] | .outputs ,
},
}' 1.json 2.json
but I get two separate Items objects, whereas I want a single Items object with all the item* entries merged, like this:
{
"Items": {
"objects": {
"item1": {
"name": "name1",
"email": "email1"
},
"item2": {
"name": "name2",
"email": "email2"
}
}
}
}
I've tried the .[0] * .[1] trick, but I cannot put it into the object construction.
How can I do this with jq?
Try map and add on the slurped (-s) array:
jq -s '{Items: {objects: map(.outputs) | add}}' 1.json 2.json
Another approach could be using reduce with inputs and the -n flag:
jq -n 'reduce inputs.outputs as $i ({}; .Items.objects += $i)' 1.json 2.json
Output:
{
"Items": {
"objects": {
"item1": {
"name": "name1",
"email": "email1"
},
"item2": {
"name": "name2",
"email": "email2"
}
}
}
}
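For exactly two input files, the .[0] * .[1] merge the question mentions can also be placed inside the object construction (a sketch; unlike map/add, it does not generalize to more inputs):
jq -s '{Items: {objects: (.[0].outputs * .[1].outputs)}}' 1.json 2.json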

the result of mongo export is not a valid json

I use this command to export data from the db, but the result is not valid JSON: there are no commas between the documents. How do I fix this?
https://docs.mongodb.com/database-tools/mongoexport/#syntax
{
"id" : "1",
"name" : "a"
}
{
"id" : "2",
"name" : "b"
}
{
"id": "3",
"name": "c"
}
{
"id": "4",
"name": "d"
}
Use the --jsonArray option (see https://docs.mongodb.com/database-tools/mongoexport/#std-option-mongoexport.--jsonArray )
mongoexport --quiet -d test -c test --pretty --jsonArray
[{
"_id": {
"$oid": "611aca090848cb8cab2943f7"
},
"id": "1",
"name": "a"
},
{
"_id": {
"$oid": "611aca110848cb8cab2943f8"
},
"id": "2",
"name": "b"
},
{
"_id": {
"$oid": "611aca180848cb8cab2943f9"
},
"id": "3",
"name": "c"
},
{
"_id": {
"$oid": "611aca1d0848cb8cab2943fa"
},
"id": "4",
"name": "d"
}]
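Alternatively, if the consumer can be adapted, the default mongoexport output is a stream of JSON documents (one per line without --pretty), which jq can turn into an array after the fact (a sketch; export.json is a hypothetical file holding the exported data):
# -s (--slurp) reads the whole stream of documents into a single JSON array
jq -s '.' export.json > array.json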

Elasticsearch mongodb and river indexing fail

Dear (I hope) saviors,
I have a very annoying problem indexing MongoDB data into Elasticsearch.
The situation is like this:
I have 2 MongoDB collections, records and pois, that I need to index in Elasticsearch using the river plugin (deprecated, I know).
(A record holds a DBRef reference to a poi and to other collections, called other_ref# here.)
Now, when I execute the curl calls, sometimes all record documents are indexed, sometimes just 200 (of 140k). Sometimes 900 poi documents are indexed, sometimes just 200 (never all of the roughly 70k).
So it seems that the script doesn't work properly.
I've monitored the /var/log/elasticsearch log, but no errors have been logged.
Here is the indexing script:
curl -XPUT "localhost:9200/lw_index_poi" -d '
{
"mappings": {
"lw_mapping_poi" : {
"properties" : {
"position" : {
"type" : "geo_shape"
},
"poi_id" : {
"type" : "string",
"index" : "not_analyzed"
}
}
}
}
}'
curl -XPUT "localhost:9200/lw_index_record" -d '
{
"mappings": {
"lw_mapping_record" : {
"date_detection": false,
"properties" : {
"other_ref1" : {
"type" : "string",
"index" : "not_analyzed"
},
"other_ref2" : {
"type" : "string",
"index" : "not_analyzed"
},
"poi_ref" : {
"type" : "string",
"index" : "not_analyzed"
}
}
}
}
}'
curl -XPUT "localhost:9200/_river/lw_index_poi/_meta" -d '
{
"type": "mongodb",
"mongodb": {
"servers":
[
{ "host": "mongodb", "port": 27017 }
],
"options": {
"secondary_read_preference" : false
},
"db": "lifewatch",
"collection": "poi",
"script": "if (ctx.document.decimalLatitude && ctx.document.decimalLongitude) { ctx.document.position = {}; ctx.document.position.type=\"Point\"; ctx.document.position.coordinates = [ctx.document.decimalLongitude, ctx.document.decimalLatitude]; } ctx.document.poi_id = ctx.document._id; delete ctx.document.decimalLatitude; delete ctx.document.decimalLongitude;"
},
"index": {
"name": "lw_index_poi",
"type": "lw_mapping_poi"
}
}'
curl -XPUT "localhost:9200/_river/lw_index_record/_meta" -d '
{
"type": "mongodb",
"mongodb": {
"servers":
[
{ "host": "mongodb", "port": 27017 }
],
"options": {
"secondary_read_preference" : false
},
"db": "lifewatch",
"collection": "record",
"script": "if (ctx.document.ref1) { ctx.document.ref1 = ctx.document.ref1.id; delete ctx.document.ref1;};if (ctx.document.poi) { ctx.document.poi_ref = ctx.document.poi.id; delete ctx.document.poi;};if (ctx.document.ref2) { ctx.document.ref2 = ctx.document.ref2.id; delete ctx.document.ref2;};"
},
"index": {
"name": "lw_index_record",
"type": "lw_mapping_record"
}
}'
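A quick way to see how far each import actually got is to compare the indexed document counts with the collection sizes, and to inspect the river's own documents (a diagnostic sketch; the _river index is a regular index, so it can be searched):
curl 'localhost:9200/lw_index_record/_count?pretty'
curl 'localhost:9200/lw_index_poi/_count?pretty'
curl 'localhost:9200/_river/lw_index_record/_search?pretty'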
What's wrong?
Thanks in advance

ElasticSearch autocomplete returning 0 hits

I am trying to build an autocomplete feature for our database running on MongoDB. We need to provide autocomplete which lets users complete their queries by offering suggestions while they are typing in the search box.
I have a collection of articles from various sources, which has the following fields:
{
"title" : "Its the title of a random article",
"cont" : { "paragraphs" : [ .... ] },
and so on..
}
I went through a video by Clinton Gormley. From 37:00 through 42:00, Gormley describes an autocomplete using edgeNGram. I also referred to this question to recognize that both are almost the same thing; just the mappings differ.
So, based on these experiences, I built almost identical settings and a mapping, and then restored the articles collection to ensure that it gets indexed by Elasticsearch.
The indexing scheme is as follows:
POST /title_autocomplete/title
{
"settings": {
"analysis": {
"filter": {
"autocomplete": {
"type": "edgeNGram",
"min_gram": 2,
"max_gram": 50
}
},
"analyzer": {
"title" : {
"type" : "standard",
"stopwords":[]
},
"autocomplete": {
"type" : "autocomplete",
"tokenizer": "standard",
"filter": ["lowercase", "autocomplete"]
}
}
}
},
"mappings": {
"title": {
"type": "multi_field",
"fields" : {
"title" : {
"type": "string",
"analyzer": "title"
},
"autocomplete" : {
"type": "string",
"index_analyzer": "autocomplete",
"search_analyzer" : "title"
}
}
}
}
}
But when I run the search query, I am unable to get any hits!
GET /title_autocomplete/title/_search
{
"query": {
"bool" : {
"must" : {
"match" : {
"title.autocomplete" : "Its the titl"
}
},
"should" : {
"match" : {
"title" : "Its the titl"
}
}
}
}
}
Can anybody please explain what's wrong with the query, mapping, or settings? I have been reading the Elasticsearch docs for over 7 days now but can't seem to get beyond full-text searches!
Elasticsearch version: 0.90.10
MongoDB version: v2.4.9
using _river
Ubuntu 12.04 64bit
UPDATE
I realised that the mapping is screwed up after applying the previous settings:
GET /title_autocomplete/_mapping
{
"title_autocomplete": {
"title": {
"properties": {
"analysis": {
"properties": {
"analyzer": {
"properties": {
"autocomplete": {
"properties": {
"filter": {
"type": "string"
},
"tokenizer": {
"type": "string"
},
"type": {
"type": "string"
}
}
},
"title": {
"properties": {
"type": {
"type": "string"
}
}
}
}
},
"filter": {
"properties": {
"autocomplete": {
"properties": {
"max_gram": {
"type": "long"
},
"min_gram": {
"type": "long"
},
"type": {
"type": "string"
}
}
}
}
}
}
},
"content": {
... paras and all ...
},
"title": {
"type": "string"
},
"url": {
"type": "string"
}
}
}
}
}
The analyzers and filters were actually mapped into a document after the settings were applied (the POST above indexed the settings body as a document instead of configuring the index), whereas the original title field is not affected at all! Is this normal?
I guess this explains why the query is not matching. There is no title.autocomplete field or title.title field at all.
So how should I proceed now?
For those facing this problem, it's better to delete the index and start again instead of wasting time with the _river, just as DrTech pointed out in the comments.
This saves time but is not a solution. (Therefore I'm not marking this as the answer.)
The key is to set up the mappings and index before you initiate the river.
We had an existing setup with a MongoDB river and an index called coresearch that we wanted to add autocomplete capacity to; this is the set of commands we used to delete the existing index and river and start again.
Stack is:
ElasticSearch 1.1.1
MongoDB 2.4.9
ElasticSearchMapperAttachments v2.0.0
ElasticSearchRiverMongoDb/2.0.0
Ubuntu 12.04.2 LTS
curl -XDELETE "localhost:9200/_river/node"
curl -XDELETE "localhost:9200/coresearch"
curl -XPUT "localhost:9200/coresearch" -d '
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
}
}'
curl -XPUT "localhost:9200/coresearch/_mapping/users" -d '{
"users": {
"properties": {
"firstname": {
"type": "string",
"search_analyzer": "standard",
"index_analyzer": "autocomplete"
},
"lastname": {
"type": "string",
"search_analyzer": "standard",
"index_analyzer": "autocomplete"
},
"username": {
"type": "string",
"search_analyzer": "standard",
"index_analyzer": "autocomplete"
},
"email": {
"type": "string",
"search_analyzer": "standard",
"index_analyzer": "autocomplete"
}
}
}
}'
curl -XPUT "localhost:9200/_river/node/_meta" -d '
{
"type": "mongodb",
"mongodb": {
"servers": [
{ "host": "127.0.0.1", "port": 27017 }
],
"options":{
"exclude_fields": ["time"]
},
"db": "users",
"gridfs": false,
"options": {
"import_all_collections": true
}
},
"index": {
"name": "coresearch",
"type": "documents"
}
}'
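To verify the setup, the _analyze API can show the tokens the autocomplete analyzer produces, and a match query can confirm partial-word hits once the river has imported some data (a sketch; the sample text "johndoe" is hypothetical):
# inspect the edge ngrams produced for a sample input (hypothetical text)
curl -XGET 'localhost:9200/coresearch/_analyze?analyzer=autocomplete&pretty' -d 'johndoe'
# expected tokens: "j", "jo", "joh", ..., "johndoe"
curl -XGET 'localhost:9200/coresearch/_search?pretty' -d '{
"query": { "match": { "username": "joh" } }
}'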

Elastic search - tagging strength (nested/child document boosting)

Given the popular example of a post that has a collection of tags, let's say that we want each tag to be more than a string: a tuple of a string and a double that signifies the strength of said tag.
How would one query posts and score them based on the sum of the tag strengths (let's assume we are searching for exact terms in the tag names)?
It can be done by indexing tags as nested documents and then using the nested query in combination with the custom_score query. In the example below, the terms query finds matching tags, the custom_score query uses the values of the "weight" field of the "tags" documents as scores, and the nested query uses the sum of these scores as the final score for the top-level document.
curl -XDELETE 'http://localhost:9200/test-idx'
echo
curl -XPUT 'http://localhost:9200/test-idx' -d '{
"mappings": {
"doc": {
"properties": {
"title": { "type": "string" },
"tags": {
"type": "nested",
"properties": {
"tag": { "type": "string", "index": "not_analyzed" },
"weight": { "type": "float" }
}
}
}
}
}
}'
echo
curl -XPUT 'http://localhost:9200/test-idx/doc/1' -d '{
"title": "1",
"tags": [{
"tag": "A",
"weight": 1
}, {
"tag": "B",
"weight": 2
}, {
"tag": "C",
"weight": 4
}]
}
'
echo
curl -XPUT 'http://localhost:9200/test-idx/doc/2' -d '{
"title": "2",
"tags": [{
"tag": "B",
"weight": 2
}, {
"tag": "C",
"weight": 3
}]
}
'
echo
curl -XPUT 'http://localhost:9200/test-idx/doc/3' -d '{
"title": "3",
"tags": [{
"tag": "B",
"weight": 2
}, {
"tag": "D",
"weight": 4
}]
}
'
echo
curl -XPOST 'http://localhost:9200/test-idx/_refresh'
echo
# Example with a custom script (slower but more flexible)
curl -XGET 'http://localhost:9200/test-idx/doc/_search?pretty=true' -d '{
"query" : {
"nested": {
"path": "tags",
"score_mode": "total",
"query": {
"custom_score": {
"query": {
"terms": {
"tag": ["A", "B", "D"],
"minimum_match" : 1
}
},
"script" : "doc['\''weight'\''].value"
}
}
}
},
"fields": []
}'
echo
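For the sample documents above, the expected ranking can be worked out by hand, since score_mode "total" sums the matching tag weights:
# terms searched: A, B, D
# doc 1: A(1) + B(2) = 3
# doc 2: B(2)        = 2
# doc 3: B(2) + D(4) = 6
# => expected order: doc 3, doc 1, doc 2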