MongoDB $text search not returning expected results

I created a text index on the title and main_body fields of my Mongo collection. For instance, I have an article with the title "Abby Bengtsson", and her first name "Abby" appears throughout the article text in main_body.
A text search query, {$text: {$search: 'abby bengtsson'}}, returns the desired article along with a couple more.
But querying only her first name, {$text: {$search: 'abby'}}, returns nothing.
I have tried MongoDB Compass, Studio 3T, and terminal commands over ssh directly on the server.
I don't understand why this happens. The same goes for other keywords in other articles.
JSON Doc example
{
"_id" : ObjectId("5e0f4ded35fbd16f21bf3655"),
"category" : {
"category_id" : "5010",
"slug" : {
"0010" : "profiler",
"0020" : "profiler",
"0030" : "profiler"
},
"label" : {
"0010" : "Profiler",
"0020" : "Profiler",
"0030" : "Profiler"
},
"bg_color" : "#B12CA6",
"txt_color" : "#ffffff",
"main_category_id" : "5000"
},
"featured_image" : {
"main" : "https://img.norrbom.com/article/5e0f4d5e35fbd16f21bf3653/78805a221a988e79ef3f42d7c5bfd418-1578061277668/abby.jpg",
"mobile" : "https://img.norrbom.com/article/5e0f4d5e35fbd16f21bf3653/78805a221a988e79ef3f42d7c5bfd418-1578061277668/abby.jpg",
"square" : "https://img.norrbom.com/article/5e0f4d5e35fbd16f21bf3653/78805a221a988e79ef3f42d7c5bfd418-1578061277668/abby.jpg"
},
"metadata" : {
"title" : "Abby Bengtsson",
"description" : "Hon sprudlar av energi och glädje, vilket smittar av sig på hela redaktionen när hon kliver in hos En Sueco. Med sig har hon sin ursöta följeslagare pomeranianen Melwin",
"og" : {
"title" : "Abby Bengtsson",
"description" : "Hon sprudlar av energi och glädje, vilket smittar av sig på hela redaktionen när hon kliver in hos En Sueco. Med sig har hon sin ursöta följeslagare pomeranianen Melwin",
"image" : "https://img.norrbom.com/article/5e0f4d5e35fbd16f21bf3653/78805a221a988e79ef3f42d7c5bfd418-1578061277668/abby.jpg",
"type" : "article",
"site_name" : "En Sueco",
"url" : "https://www.ensueco.com/profil-abby-bengtsson"
},
"twitter" : {
"title" : "Abby Bengtsson",
"description" : "Hon sprudlar av energi och glädje, vilket smittar av sig på hela redaktionen när hon kliver in hos En Sueco. Med sig har hon sin ursöta följeslagare pomeranianen Melwin",
"card" : "summary",
"image" : "https://img.norrbom.com/article/5e0f4d5e35fbd16f21bf3653/78805a221a988e79ef3f42d7c5bfd418-1578061277668/abby.jpg"
}
},
"tags" : [
],
"title" : "Abby Bengtsson",
"state" : NumberInt(1),
"created" : ISODate("2020-01-01T04:17:00.000+0000"),
"modified" : ISODate("2020-01-01T08:27:54.000+0000"),
"version" : NumberInt(19),
"featured" : false,
"language" : "sv",
"magazines" : [
],
"slug" : "profil-abby-bengtsson",
"published" : ISODate("2020-01-02T10:14:00.000+0000"),
"published_until" : null,
"author_alias" : "Text: Sara Laine, sara#norrbom.com Foto: Mugge Fischer, mugge#norrbom.com",
"main_body" : "... stringified JSON object with article ...",
"article_id" : ObjectId("5e0f4d5e35fbd16f21bf3653"),
"origin" : "cms",
"site" : "0020",
"__v" : NumberInt(0)
}
EDIT 18-01-2020
I just tested something. It seems that this issue only occurs for documents where the language property is set to sv (Swedish, as per the MongoDB language documentation). If I change the value to da (Danish), the document is returned when I search for "Abby".
I have currently solved my issue in production by setting language_override to a dummy field that doesn't exist. Now all documents are returned as they should be. But the behaviour with the Swedish language value still confuses me, as it ONLY happens when I set the field to "sv". What sense does it make to have documents in multiple languages, and a text index that is supposed to search based on locale, if it doesn't work for one particular language value?

What version of MongoDB are you using? The functionality has changed a bit from version to version. See https://docs.mongodb.com/manual/core/index-text/#versions for more details.
I tested this out in 4.2 and got the results you would expect.
To test this, I created a free cluster in Atlas (cloud.mongodb.com) and loaded the sample data. Then I navigated to the Collections tab. The sample data contains a database named "sample_mflix" with a collection called "movies". My collection had a default text index named cast_text_fullplot_text_genres_text_title_text, which covers the cast, fullplot, genres and title fields.
Then I navigated to the Find tab. When I ran the searches you described, I got the results you would expect. Both {$text: {$search: 'abby bengtsson'}} and {$text: {$search: 'abby'}} return many results.
Update based on new information added to original question on 18-01-20
I spoke with a colleague who explained to me what is going on:
It is worth noting that text search is designed for stemming with language heuristics. This will have unexpected outcomes with proper nouns like "Abby" (and with multi-language search).
Using query explain output for insight, this is what is happening:
- Abby stems to abby in Swedish but abbi in English, so the term is indexed as abby given the language value of sv in the document.
- A search without any language will default to English (rather than trying to stem in all possible languages) so a default search will not match the indexed term.
To search matching the indexed language, they would have to provide a language value, e.g.: db.articles.find({$text: {$search: 'abby', $language: 'sv'}}).
This is working as designed but doesn't match the user's expectation that queries would be stemmed to match all possible languages (which is probably an unhelpful outcome in terms of relevance).
What they actually want is the solution they arrived at: they should index with a language of none for simple tokenisation without stemming or stopwords.
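Both options described above can be sketched in mongo shell style JavaScript. This is only a sketch under assumptions: the collection name articles is taken from the query above, buildTextFilter and the dummy field name text_language_unused are hypothetical, and the db calls are left as comments because they need a running server.

```javascript
// Option 1: keep the per-document language and pass it at query time.
// buildTextFilter is a hypothetical helper, not part of any driver.
function buildTextFilter(term, language) {
  const filter = { $text: { $search: term } };
  if (language) {
    filter.$text.$language = language; // e.g. 'sv' stems the query as Swedish
  }
  return filter;
}

// Option 2: index without stemming or stop words.
// 'none' disables language-specific processing; language_override points at a
// field that no document has, so the "language": "sv" field is ignored.
const textIndexKeys = { title: 'text', main_body: 'text' };
const textIndexOptions = {
  default_language: 'none',
  language_override: 'text_language_unused' // assumed dummy field name
};

// In the shell, assuming a collection named articles:
// db.articles.createIndex(textIndexKeys, textIndexOptions);
// db.articles.find(buildTextFilter('abby'));        // option 2, no stemming
// db.articles.find(buildTextFilter('abby', 'sv'));  // option 1, Swedish stemming
```

A collection can have only one text index, so the existing index would have to be dropped before creating one with different options.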

MongoListener + Spring Detect updated fields in Document

I have a Spring Boot application with MongoDB, and I need to audit every update made to specified fields of a collection (for data analysis purposes).
If I have a document like:
{
"_id" : ObjectId("12345678910"),
"label_1" : ObjectId("someIdForLabel1"),
"label_2" : ObjectId("someIdForLabel2"),
"label_3" : ObjectId("someIdForLabel"),
"name": "my data",
"description": "some curious stuff",
"updatedAt" : ISODate("2022-06-21T08:28:23.115Z")
}
I want to write an audit document whenever a label_* field is updated, something like:
{
"_id" : ObjectId("111213141516"),
"modifiedDocument" : ObjectId("12345678910"),
"modifiedLabel" : "label_1",
"newValue" : ObjectId("someNewIdForLabel1"),
"updatedBy" : ObjectId("userId"),
"updatedAt" : ISODate("2022-06-21T08:31:20.315Z")
}
How can I achieve this with a MongoListener? I already have two methods, for AfterSave and AfterDelete, for other purposes, but they give me the whole new document.
I would rather avoid querying the DB again or using findAndModify() in the first place.
I also looked at Change Streams, but I have too many doubts about how they behave with more than one instance.
Thank you so much, any tip will be appreciated!

MongoDB: Find an element in a document that contains an object with a subobject with an array

The more I discover about MongoDB, the more puzzled I get by how overcomplicated and badly designed its query writing is. Anyway, I have this kind of document in a db with thousands of records:
db.messages.aggregate([{$limit: 1}]).pretty()
{
"_id" : ObjectId("4f16fc97d1e2d32371003f42"),
"body" : "Hey Gillette,\n\nThe heat rate is going to depend on the type of fuel and the construction \ndate of the unit. Unfortunately, most of that info is proprietary. \n\nChris Gaskill is the head of our fundamentals group and he might be able to \nsupply you with some of the guidelines.\n\n-Bass\n\n\n \n\tEnron North America Corp.\n\t\n\tFrom: Lisa Gillette 04/05/2001 02:31 PM\n\t\n\nTo: Eric Bass/HOU/ECT#ECT\ncc: \nSubject: Power Generation Question\n\nHey Bass,\n\nI have a question and I am hoping you can help me. I am wanting to compile a \nlist of all the different types of power plants and their respective heat \nrates to determine some sort of generation ratio.\n\ni.e. Coal 4 mmbtu = 1 MW\n Simple Cycle 11 mmbtu = 1 MW\n\nPlease let me know if you can help me or point me to someone who can. Just \nFYI...Bryan suggested that I call you so blame him as you curse me under your \nbreath right now.\n\nThanks,\nLisa\n\n",
"filename" : "1045.",
"headers" : {
"Content-Transfer-Encoding" : "7bit",
"Content-Type" : "text/plain; charset=us-ascii",
"Date" : ISODate("2001-04-05T14:45:00Z"),
"From" : "eric.bass#enron.com",
"Message-ID" : "<2106897.1075854772243.JavaMail.evans#thyme>",
"Mime-Version" : "1.0",
"Subject" : "Re: Power Generation Question",
"To" : [
"lisa.gillette#enron.com"
],
"X-FileName" : "ebass.nsf",
"X-Folder" : "\\Eric_Bass_Jun2001\\Notes Folders\\Sent",
"X-From" : "Eric Bass",
"X-Origin" : "Bass-E",
"X-To" : "Lisa Gillette",
"X-bcc" : "",
"X-cc" : ""
},
"mailbox" : "bass-e",
"subFolder" : "sent"
}
I need to find records sent from address X to address Y.
I managed to match the "From" records with
db.messages.find({"headers.From": "eric.bass#enron.com"}).pretty().count()
But I can't match the "To" records (and I need to match both together).
To query the "To" field I've tried:
db.messages.find({headers: {$elemMatch :{ "To": "lisa.gillette#enron.com"}}})
But it returns nothing.
What am I missing?
Thanks
$elemMatch applies to array fields. Here headers is an embedded document and the array is headers.To, so the operator has to be given the array field and a matching condition. In your case it would be:
db.messages.find({"headers.To": {$elemMatch :{$eq:"lisa.gillette#enron.com"}}})
$elemMatch is most useful when several conditions must hold for the same array element. With only a single condition in the $elemMatch expression, $elemMatch is not needed and a plain equality match works:
db.messages.find({"headers.To": "lisa.gillette#enron.com"});
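Since the question asks for both conditions together, here is a minimal sketch of a combined filter. Several fields in one filter document are combined with an implicit AND. The helper name fromToFilter is hypothetical, and the shell call is left as a comment because it needs a running server.

```javascript
// Hypothetical helper: builds a filter for messages sent from one address
// to another, using the field names from the sample document.
function fromToFilter(from, to) {
  return {
    'headers.From': from, // plain equality on a nested string field
    'headers.To': to      // equality on an array field matches any element
  };
}

// In the shell:
// db.messages.find(fromToFilter('eric.bass#enron.com', 'lisa.gillette#enron.com')).count();
```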

REST: Context between child and parent

Take the following URIs as an example:
/tracks
/tracks/:id
/playlists
/playlists/:id
/playlists/:id/tracks
I have a question about the last URI (/playlists/:id/tracks). How do I add extra information/context to the track objects in relation to their parent playlist?
Examples of context:
Added time of the track to the playlist
Play count of the track within the playlist
Likes per track within the playlist
All tracks have a created timestamp, play count and likes on a global scale. So my question is: how would this information be added to the response of the endpoint?
I've come up with following for now:
{
"title" : "harder better faster stronger",
"artist" : "daft punk",
"likes" : 234252,
"created_at" : "2012-10-03 09:57:04",
"play_count" : 1203200035,
"relation_to_parent": {
"likes" : 5,
"created_at" : "2014-11-07 19:21:64",
"play_count" : 20
}
}
I've added a field called relation_to_parent, which adds some context to the relation between the child and its parent. I'm not sure, though, if this is a good way to do it. I hope to hear some other solutions.
For 1:n relations you can define a subresource. For n:m relations it is better to define a separate relationship resource. Note that these are just best practices, not standards.
Be aware that you can add links pointing to a different resource. According to the HATEOAS constraint you have to create hyperlinks if you want to expose an operation (for example getting another resource).
I don't think there is a 'one true way' to do this. Personally, I dislike adding the extra information like that, since you are returning a resource-plus when the client asked for a resource. In any case, are 'likes', 'created_at' and 'play_count' actually part of the relation to the parent? Aren't they part of the track itself?
The two paths I usually see for this are:
/playlist/:id/tracks - returns a list of IDs (or URLs) for actual tracks, which you then fetch with /tracks/:track
/playlist/:id/tracks - returns the actual tracks, as if you did both steps in 1 above.
As for additional information, if it is not part of the tracks, you might do it as (any of these is valid):
info as part of the track, so /tracks/:track always returns the 'play_count' and 'likes', etc.
separate information, i.e. its own resource, if you want to keep the track clean. So you might get it at /tracks/:track/social_info or maybe /social_info/:track where it matches the track ID 1-to-1
If you have actual relation information, then it depends on whether it is 1:1, 1:N, N:1 or N:N. For 1:1, 1:N or N:1 you would probably report it as part of the resource itself, while N:N would either be part of the resource (JSON objects can have depth) or a separate resource.
Personally, I have done all of the above, and find cleaner is better, even if it is multiple retrievals. But now we are delving into opinion....
EDITED:
There are lots of ways to do N:N, here are just some:
/playlist/:id/tracks/:track/social_info - which could be embedded or a link to another object
/social_info/:playlist - more direct
/social_info/playlist/:id if you might have different kinds of social info
Personally (there is that word again; so much of this is personal preference and opinion), every time I have tried using deeper paths, thinking something only makes sense in a parent context, I have found myself ending up making its own resource for it, and linking back, so the 2nd or 3rd option ends up being what I do, with the first linking to it (either convenience to retrieve it or retrieve a list of it).
Mostly, that has not been because of constraints on the server side - e.g. when I write in nodejs, I use http://github.com/deitch/booster which handles multiple paths to the same resource really easily - but because client side frameworks often work better with a one true path.
If you want to fully embrace RESTful service design principles you definitely want to use hyperlinks in your representation format. JSON has some existing specifications if you prefer not to come up with your own: HAL and JSON API. A naive hypermedia format might look like this:
{
"playlist_id" : "666",
"created_at" : "2014-11-07 19:21:64",
"likes" : 5,
"tracks" : [
{"index" : 1,
"begin_at" : "00:02:00",
"end_at" : "00:05:23",
"_links" : {"track" : {
"href" : "/tracks/123",
"type" : "track"}}},
{"index" : 2,
"_links" : {"track" : {
"href" : "/tracks/432",
"type" : "track"}}},
{"index" : 3,
"_links" : {"track" : {
"href" : "/tracks/324",
"type" : "track"}}},
{"index" : 4,
"_links" : {"track" : {
"href" : "/tracks/567",
"type" : "track"}}}]
}
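As a rough sketch of how a server might assemble one entry of the tracks array above, the helper below keeps the playlist-specific fields next to a link to the global track resource. The function name playlistTrackEntry and its parameters are hypothetical; only the field names come from the example.

```javascript
// Hypothetical helper: builds one element of the "tracks" array shown above.
// Fields in `relation` describe the track within this playlist only
// (for example begin_at/end_at); global track data stays behind the link.
function playlistTrackEntry(index, trackId, relation = {}) {
  return {
    index,
    ...relation,
    _links: { track: { href: `/tracks/${trackId}`, type: 'track' } }
  };
}

const entry = playlistTrackEntry(1, '123', { begin_at: '00:02:00', end_at: '00:05:23' });
console.log(JSON.stringify(entry, null, 2));
```

Keeping the per-playlist values in the entry and the global values behind the link avoids the name clash between the two kinds of likes, created_at and play_count.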
More elaborate features are included in both HAL and JSON API, like defining embedded resources and link templates. Using such semantics you might end up with something like the following:
{
"id" : "666",
"created_at" : "2014-11-07 19:21:64",
"likes" : 5,
"tracks" : [
{"id" : "123",
"index" : 1,
"begin_at" : "00:02:00",
"end_at" : "00:05:23"},
{"id" : "432",
"index" : 2},
{"id" : "324",
"index" : 3},
{"id" : "567",
"index" : 4}
],
"_links" : {
"_self" : {
"href" : "/playlists/666",
"type" : "playlist"},
"tracks" : {
"href" : "/tracks/{id}",
"type" : "track"}
},
"_embedded" : {
"track" : [
{"id" : "123",
"title" : "harder better faster stronger",
"artist" : "daft punk",
"created_at" : "2012-10-03 09:57:04",
"likes" : 234252,
"play_count" : 1203200035},
{"id" : "432",
"title" : "aerodynamic",
"artist" : "daft punk",
"created_at" : "2009-03-07 11:11:11",
"likes" : 33056,
"play_count" : 8796539}
]
}
}
Also, don't forget that using hyperlinks to express static relationships between entities is just the beginning of the journey. Using Hypermedia As The Engine Of Application State is the real Nirvana... but then you might be aiming too high.

A non-specific query using MongoDB

Yesterday I downloaded a database consisting of tweets posted during the games of the Confederations Cup, and I saved it in MongoDB. My data model in the database looks like the following JSON:
{ "_id" : ObjectId("51bc9036194069119ff88c10"),
"text" : "Adianta AGORA ir para ruas e protesta, como em Brasilia e outras capitais contra a copa das confederações e do... http://t.co/41e4GGoe4o",
"created_at" : "2013-06-15 16:03:02",
"id" : NumberLong("345934481724669954"),
"user" : { "image" : "http://a0.twimg.com/profile_images/3425436378/57edc83f19d834283351a3729595d480_normal.jpeg",
"screen_name" : "Fernando_Fontes",
"id" : 54433693,
"name" : "Fernando Fontes"
}
}
Now I wish to retrieve tweets using a term from the field 'text', for example 'Brasilia'. But I couldn't write a query that searches for part of the text. I'm just getting started with NoSQL and Big Data. Is there any way to find the documents that have a given word inside the field 'text'?
Thanks a lot in advance,
Thiago
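One common way to match part of a string field is a regular expression query with the $regex operator. The sketch below is only an illustration: the collection name tweets and the helper names are assumptions, and the shell call is left as a comment because it needs a running server.

```javascript
// Escape characters that have a special meaning in regular expressions,
// so the search term is matched literally.
function escapeRegex(term) {
  return term.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: filter for documents whose `text` field contains
// the term, ignoring case.
function containsFilter(term) {
  return { text: { $regex: escapeRegex(term), $options: 'i' } };
}

// In the shell, assuming the collection is called tweets:
// db.tweets.find(containsFilter('Brasilia'));
```

A regular expression that is not anchored to the start of the string generally cannot use an index efficiently, so on a large collection a text index queried with $text may be worth considering instead.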

In MongoDB, how does one get the value of a field in an embedded document, but query based on a different value

I have a basic structure like this:
> db.users.findOne()
{
"_id" : ObjectId("4f384903cd087c6f720066d7"),
"current_sign_in_at" : ISODate("2012-02-12T23:19:31Z"),
"current_sign_in_ip" : "127.0.0.1",
"email" : "something#gmail.com",
"encrypted_password" : "$2a$10$fu9B3M/.Gmi8qe7pXtVCPu94mBVC.gn5DzmQXH.g5snHT4AJSZYCu",
"last_sign_in_at" : ISODate("2012-02-12T23:19:31Z"),
"last_sign_in_ip" : "127.0.0.1",
"name" : "Trip Jameson",
"sign_in_count" : 100,
"usertimes" : [
...thousands and thousands of records like this one....
{
"enddate" : 348268392.115282,
"idle" : 0,
"startdate" : 348268382.116728,
"title" : "My Awesome Title"
},
]
}
So I want to find only the usertimes for a single user where the title was "My Awesome Title", and then I want to see what the value of "idle" was in those record(s).
So far all I can figure out is that I can find the entire user record with a search like:
> db.users.find({'usertimes.title':"My Awesome Title"})
This just returns the entire User record though, which is useless for my purposes. Am I misunderstanding something?
Returning only part of an embedded document is currently not supported by MongoDB.
The whole matching User record will always be returned (at least with the current MongoDB version).
See this question for a similar case:
Filtering embedded documents in MongoDB
This is the corresponding Jira issue in the MongoDB tracker:
http://jira.mongodb.org/browse/SERVER-142
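Use a projection on the embedded field. A top-level 'idle' projection returns nothing useful, because idle lives inside usertimes. With the positional operator (MongoDB 2.2 and later), only the first matching usertimes entry is returned:
db.users.find({'usertimes.title': "My Awesome Title"}, {'usertimes.$': 1});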
May I suggest you take a more detailed look at http://www.mongodb.org/display/DOCS/Querying, it'll explain things for you.
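To get the idle value of every matching usertimes entry rather than the whole user record, MongoDB versions that include the aggregation framework (2.2 and later) allow unwinding the array and filtering its elements. This is a minimal sketch: the helper name idleByTitlePipeline is hypothetical, and the shell call is left as a comment because it needs a running server.

```javascript
// Hypothetical helper: pipeline returning the idle value of every usertimes
// entry with the given title, one result document per matching entry.
function idleByTitlePipeline(title) {
  return [
    { $match: { 'usertimes.title': title } },  // keep only users with a match
    { $unwind: '$usertimes' },                 // one document per array element
    { $match: { 'usertimes.title': title } },  // drop non-matching elements
    { $project: { _id: 0, idle: '$usertimes.idle' } }
  ];
}

// In the shell:
// db.users.aggregate(idleByTitlePipeline('My Awesome Title'));
```

To restrict this to a single user, a condition on _id or email could be added to the first $match stage.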