I am looking to implement a tag search feature and was looking for some advice in terms of efficiency. I am new to MongoDB so I am unsure of best practices for performance.
Okay, so I want to create a link sharing app in which users tag links based on their content. For instance, a funny dog image would be tagged with "funny" and "dog". A link would have a:
title,
url,
user_id,
tags: array of tags
Now in order for me to allow users to search for links I need a list of all the tags used. For usability this needs to have auto-complete functionality. So I researched a bit and tested out using a collection of tags where I index the tag value e.g. "funny" and then use a regex.
db.tags.find({value:/^search/})
With a collection of 600,000 documents it searched for all documents beginning with "s" in 63 milliseconds. As the length of the search term increases the execution time decreases.
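A minimal sketch of how the prefix query above might be built from user input, assuming the search box passes a raw string. The helper names `escapeRegex` and `prefixFilter` are made up for illustration. The input is escaped so that characters such as "+" or "." are matched literally, and the pattern stays case-sensitive and anchored with ^, which is the form MongoDB can serve from the index on value.

```javascript
// Escape regex metacharacters so the typed text is matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a filter for tags whose `value` starts with the typed prefix.
// The pattern is anchored with ^ and has no flags (case-sensitive).
function prefixFilter(typed) {
  return { value: new RegExp("^" + escapeRegex(typed)) };
}

const filter = prefixFilter("c++");
console.log(filter.value.test("c++11")); // prefix match on the literal text
console.log(filter.value.test("abc++")); // not a prefix

// In the shell this might be used as:
//   db.tags.find(prefixFilter("fun")).limit(10)
```

Adding a limit keeps short prefixes such as "s" from returning every matching tag to the auto-complete widget.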
Now comes the part I'm unsure of. Say, for instance, I want to find all the links which have both the tags "funny" and "dog" (an intersection). How should I store the tags? Should I store the object id of each tag? Can I index these object ids? Is there another way to structure the whole database?
Also, I'd like to be able to suggest tags based on the tags the user has already entered. I was thinking of just having a related field in the tag document, for instance:
tag
----
id
value
related: [{
tag_id
count
}]
(again unsure as it would suggest tags that could be related to one of the already entered tags and not to another. With an intersect this would return no results.)
Any advice would be much appreciated.
Create a text index on the tag array. This will enable you to search quickly for funny, dog, and funny or dog.
https://docs.mongodb.com/manual/core/index-text/
db.tags.createIndex( { tags: "text" }, {background:true} )
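For the "funny" and "dog" intersection asked about in the question, one alternative to text search is the $all operator on the tags array, since a text search for several terms matches documents containing any of them. The following is only a sketch: the collection name `links`, the helper names and the sample documents are assumptions, and `hasAllTags` is a plain JavaScript stand-in for what $all checks, not MongoDB code.

```javascript
// Build a filter matching links whose `tags` array contains every given tag.
function allTagsFilter(tags) {
  return { tags: { $all: tags } };
}

// Plain-JavaScript stand-in for the $all check, for illustration only.
function hasAllTags(link, tags) {
  return tags.every((tag) => link.tags.includes(tag));
}

// Made-up sample documents.
const links = [
  { title: "Dog falls off sofa", tags: ["funny", "dog"] },
  { title: "Cat meme", tags: ["funny", "cat"] },
  { title: "Dog training tips", tags: ["dog"] },
];

const wanted = ["funny", "dog"];
const hits = links.filter((link) => hasAllTags(link, wanted)).map((link) => link.title);
console.log(JSON.stringify(allTagsFilter(wanted)), hits);

// In the shell (collection name `links` is assumed):
//   db.links.createIndex({ tags: 1 })   // multikey index over the array
//   db.links.find(allTagsFilter(["funny", "dog"]))
```

Storing the tag strings directly in the array keeps this query to a single collection; storing object ids instead would work the same way with $all, at the cost of an extra lookup to display the tag names.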
As to the related tags, I don't think that you want to reference the _id values. You can probably embed an array of related tags such as:
relatedTags: [{tag1}, {tag2}]
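One way the related tags and their counts could be derived is by counting how often two tags appear on the same link. This is a sketch under that assumption, run in plain JavaScript over made-up sample data; `relatedTagCounts` and `topRelated` are hypothetical helper names, and in practice the counts would probably be maintained incrementally or by a periodic job.

```javascript
// Count, for every tag, how often each other tag appears on the same link.
function relatedTagCounts(links) {
  const related = new Map(); // tag -> Map(otherTag -> count)
  for (const link of links) {
    const tags = [...new Set(link.tags)]; // ignore duplicate tags on one link
    for (const a of tags) {
      for (const b of tags) {
        if (a === b) continue;
        if (!related.has(a)) related.set(a, new Map());
        const counts = related.get(a);
        counts.set(b, (counts.get(b) || 0) + 1);
      }
    }
  }
  return related;
}

// Return the most frequent companions of `tag`, highest count first.
function topRelated(related, tag, limit) {
  const counts = related.get(tag) || new Map();
  return [...counts.entries()]
    .sort((x, y) => y[1] - x[1])
    .slice(0, limit)
    .map(([value, count]) => ({ value, count }));
}

// Made-up sample data.
const sampleLinks = [
  { tags: ["funny", "dog"] },
  { tags: ["funny", "dog", "puppy"] },
  { tags: ["funny", "cat"] },
];
const relatedCounts = relatedTagCounts(sampleLinks);
console.log(topRelated(relatedCounts, "funny", 2));
```

To suggest tags for several entered tags at once, the count maps of the entered tags could be combined, for example by keeping only candidates that appear in every map.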
So let's say I have the collection posts in the format:
_id
user
title
body
tags
likes
shares
date
I have a text index on title, body, tags, then just regular indexes on all the other fields.
What I want to achieve is for the user to be able to search the text fields and have the results sorted by likes, shares or date (likes being the number of likes and shares the number of shares, in case that was ambiguous).
Now, currently, find and sort are very fast on any field - surprisingly, the text fields can even be queried faster than the numeric fields - with something like db.posts.find({$text: {$search: "computer technology"}}).limit(20) returning in 0ms. Likewise, if I want to sort based on the likes field with db.posts.find().sort({likes: 1}).limit(20), it will also return in 0ms.
The problem is, however, that if I want a query that both finds and sorts on these fields, such as db.posts.find({$text: {$search: "computer technology"}}).sort({likes: 1}).limit(20), then the query takes 8-9s to complete.
With this in mind, I was curious to see whether adding a compound index like db.posts.createIndex({likes: 1, title: "text", body: "text", tags: "text"},{name: "ltbt"}) would help. Of course I realised this would be inefficient from a storage point of view, but I received the error message "only one text index per collection allowed, found existing text index 'tbt'" anyway, which I kind of expected might be the case. Likewise, this approach wouldn't really be viable, since even if you could have multiple text indexes, you'd need a compound index of the text fields with each of the numeric fields you'd want to sort by.
So I'm just curious if I have missed something really obvious here or, if not, is there at least some way to improve performance?
I'll have a reactjs webapp with a nodejs server. Besides regular things like user profiles, I'll need the most effective and scalable solution for the purposes listed below. I understand that I'll need to rewrite some parts again and again over time, but the DB choice is a fundamental thing, so I hope I'll select the right one.
Full-text search. I'll have this json structure:
items: {
[guid]: { // txt
parent: [guid],
path: '/root/dir/subdir/',
created: timestamp,
updated: timestamp,
access: 'rwa', // rwa/rw-/r--/---
owner: user_name,
title: 'string',
text: 'string',
comments: {...}
},
},
Items will potentially contain millions of records. Each record's text property may contain anything from a few words up to, say, 100k characters. Users will be slightly updating and growing these records all the time.
I'll need to perform searches based on the title and text properties. I'll have to use the path property to search among items with a specific path. For example: "find the first 20 records where title or text contains some words and path begins with /root/dir, sorted by the title/path/created/updated property"
Viewed-like statuses. The next thing I need is to mark each viewed page as "viewed" for a user in order to gray out a url to it. Apparently, a user might go through thousands of pages each time and there might be a lot of users. I guess "bloom filters" might help me with that, but I have no idea, YET, how I'll implement that.
News feeds. Well, that's nothing unusual, a regular news feed with subscriptions, trending items and recommendations. I like the solution described here.
Thank you for help!
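For the viewed-status idea mentioned above, here is a minimal Bloom filter sketch, assuming one filter per user keyed by item guid. The class name, the sizes and the hashing scheme (an FNV-1a-style hash with double hashing) are illustrative choices, not a recommendation. A Bloom filter can report an unviewed page as viewed (a false positive) but never the reverse, which may be acceptable for graying out links.

```javascript
class BloomFilter {
  constructor(bitCount, hashCount) {
    this.bitCount = bitCount;   // number of bits in the filter
    this.hashCount = hashCount; // number of bit positions set per key
    this.bits = new Uint8Array(Math.ceil(bitCount / 8));
  }

  // FNV-1a-style 32-bit hash over UTF-16 code units, varied by a seed.
  static hash(text, seed) {
    let h = (0x811c9dc5 ^ seed) >>> 0;
    for (let i = 0; i < text.length; i++) {
      h ^= text.charCodeAt(i);
      h = Math.imul(h, 0x01000193) >>> 0;
    }
    return h;
  }

  // Derive hashCount bit positions from two base hashes (double hashing).
  positions(key) {
    const h1 = BloomFilter.hash(key, 0);
    const h2 = (BloomFilter.hash(key, 0x9e3779b9) | 1) >>> 0; // odd step size
    const out = [];
    for (let i = 0; i < this.hashCount; i++) {
      out.push((h1 + i * h2) % this.bitCount);
    }
    return out;
  }

  add(key) {
    for (const p of this.positions(key)) {
      this.bits[p >> 3] |= 1 << (p & 7);
    }
  }

  // true means "possibly viewed", false means "definitely not viewed".
  mightContain(key) {
    return this.positions(key).every((p) => (this.bits[p >> 3] & (1 << (p & 7))) !== 0);
  }
}

// Hypothetical usage: about 1 KB of bits for one user's viewed pages.
const viewed = new BloomFilter(8192, 4);
viewed.add("guid-123");
console.log(viewed.mightContain("guid-123"));
```

The bit array is a fixed-size byte buffer, so it could be stored alongside the user record in whichever database is chosen; the filter size and hash count would need tuning to the expected number of viewed pages.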
I'm toying with the GitHub search API (v3) and can't seem to find a description of the fields that are returned. Most of them are obvious, but there are a few, like score, that aren't. Does anyone know what score means, and does a field reference exist?
The score attribute is the search score of that document for a particular query, and is used for Best Match sorting. In other words, it's used for ranking search results, but it isn't shown in search results on github.com.
I am using Solr to index documents from a wiki. Each document has a unique id, title, body content and some other fields.
Firstly, I wanted to know the declaration in the Solr schema to store a multivalued field "tags" that holds n of these strings, attached to the document. Each document can have a set of tags applied to it.
Second, the use case: how exactly would I retrieve all distinct tags across the entire Solr instance with their number of occurrences, so that I can build a tag cloud of the most popular tags?
thanks
Amit
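A hedged sketch of one common approach to the tag cloud question above: declare the field as multivalued in the schema, for example `<field name="tags" type="string" indexed="true" stored="true" multiValued="true"/>`, and ask Solr for facet counts on it. The base URL, the core name `wiki` and the helper names below are assumptions; the parser assumes the default JSON layout in which facet counts come back as a flat array alternating term and count.

```javascript
// Build a facet request that returns tag counts and no documents.
function tagFacetUrl(baseUrl, limit) {
  const params = new URLSearchParams({
    q: "*:*",                 // match every document
    rows: "0",                // only the facet counts are needed
    facet: "true",
    "facet.field": "tags",
    "facet.limit": String(limit),
    "facet.mincount": "1",    // skip tags that no document uses
    wt: "json",
  });
  return `${baseUrl}/select?${params.toString()}`;
}

// Convert [term, count, term, count, ...] into [{ tag, count }, ...].
function parseFlatFacet(flat) {
  const out = [];
  for (let i = 0; i + 1 < flat.length; i += 2) {
    out.push({ tag: flat[i], count: flat[i + 1] });
  }
  return out;
}

// The host and core name are placeholders; the counts are made-up sample data.
console.log(tagFacetUrl("http://localhost:8983/solr/wiki", 50));
console.log(parseFlatFacet(["solr", 12, "search", 7]));
```

In the response, the counts would be read from `facet_counts.facet_fields.tags` and passed to `parseFlatFacet` before sizing the words in the cloud.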
I've got two document sets:
Wikis and WikiTags. Since I want flexible editing of tag names, I don't want to embed the tag itself into the wiki document. So, I store a list of wiki_tag_ids inside the wiki document.
I wonder what the best way is to find related tags using this schema. By related tags I mean tags that exist in other wikis tagged with the selected tags.
Maybe I should store related tags in the WikiTag document?
I suggest you store the WikiTag inside the Wiki document. MongoDB makes it easy to update or delete a single document in a nested collection, which gives you the 'flexible editing of tag names' you want.
Collection like this:
wikis
{
_id,
wikiTags: [{_id, name, ...}],
...
}
So, for example, if you want to update the name of the nested WikiTag with id = SomeTagId, you can run:
db.wikis.update( {'wikiTags._id':SomeTagId},
{$set:{'wikiTags.$.name':"New Tag Name"}},
false, // upsert: do not insert a new wiki if nothing matches
true ) // multi: update every wiki that contains the tag
If you want to delete an item from the nested array, you should use $pull;
to add a new item, use $push or $addToSet.
So, I guess now you can see that any operation on a nested array can be done easily. And if performance is an issue, use embedding.
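The three kinds of nested-array operations can be sketched as small helpers that build the filter and update documents. This assumes the embedded tags are stored as an array of {_id, name} subdocuments in wikiTags; the helper names are made up, and $pull is used for removal because it deletes the matching element from the array.

```javascript
// Rename an embedded tag; the positional $ targets the matched array element.
function renameTag(tagId, newName) {
  return {
    filter: { "wikiTags._id": tagId },
    update: { $set: { "wikiTags.$.name": newName } },
  };
}

// Remove an embedded tag from every wiki that carries it.
function removeTag(tagId) {
  return {
    filter: { "wikiTags._id": tagId },
    update: { $pull: { wikiTags: { _id: tagId } } },
  };
}

// Add a tag to one wiki; $addToSet skips it if an identical subdocument exists.
function addTag(wikiId, tag) {
  return {
    filter: { _id: wikiId },
    update: { $addToSet: { wikiTags: tag } },
  };
}

const op = renameTag("SomeTagId", "New Tag Name");
console.log(JSON.stringify(op));

// In the shell these might be applied as:
//   db.wikis.updateMany(op.filter, op.update)
```

Note that the positional $ operator updates only the first matching array element in each document, which is enough as long as a tag appears at most once per wiki.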
Hope this helps.