MongoDB Schema Design for language database

I need some advice on MongoDB schema design for a natural language database.
For each language I need to store texts and words, like this:
lang: {
    _id: "English",
    texts: [
        { text: "This is a first text", date: Date("2011-09-19T04:00:10.112Z"), tag: "test1" },
        { text: "Second One", date: Date("2011-09-19T04:00:10.112Z"), tag: "test2" }
    ],
    words: [
        { word: "This" },
        { word: "is" },
        { word: "a" },
        { word: "first" },
        { word: "text" },
        { word: "second" },
        { word: "one" }
    ]
}
Then I need to know which words and texts each user has associated. The number of words/texts tends to be huge, and I need to list all the words in a language as well as all the words a user has associated for that language.
My thought is that storing the user ids associated with a given word in an array on the word might be a good approach, like:
lang: {
    _id: "English",
    texts: [
        ...
    ],
    words: [
        { word: "This", users: [user1, user2, user3] },
        { word: "is", users: [user1, user2] },
        ...
    ]
}
Bearing in mind that a word can be associated with hundreds of thousands of users, that the document size limit (as I read) is 4MB, and that I need to:
List all words for a given user and language
Is this a good approach? Or can you think of a better one?
Hope this question is clear enough and that someone can help me out with this ;)
Thank you all!

I don't think this is a good approach, for just the reason you mention: the document size limit. It looks like with your approach, you are definitely going to run up against the limit. I would go for a flatter approach (which should also make your collection easier to query). Something like this:
[
    {
        user: "user1",
        word: "This",
        lang: "en"
    },
    {
        user: "user1",
        word: "is",
        lang: "en"
    },
    // et cetera...
]
In other words, grow vertically by adding documents rather than horizontally by adding more data to one document. You can query the words for a given user with db.words.find({ user: "user1", lang: "en" }) (assuming here that the collection is called words).
This approach isn't "normalized", of course, so if you're concerned about space then you might want to create a separate collection for users, words, and languages and reference them in the main collection by an ID. But since there are no join queries in MongoDB, you have to weigh query performance against space efficiency.
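A small sketch of how that flat collection could be indexed and queried, using the same assumed words collection name as above:
// Compound index to support the per-user, per-language lookups.
db.words.createIndex({ user: 1, lang: 1 })

// All word documents a given user has associated for a language:
db.words.find({ user: "user1", lang: "en" })

// Just the distinct word strings for that user and language:
db.words.distinct("word", { user: "user1", lang: "en" })

// All words known for a language, across all users:
db.words.distinct("word", { lang: "en" })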

dbaseman is correct (and upvoted), but a couple of other points:
First, the document limit is now 16MB (Max Document Size) as of this writing, assuming you are running a recent version of MongoDB.
Second, unbounded growth is generally a bad idea in MongoDB: this type of document size expansion can force MongoDB to move the document on disk if it outgrows the space currently allocated to it. You can read more about this in the Padding Factor section of the documentation.
Those types of moves are relatively expensive, especially if they happen frequently. Therefore, if you do go with this type of design, limiting the size (essentially bounding that growth) of the comments equivalent in your main collection (most recent X, most popular X, etc.) and perhaps even pre-populating that document field (essentially manual padding) to beyond the average size will reduce the moves caused by additions and changes.
This is the reason why tip #6 in the MongoDB Developers tips and tricks book from O'Reilly is:
Tip #6: Do not embed fields that have unbound growth
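As a rough sketch of bounding that growth (not from the original answers; the collection name langs and the update below are assumed for illustration), $push combined with $each and $slice caps an embedded array at the N most recent entries:
// Keep only the 100 most recently added users on the matched word entry
// of the language document from the question.
db.langs.update(
    { _id: "English", "words.word": "This" },
    { $push: { "words.$.users": { $each: ["user4"], $slice: -100 } } }
)
The same pattern applies to the texts array, or to blog-style comment arrays.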

Related

search phrase or words in document with timestamped words

I've been trying to do this for some days, I guess it's time to ask for a little help.
I'm using Elasticsearch 6.6 (I believe it could be upgraded if needed) and NEST for C# on .NET 5.
The task is to create an index where the documents are the results of speech-to-text recognition, and every recognized word has a timestamp (so that the timestamp can be used to find where the word is spoken in the original file). There are 1000+ texts from media files, and every file is 4 hours long (which usually means 5000~15000 words).
The main idea was to split every text into 3-second segments, creating a document with the words in that time segment, and index it so that it can be searched.
I thought that would not work that well, so the next idea was to create a document for every window of 10~12 words, scanning the text and jumping ahead 2 words at a time, so that the search could at least match a decent phrase and highlight the hits too.
Since that's still far from perfect, I thought it would be nice to index every whole text as a document to maintain its coherency; the problem is the timestamp associated with every word. To keep this relationship I tried to use nested objects in the document:
PUT index-tapes-nested
{
  "mappings": {
    "_doc": {
      "properties": {
        "$type": { "type": "text" },
        "ContentId": { "type": "long" },
        "Inserted": { "type": "date" },
        "TrackId": { "type": "long" },
        "Words": {
          "type": "nested",
          "properties": {
            "StartMillisec": { "type": "integer" },
            "Word": { "type": "text" }
          }
        }
      }
    }
  }
}
This kinda works, but I don't know exactly how to write the query to search in the index.
A very basic query could be for example:
GET index-tapes-nested/_search
{
  "query": {
    "nested": {
      "path": "Words",
      "score_mode": "avg",
      "query": {
        "match": { "Words.Word": "a bunch of things" }
      },
      "inner_hits": {}
    }
  }
}
but something like that, especially with the avg scoring, gives low-quality results; the right document may be among the hits, but since the word order isn't taken into account, it's neither certain nor clear.
So as far as I understand it, span_near should come in handy in these situations, but I get no results:
GET index-tapes-nested/_search
{
  "query": {
    "nested": {
      "path": "Words",
      "score_mode": "avg",
      "query": {
        "span_near": {
          "clauses": [
            { "span_term": { "Words.Word": "bunch" } },
            { "span_term": { "Words.Word": "of" } },
            { "span_term": { "Words.Word": "things" } }
          ],
          "slop": 2,
          "in_order": true
        }
      }
    }
  }
}
I don't know much about Elasticsearch; maybe I should change the approach and the model, or maybe rewriting the query is enough. This is pretty time consuming, so any help is really appreciated (is this a fairly common task?). For the sake of brevity I'm cutting some stuff and some ideas; I'm available to give some data or other examples if needed.
I also had problems with the C# NEST client when managing the nested index, but that is another story.
This could be interpreted in a few ways, I guess: having something like an "alternative stream" for a field, or metadata for every word, and so on. What I needed was this: https://github.com/elastic/elasticsearch/issues/5736 but it's not done yet, so for now I think I'll go with the annotated_text plugin or the 10-word window.
I have no idea whether, in the case of indexing single words, there is a query that 'restores' the integrity of the original text (which means 1. grouping the words by an id and 2. ordering them) so that Elasticsearch can give the desired results.
I'll keep searching the docs for something interesting, or see if I can hack something together to get what I need (like require_field_match or the intervals query).

Why should I have multiple collections in a NoSql database (MongoDB)

Example:
If I have a database, let's call it world, then I can have a collection of countries.
The collection can have documents; one example of a document:
{
    "_id": "hiqgfuywefowehfnqfdqfe",
    "name": "Italy",
    "cities": [
        { "_id": "ihevfbwyfv", "name": "Napoli" },
        { "_id": "hjbyiu", "name": "Milano" }
    ]
}
Why, or when, should I create a new collection of documents? I could expand the documents inside my world collection instead of creating a new collection.
Is creating a new collection not like creating a new domain?
Separate collections offer the greatest querying flexibility. They are a good fit if you need to select individual documents, need more control over querying, or have huge documents, but they require more work to use. Selecting embedded documents is more limited: a document, including all its embedded documents and arrays, cannot exceed 16 MB. Embedded documents are easy and fast, and a good fit when you want the entire document, the document with a $slice of comments, or the document with no comments at all.
A world collection should look like this:
You should create one document per country.
The following is a single document:
{
    "_id": "objectKey", // generated by mongo
    "countryName": "Italy",
    "cities": [
        {
            "id": 1,
            "cityName": "Napoli"
        },
        {
            "id": 2,
            "cityName": "Milano"
        }
    ]
}
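As a side note, the $slice projection mentioned above would let you pull back a country with only part of its embedded cities array; a small sketch, assuming the collection is called world:
// Return Italy with only its first 2 embedded cities.
db.world.find(
    { "countryName": "Italy" },
    { "cities": { "$slice": 2 } }
)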
Try to extend this collection with other countries, but there are some situations where you should consider creating a new collection and some sort of relation:
If you use the mmapv1 storage engine and you wish to add cities over time, you should consider the relational way, because mmapv1 does not tolerate rapidly growing arrays. When you create a document with mmapv1, the storage engine gives the document some padding for growing arrays. If this padding fills up, the document has to be relocated on disk, which results in disk fragmentation and performance loss. By the way, the default storage engine is now WiredTiger, which does not have this problem.
Do not nest documents too deeply. It will make your documents hard to run complex queries on. For example, do not create documents like the following:
{
    "continentName": "Europe",
    "continentCountries": [
        {
            "countryName": "France",
            "cities": [
                {
                    "cityId": 1,
                    "cityName": "Paris"
                },
                {
                    another city
                }
            ]
        },
        {
            another country
        }
    ],
    "TimeZone": "GMT+1",
    "continentId": 1
}
In this situation you should instead give your country object a continent property holding the continent's name or id.
With MongoDB you should avoid relations most of the time, but it's not forbidden to create them. It's far better to create a relation than to nest documents too deeply, but the best solution is to flip the hierarchical level of the documents. For example:
use 'country -> cities, continentId, etc...' instead of 'continent -> country -> city'
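A rough sketch of what that flipped hierarchy could look like (the field names are illustrative, not from the answer):
{
    "_id": "objectKey",
    "countryName": "France",
    "continentId": 1, // reference to a separate continents collection
    "cities": [
        { "id": 1, "cityName": "Paris" },
        { "id": 2, "cityName": "Lyon" }
    ]
}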
useful reading: https://docs.mongodb.org/manual/core/data-modeling-introduction/

Mongo error 16996 during aggregation - too large document produced

I am parsing Wikipedia dumps in order to play with the link-oriented metadata. One of the collections is named articles and it is in the following form:
{
    _id: "Tree",
    id: "18955875",
    linksFrom: [
        {
            name: "Forest",
            count: 6
        },
        [...]
    ],
    categories: [
        "Trees",
        "Forest_ecology",
        [...]
    ]
}
The linksFrom field stores all articles this article points to, and how many times each one is linked. Next, I want to create another field linksTo with all the articles that point to this article. In the beginning I went through the whole collection and updated every article, but since there are lots of them it took too much time. I switched to aggregation for performance purposes and tried it on a smaller set - it works like a charm and is super fast compared with the older method. The aggregation pipeline is as follows:
db.runCommand(
    {
        aggregate: "articles",
        pipeline: [
            { $unwind: "$linksFrom" },
            { $sort: { "linksFrom.count": -1 } },
            {
                $project: {
                    name: "$_id",
                    linksFrom: "$linksFrom"
                }
            },
            {
                $group: {
                    _id: "$linksFrom.name",
                    linksTo: { $push: { name: "$name", count: { $sum: "$linksFrom.count" } } }
                }
            },
            { $out: "TEMPORARY" }
        ],
        allowDiskUse: true
    }
)
However, on a large dataset (the English Wikipedia) I get the following error after a few minutes:
{
"ok" : 0,
"errmsg" : "insert for $out failed: { connectionId: 24, err: \"BSONObj size: 24535193 (0x1766099) is invalid. Size must be between 0 and 16793600(16MB) First element: _id: \"United_States\"\", code: 10334, n: 0, ok: 1.0 }",
"code" : 16996
}
I understand that there are too many articles linking to the United_States article, so the corresponding document's size grows above 16MB, currently to almost 24MB. Unfortunately, I cannot even check if that's the case (error messages sometimes tend to lie)... Because of that, I'm trying to change the model so that the relationships between articles are stored with IDs rather than long names, but I'm afraid that might not be enough - especially because my plan is to merge the two collections for every article later...
The question is: does anyone have a better idea? I don't want to try to increase the limit, I'm rather thinking about a different approach of storing this data in the database.
UPDATE after comment by Markus
Markus is correct, I am using a SAX parser and, as a matter of fact, I'm already storing all the links in a similar way. Apart from articles I have three more collections - one with links and two others, labels and stemmed-labels. The first one stores all links that occur in the dump in the following way:
{
_id : "tree",
stemmedName: "tree",
targetArticle: "Christmas_tree"
}
_id stores the text that is used to represent a given link, stemmedName represents stemmed _id and targetArticle marks what article this text pointed to. I'm in the middle of adding sourceArticle to this one, because it's obviously a good idea.
The second collection labels contains documents as follows:
{
_id : "tree",
targetArticles: [
{
name: "Christmas_tree",
count: 1
},
{
name: "Tree",
count: 166
}
[...]
]
}
The third stemmed-labels is analogous to the labels with its _id being a stemmed version of the root label.
So far, the first collection links serves as a baseline for the two other collections. I group the labels together by their name so that I only do one lookup for every phrase and can then immediately get all target articles with one query. Then I use the articles and labels collections in order to:
Look for a label with a given name.
Get all articles it might point to.
Compare the incoming and outgoing links for these articles.
This is where the main question comes in. I thought it was better to store all possible articles for a given phrase in one document rather than leave them scattered across the links collection. Only now did it occur to me that, as long as the lookups are indexed, the overall performance might be the same for one big document as for many smaller ones! Is this a correct assumption?
I think your data model is wrong. It may well be (albeit a bit theoretical) that individual articles (let's stick with the Wikipedia example) are linked more often than you could store in a document. Embedding only works with One-To(-Very)-Few™ relationships.
So basically, I think you should change your model. I will show you how I would do it.
I will use the mongo shell and JavaScript in this example, since it is the lingua franca. You might need to translate accordingly.
The questions
Let's begin with the questions you want to have answered:
For a given article, which other articles link to that article?
For a given article, to which other articles does that article link?
For a given article, how many articles link to it?
Optional: For a given article, how many articles does it link to?
The crawling
What I would do basically is to implement a SAX parser on the articles, creating a new document for each article link you encounter. The document itself should be rather simple:
{
"_id": new ObjectId(),
// optional, for recrawling or pointing out a given state
"date": new ISODate(),
"article": wikiUrl,
"linksTo": otherWikiUrl
}
Note that you should not do an insert, but an upsert. The reason for this is that we do not want to record the number of links, but the articles linked to. If we did an insert, the same combination of article and linksTo could occur multiple times.
So our statement when encountering a link would look like this for example:
db.links.update(
{ "article":"HMS_Warrior_(1860)", "linksTo":"Royal_Navy" },
{ "date": new ISODate(), "article":"HMS_Warrior_(1860)", "linksTo":"Royal_Navy" },
{ upsert:true }
)
Answering the questions
As you might already guess, answering the questions becomes pretty straightforward now. I have used the following statements to create a few documents:
db.links.update(
{ "article":"HMS_Warrior_(1860)", "linksTo":"Royal_Navy" },
{ "date": new ISODate(), "article":"HMS_Warrior_(1860)", "linksTo":"Royal_Navy" },
{ upsert:true }
)
db.links.update(
{ "article":"Royal_Navy", "linksTo":"Mutiny_on_the_Bounty" },
{ "date":new ISODate(), "article":"Royal_Navy", "linksTo":"Mutiny_on_the_Bounty" },
{ upsert:true }
)
db.links.update(
{ "article":"Mutiny_on_the_Bounty", "linksTo":"Royal_Navy"},
{ "date":new ISODate(), "article":"Mutiny_on_the_Bounty", "linksTo":"Royal_Navy" },
{ upsert:true }
)
For a given article, which other articles link to that article?
We found out that we should not use an aggregation, since that might exceed the size limit. But we don't have to. We simply use a cursor and gather the results:
var toLinks =[]
var cursor = db.links.find({"linksTo":"Royal_Navy"},{"_id":0,"article":1})
cursor.forEach(
function(doc){
toLinks.push(doc.article);
}
)
printjson(toLinks)
// Output: [ "HMS_Warrior_(1860)", "Mutiny_on_the_Bounty" ]
For a given article, to which other articles does that article link to?
This works pretty much like the first question – we basically only change the query:
var fromLinks = []
var cursor = db.links.find({"article":"Royal_Navy"},{"_id":0,"linksTo":1})
cursor.forEach(
function(doc){
fromLinks.push(doc.linksTo)
}
)
printjson(fromLinks)
// Output: [ "Mutiny_on_the_Bounty" ]
For a given article, how many articles link to it?
It should be obvious that if you have already answered question 1, you could simply check toLinks.length. But let's assume you haven't. There are two other ways of doing this:
Using .count()
You can use this method on replica sets. On sharded clusters, this doesn't work well. But it is easy:
db.links.find({ "linksTo":"Royal_Navy" }).count()
// Output: 2
Using an aggregation
This works on any environment and isn't much more complicated:
db.links.aggregate([
{ "$match":{ "linksTo":"Royal_Navy" }},
{ "$group":{ "_id":"$linksTo", "isLinkedFrom":{ "$sum":1 }}}
])
// Output: { "_id" : "Royal_Navy", "isLinkedFrom" : 2 }
Optional: For a given article, to how many articles does it link to?
Again, you can answer this question by reading the length of the array from question 2 or by using the .count() method. The aggregation is again simple:
db.links.aggregate([
{ "$match":{ "article":"Royal_Navy" }},
{ "$group":{ "_id":"$article", "linksTo":{ "$sum":1 }}}
])
// Output: { "_id" : "Royal_Navy", "linksTo" : 1 }
Indices
As for the indices, I haven't really checked them, but individual indexes on the fields are probably what you want:
db.links.createIndex({"article":1})
db.links.createIndex({"linksTo":1})
A compound index will not help much, since order matters and we do not always ask for the first field. So this is probably as optimized as it can get.
Conclusion
We are using an extremely simple, scalable model and rather simple queries and aggregations to answer the questions you have about the data.

How can Mongo store infinitely long comments in a blog post example

I am looking to build a blogging system and came across the following blog post.
http://blog.mongolab.com/2012/08/why-is-mongodb-wildly-popular/
While it's nice to see how we can store everything in one Mongo document as a JSON-type object (example JSON from the blog pasted below) rather than distributing data across multiple tables, I'm having trouble understanding how this can accommodate a hypothetically super long comment thread.
{
    _id: 1234,
    author: { name: "Bob Davis", email: "bob@bob.com" },
    post: "In these troubled times I like to …",
    date: { $date: "2010-07-12 13:23UTC" },
    location: [ -121.2322, 42.1223222 ],
    rating: 2.2,
    comments: [
        { user: "jgs32@hotmail.com",
          upVotes: 22,
          downVotes: 14,
          text: "Great point! I agree" },
        { user: "holly.davidson@gmail.com",
          upVotes: 421,
          downVotes: 22,
          text: "You are a moron" }
    ],
    tags: [ "Politics", "Virginia" ]
}
Aside from the comments key, which is represented as an array of comment objects (allowing us to store an endless number of comments within this document rather than in a separate comments table that would require a join in a relational database), the rest of the fields (i.e. author, post, date, location, rating, tags) could all be columns on a relational database table as well.
Since there is a limit of 16MB per document, what happens when this blog attracts a lot of comments?
Also, why can't I store a JSON object in a relational database column? After all, it's just text, isn't it?
First, a clarification: MongoDB actually stores BSON, which is essentially a superset of JSON that supports more data types.
Since there is a limit of 16MB per document, what happens when this blog attracts a lot of comments?
You won't be able to grow the document past 16MB, so you'll lose the ability to add more comments. But you don't need to store all the comments on the blog post document. You could store the first N, then retire old comments to a comments collection as new ones are added. You could store comments in another collection with a parent reference. The way comments are stored should match how you expect them to be used. 16MB of comments would really be a lot - you might even want a special solution for the occasional post that gets that kind of activity, an approach that's totally different from the normal way of handling comments.
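One hedged sketch of the "comments in another collection with a parent reference" idea (the comments collection and its field names are mine, not from the answer):
// Each comment is its own document pointing back at the blog post it belongs to.
db.comments.insert({
    post_id: 1234,              // _id of the blog post document
    user: "jgs32@hotmail.com",
    upVotes: 22,
    downVotes: 14,
    text: "Great point! I agree",
    created: new Date()
})

// Index the parent reference, then page through a post's comments:
db.comments.createIndex({ post_id: 1, created: -1 })
db.comments.find({ post_id: 1234 }).sort({ created: -1 }).limit(20)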
We can store json in a relational database. So what is the value of Mongo I'm getting?
Here are two ways of storing JSON (in MongoDB).
> db.test.drop()
> db.test.insert({ "name" : { "first" : "Yogi", "last" : "Bear" }, "location" : "Yellowstone", "likes" : ["picnic baskets", "PBJ", "the great outdoors"] })
> db.test.findOne()
{
"_id" : ObjectId("54f9f41f245e945635f2137b"),
"name" : {
"first" : "Yogi",
"last" : "Bear"
},
"location" : "Yellowstone",
"likes" : [
"picnic baskets",
"PBJ",
"the great outdoors"
]
}
var jsonstring = '{ "name" : { "first" : "Yogi", "last" : "Bear" }, "location" : "Yellowstone", "likes" : ["picnic baskets", "PBJ", "the great outdoors"] }'
> db.test.drop()
> db.test2.insert({ "myjson" : jsonstring })
> db.test2.findOne()
{
"_id" : ObjectId("54f9f535245e945635f2137d"),
"myjson" : "{ \"name\" : { \"first\" : \"Yogi\", \"last\" : \"Bear\" }, \"location\" : \"Yellowstone\", \"likes\" : [\"picnic baskets\", \"PBJ\", \"the great outdoors\"] }"
}
Can you store and use JSON the first way using a relational database? How useful is JSON stored in the second way compared to the first?
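To make the difference concrete, here is a small illustration (the extra queries are mine): the first form can be queried and updated field by field, while the second is just an opaque string to the server:
// Against the first form: match on a nested field, push into the array.
db.test.find({ "name.first": "Yogi" })
db.test.update({ "name.first": "Yogi" }, { $push: { likes: "honey" } })

// Against the second form the server only sees one long string,
// so the best you can do is an unindexed regex/substring scan.
db.test2.find({ "myjson": /Yogi/ })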
There are lots of other differences between MongoDB and relational databases that make one better than the other for various use cases - but going further into that is too broad for an SO answer.
Can you store and use JSON the first way using a relational database?
How useful is JSON stored in the second way compared to the first?
Sorry, are you suggesting that with Mongo, JSON documents can be stored without using escape characters, whereas with an RDBMS I must use escape characters to escape the double quotes? I wasn't aware that was the case.

Search full document in mongodb for a match

Is there a way to match a value against every array and sub-document inside a document in a MongoDB collection and return the document?
{
"_id" : "2000001956",
"trimline1" : "abc",
"trimline2" : "xyz",
"subtitle" : "www",
"image" : {
"large" : 0,
"small" : 0,
"tiled" : 0,
"cropped" : false
},
"Kytrr" : {
"count" : 0,
"assigned" : 0
}
}
For example, if in the above document I am searching for "xyz" or "ab" or "xy" or "z" or "0", this document should be returned.
I actually have to achieve this on the back end using the C# driver, but a Mongo query would also help greatly.
Please advise.
Thanks
You could probably do this using $where:
db.mycollection.find({ $where: "JSON.stringify(this).indexOf('xyz') != -1" })
I'm converting the whole record to one big string and then searching to see if your element is in the resulting string. This probably won't work if your 'xyz' is in the field names!
You could make it iterate through the fields to build the big string and then search it, though.
This isn't the most elegant way and it involves a full collection scan. It will be faster if you look through the individual fields!
While Malcolm's answer above would work, when your collection gets large or you have high traffic you'll see this fall over pretty quickly. This is because of two things: first, dropping down to JavaScript is expensive, and second, this will always be a full collection scan because $where can't use an index.
MongoDB 2.6 introduced text indexing, which is enabled by default (it was in beta in 2.4). With it, you can have a full-text index across all the fields in a document. The documentation gives the following example, where a text index is created for every field and the index is named "TextIndex":
db.collection.ensureIndex(
{ "$**": "text" },
{ name: "TextIndex" }
)
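Once that wildcard text index exists, a query against it would look roughly like this (note that $text matches whole terms, so it will find "xyz" but not substrings like "xy"):
// Search every indexed string field for the term "xyz".
db.collection.find({ $text: { $search: "xyz" } })

// Optionally project and sort by relevance score.
db.collection.find(
    { $text: { $search: "xyz" } },
    { score: { $meta: "textScore" } }
).sort({ score: { $meta: "textScore" } })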