Proper tree NoSQL structure with focus on full-text searching

I am developing an app with a tree (folder-file) structure, on which I need to perform full-text searches with MongoDB. I did some research on best practices for tree structures and found this great article, but I still cannot decide which DB structure will fit my needs.
I have the following requirements in my mind:
I should be able to perform full-text search on individual folders, as well as everything from specific users
The folders/files should be shareable, so I need to be able to perform full-text search on all items accessible by specific user
I've been thinking about the following structures.
Structure 1
Fields of Users collection
1. _id - objectid
2. name - string
Fields of Folders collection
1. _id - objectid
2. name - string
3. owner - objectid
4. sharedWith - array of objectIds
5. location - objectid of parent folder, null if in root
6. createDate - datetime
Fields of File collection
1. _id - objectid
2. name - string
3. owner - objectid
4. sharedWith - array of objectIds
5. data - string
6. location - objectId of folder
7. createDate - datetime
So here are my questions:
1. Should I model the tree structure with Parent References or Child References?
2. Should I use one collection for both files and folders (with a type field), or should I separate them?
3. Is it worth having only a folders collection and nesting documents in it?
These were my most important questions, though I would greatly appreciate any advice on how I can improve the structure. I'm sorry if this isn't the right place to ask such questions.

It depends a bit on how far you want/need the setup to scale. For small numbers of files, folders, and files per folder it doesn't matter too much. That said,
1. I'd use references from children to parents. Parents (folders) may have hundreds or thousands of children (files and folders). This might be ok to store as references in one folder document, but most likely in that case you would want to index the array to support fast queries like "is file x in folder y?", and the array would be frequently changing. A large, frequently changing, indexed array is a recipe for bad performance in MongoDB. If you have only a couple hundred or so children per folder, you might be able to get away with storing references to all children in the parent, as long as you don't rely on that array being indexed for your queries. This essentially means you'd put a reference from the children to the parents to support the same queries.
2. I'd use one collection, since you want to return both files and folders in response to many queries. Add a field to identify folders, like folder: true, or something similar.
3. No, it won't work to have folder documents with many nested layers. MongoDB in general doesn't support recursive or arbitrary-depth operations, making it difficult to work with such structures.
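As a rough illustration of the parent-reference layout described above, here is a minimal sketch in plain JavaScript. It uses an in-memory array in place of a real collection; the collection name `items`, the sample documents, and the helper functions are all assumptions made for the example, and the substring match only stands in for a real full-text query.

```javascript
// In-memory stand-in for a single "items" collection (hypothetical name).
// Each document points at its parent through `location`; null means root.
const items = [
  { _id: 1, name: "docs", folder: true, location: null },
  { _id: 2, name: "reports", folder: true, location: 1 },
  { _id: 3, name: "q1.txt", folder: false, location: 2, data: "first quarter sales" },
  { _id: 4, name: "notes.txt", folder: false, location: 1, data: "sales meeting notes" },
  { _id: 5, name: "other.txt", folder: false, location: null, data: "sales secrets" },
];

// Direct children of a folder. Against MongoDB this would be roughly a
// find on { location: folderId }, supported by an index on `location`.
function childrenOf(folderId) {
  return items.filter((item) => item.location === folderId);
}

// All descendants, collected level by level: one lookup per folder visited.
function descendantsOf(folderId) {
  const found = [];
  let frontier = [folderId];
  while (frontier.length > 0) {
    const next = [];
    for (const id of frontier) {
      for (const child of childrenOf(id)) {
        found.push(child);
        if (child.folder) next.push(child._id);
      }
    }
    frontier = next;
  }
  return found;
}

// Naive substring match standing in for a real full-text search.
function searchInFolder(folderId, term) {
  return descendantsOf(folderId)
    .filter((item) => !item.folder && item.data.includes(term))
    .map((item) => item.name);
}

console.log(searchInFolder(1, "sales"));
```

Because files and folders live in one array here, a single lookup returns both kinds of children, which is the motivation for the single-collection suggestion.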

Related

MongoDB database design - contest application

I'm building a contest application, which has 4 collections so far:
contest
questions
matches
users
I want to store every user's score for every match they are assigned to, but I really can't find a proper way to achieve this.
All I've come up with is to replace matches in users with an array in which each element contains a reference to the matches collection and a score field. But I think this is not very efficient.
EDIT
I was thinking about another solution. A separate collection called scores that contains three fields user, match and score.
Here's my schema structure:
Contests:
Questions:
Matches:
Users:
Note: Any recommended adjustments to the current design are welcome too.

Since MongoDB is not designed around relationships between collections, you might end up with some duplicated work. I would suggest finding a way to store as much data as you can in a single document.
Your scores would go in each match document; the users array would probably have this structure: {'users':[{user_id:'xxx',score:xxx},{user_id:'xxx',score:xxx}]}
The other solution would be what you describe: to have, in each user document, a matches array with a structure like this: {'matches':[{match_id:'xxx',score:xxx},{match_id:'xxx',score:xxx}]}
You can also have both; this might be more efficient depending on the kind of queries you need to run. You can also add a field to the subdocuments that stores the user or match name/title.
Note: As you can see, you have two options: either you optimize for document size (so you can store more) or you optimize for performance (so you can read faster and with fewer resources).
Hope this helps.
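To make the trade-off a little more concrete, here is a minimal sketch of the first option (scores embedded in each match document), written in plain JavaScript over an in-memory array. The sample data and the function names `scoreIn` and `scoresOf` are assumptions for illustration only.

```javascript
// Hypothetical match documents with the scores embedded per user.
const matches = [
  { _id: "m1", users: [{ user_id: "u1", score: 7 }, { user_id: "u2", score: 4 }] },
  { _id: "m2", users: [{ user_id: "u1", score: 3 }] },
];

// Score of one user in one match: only a single match document is needed.
function scoreIn(matchId, userId) {
  const match = matches.find((m) => m._id === matchId);
  const entry = match && match.users.find((u) => u.user_id === userId);
  return entry ? entry.score : null;
}

// All scores of one user: in this sketch every match is inspected. This is
// the kind of query that a duplicated `matches` array stored in the user
// document would answer from a single document instead.
function scoresOf(userId) {
  const result = {};
  for (const match of matches) {
    const entry = match.users.find((u) => u.user_id === userId);
    if (entry) result[match._id] = entry.score;
  }
  return result;
}

console.log(scoreIn("m1", "u2"));
console.log(scoresOf("u1"));
```

Keeping both arrays speeds up both directions of lookup, at the cost of writing each score twice.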

Listing fields in a mongodb object

I am developing a system where items can be shared with other users via an access key. I'm storing the access keys as fields within a shareinfo object (embedded within the item's document), as shown below:
shareinfo:{
........
<nth key>: <permissions object - may be complex and large>
........
}
When an item is accessed, I look up shareinfo.key and check whether it's valid.
Currently, to list the keys I am loading (in Java) the entire shareinfo object in memory and running keySet() on it to retrieve and return the keys while the rest of the data is wasted.
Here's the problem: I want to get the list of keys (i.e. object field names) without the accompanying data (because in some cases the permissions object is noticeably large).
I could not find anything in the MongoDB docs for such a query. Is it possible at all? Or is there an optimized way to load the list of field names into the application without the accompanying field values?

I had the same issue when trying to understand the structure of an existing mongodb database using the mongodb shell. Then I found that I can use the JavaScript function Object.keys() to retrieve an array of the object fields. (Tested with MongoDb 2.4.2)
Object.keys(db.collection.findOne())

MongoDB has a schema-less design, which means that any document could have different fields contained within it from any other document. To get an exhaustive list of all the fields for all the documents, you would need to traverse all the documents in the collection and enumerate each field. For any reasonably sized collection this is going to be an extremely expensive operation.
There are a couple of helpers that make this simpler: Variety and schema.js. They both allow you to limit the documents inspected, and they report on coverage as well. Both amount to doing a map/reduce on the collections, but are at least simpler than having to write one yourself.

If you are just looking for top-level fields across all documents (and don't care about sub-fields or the detailed statistics provided by Variety), here is a one-liner:
db.things.aggregate([{$project: {arrayofkeyvalue: {$objectToArray: "$$ROOT"}}}, {$unwind:"$arrayofkeyvalue"}, {$group:{_id:null, allkeys:{$addToSet:"$arrayofkeyvalue.k"}}}])
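For readers unfamiliar with the pipeline stages, the same logic can be sketched in plain JavaScript over an in-memory array. The sample documents and the function name `allTopLevelKeys` are assumptions for illustration; the comments map each step to the stage it loosely corresponds to.

```javascript
// Hypothetical documents with differing top-level fields.
const things = [
  { _id: 1, name: "a", size: 3 },
  { _id: 2, name: "b", tags: ["x"] },
  { _id: 3, color: "red" },
];

// Union of top-level field names across all documents:
// list each document's keys (roughly $objectToArray plus $unwind),
// then merge them into one set (roughly $group with $addToSet).
function allTopLevelKeys(docs) {
  const keys = new Set();
  for (const doc of docs) {
    for (const key of Object.keys(doc)) keys.add(key);
  }
  return [...keys];
}

console.log(allTopLevelKeys(things));
```

Like the pipeline, this still visits every document, so the cost grows with the size of the collection.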

NoSQL schema for folder structure

I have documents that represent a folder structure. A folder can contain other folders (nested), theoretically unlimited levels deep but more realistically 3 or 4 levels for our application. I need to be able to retrieve a single item (a node) and perhaps embedding will make this task a bit difficult?
Any suggestions?

The docs give a great summary of the more popular/common ways to store hierarchical data in mongodb.
Embedding documents - have significant drawbacks
Hard to search
Hard to get back partial results
Can get unwieldy if you need a huge tree. Further there is a limit on the size of documents in MongoDB – 16MB in v1.8 (limit may rise in future versions).
As you need to be able to retrieve single items - that is not likely to be the best option for your use case.
Array of ancestors or materialized path are likely to be much more suitable for what you've described - you could choose to use the full filepath for _id since that is unique and the path you would want to find data by more commonly.
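A minimal sketch of the materialized-path idea, again in plain JavaScript with an in-memory array standing in for the collection. The sample paths and the helpers `getNode` and `subtreeOf` are assumptions for illustration; against MongoDB the prefix match would typically be an anchored regular expression on _id.

```javascript
// Hypothetical folder documents using the full path as _id.
const nodes = [
  { _id: "/projects" },
  { _id: "/projects/alpha" },
  { _id: "/projects/alpha/specs" },
  { _id: "/projects/beta" },
  { _id: "/projectsX" },
];

// Retrieve a single node directly by its path.
function getNode(path) {
  return nodes.find((n) => n._id === path) || null;
}

// Whole subtree below a path: a prefix match on the path. The trailing "/"
// keeps "/projectsX" from being treated as a child of "/projects".
function subtreeOf(path) {
  const prefix = path + "/";
  return nodes.filter((n) => n._id.startsWith(prefix)).map((n) => n._id);
}

console.log(getNode("/projects/alpha"));
console.log(subtreeOf("/projects"));
```

One design consequence worth keeping in mind: moving or renaming a folder changes the path of every descendant, and since _id cannot be updated in place, those documents would have to be reinserted.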

MongoDB embedded documents vs. referencing by unique ObjectIds for a system user profile

I'd like to code a web app where most of the sections depend on the user profile (for example, different to-do lists per person), and I'd love to use MongoDB. I was thinking of creating about 10 embeddable documents for the main profile document and keeping everything related to one user inside their own document.
I don't see a clear way of using foreign keys for mongodb, the only way would be to create a field to_do_id with the type of ObjectId for example, but they would be totally unrelated internally, just happen to have the same Ids I'd have to query for.
Is there a limit on the number of embedded document types inside a top level document that could degrade performance?
How do you guys solve the issue of having a central profile document that most of the documents have to relate to in presenting a view per person?
Do you use semi foreign keys inside MongoDb and have fields with ObjectId types that would have some other document's unique Id instead of embedding them?
I can't tell which approach should be taken when. Thank you very much!

There is no special limit with respect to performance. The larger the document, though, the longer it takes to transmit over the wire. The whole document is always retrieved.
I do it with references. You can choose between simple manual references and the database DBRef as per this page: http://www.mongodb.org/display/DOCS/Database+References
The link above documents how to have references in a document in a semi-foreign key way. The DBRef might be good for what you are trying to do, but the simple manual reference is very efficient.
I am not sure a general rule of thumb exists for which reference approach is best. Since I use Java or Groovy mostly, I like the fact that I get a DBRef object returned. I can check for this datatype and use that to decide how to handle the reference in a generic way.
So I tend to use a simple manual reference for references to different documents in the same collection, and a DBRef for references across collections.
I hope that helps.
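To show the difference between the two reference styles, here is a minimal sketch in plain JavaScript. The in-memory `db` object, the collection names, and the `resolve` helper are assumptions for illustration; the only detail taken from MongoDB is that a DBRef carries the collection name in `$ref` and the target id in `$id`.

```javascript
// Hypothetical in-memory "database": collection name -> documents.
const db = {
  profiles: [{ _id: "p1", name: "Ann" }],
  todos: [
    // Manual reference: just the _id; the target collection is implied by the code.
    { _id: "t1", title: "Buy milk", profile_id: "p1" },
    // DBRef-style reference: names the collection alongside the id.
    { _id: "t2", title: "Call Bob", profile: { $ref: "profiles", $id: "p1" } },
  ],
};

function findById(collection, id) {
  return db[collection].find((doc) => doc._id === id) || null;
}

// Generic resolver: a value carrying $ref and $id is treated as a DBRef,
// anything else as a manual reference into the caller-supplied collection.
function resolve(ref, defaultCollection) {
  if (ref !== null && typeof ref === "object" && "$ref" in ref && "$id" in ref) {
    return findById(ref.$ref, ref.$id);
  }
  return findById(defaultCollection, ref);
}

console.log(resolve(db.todos[0].profile_id, "profiles"));
console.log(resolve(db.todos[1].profile));
```

The manual reference is smaller but relies on the application knowing which collection to query; the DBRef is self-describing, which is what makes generic handling possible.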

Zend: index generation and the pros and cons of Zend_Search_Lucene

I've never come across an app/class like Zend Search Lucene before, as I've always queried my database directly.
Zend_Search_Lucene operates with documents as atomic objects for indexing. A document is divided into named fields, and fields have content that can be searched.
A document is represented by the Zend_Search_Lucene_Document class, and objects of this class contain instances of Zend_Search_Lucene_Field that represent the fields on the document.
It is important to note that any information can be added to the index. Application-specific information or metadata can be stored in the document fields, and later retrieved with the document during search.
So this is basically saying that I can apply this to anything, including databases; the key thing here is building indexes for searching.
What I'm trying to grasp is where exactly I should store the indexes in my application. Say, for example, we have phones stored in a database, with manufacturers and models: how should I categorize the indexes?
If I'm making indexes of users with, say, addresses, I obviously wouldn't want them to be publicly viewable. I'm just confused about how it all fits together, and whether there are known disadvantages or gotchas I should know about when using it.

A Lucene index is stored outside the database. I'd store it in a "data" directory as a sister to your controllers, models, and views. But you can store it anywhere; you just need to specify the path when you open the index for querying.
It's basically a redundant copy of the documents stored in your database, and you have to keep them in sync yourself. That's one of the disadvantages: you have to write code to populate the Lucene index based on results of a query against your database. As you add data to the database, you have to update your Lucene index as well.
An advantage of using an external full-text index solution is that you can reduce the workload on your RDBMS. To find a document, you execute a search using the Lucene API. The result should include a field containing the primary key value (as part of the document but no need to make it analyzed for FT search). You get this field back when you do a Lucene search, so you can look up the respective row in the database.
Does that help answer your question?
I gave a presentation recently for MySQL University comparing full-text search solutions:
http://forge.mysql.com/wiki/Practical_Full-Text_Search_in_MySQL
I also publish my slides at http://www.SlideShare.net/billkarwin.
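The flow described above (a redundant index kept outside the database, storing the primary key so the row can be looked up afterwards) can be sketched with a toy inverted index. This is plain JavaScript and not the Zend_Search_Lucene API; the phone rows and all function names are assumptions for illustration.

```javascript
// Hypothetical rows standing in for a database table of phones.
const phones = new Map([
  [1, { id: 1, manufacturer: "Nokia", model: "3310 classic" }],
  [2, { id: 2, manufacturer: "Apple", model: "iPhone classic edition" }],
]);

// Toy inverted index kept outside the "database": term -> set of primary keys.
const index = new Map();

function indexRow(row) {
  const text = `${row.manufacturer} ${row.model}`.toLowerCase();
  for (const term of text.split(/\s+/)) {
    if (!index.has(term)) index.set(term, new Set());
    index.get(term).add(row.id);
  }
}

// The index is a redundant copy, so every insert has to be mirrored by hand.
function addPhone(row) {
  phones.set(row.id, row);
  indexRow(row);
}

// The index yields primary keys only; the full rows come from the database.
function search(term) {
  const ids = index.get(term.toLowerCase()) || new Set();
  return [...ids].map((id) => phones.get(id));
}

// Initial population of the index from the existing rows.
for (const row of phones.values()) indexRow(row);
addPhone({ id: 3, manufacturer: "Nokia", model: "N95" });

console.log(search("nokia").map((row) => row.model));
```

If `addPhone` skipped the `indexRow` call, the new row would exist in the database but never show up in search results, which is the synchronization burden mentioned above.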
