Mongodb Indexing Issue - mongodb

I have a collection in which below is the data:
"sel_att" : {
"Technical Specifications" : {
"In Sales Package" : "Charger, Handset, User Manual, Extra Ear Buds, USB Cable, Headset",
"Warranty" : "1 year manufacturer warranty for Phone and 6 months warranty for in the box accessories"
},
"General Features" : {
"Brand" : "Sony",
"Model" : "Xperia Z",
"Form" : "Bar",
"SIM Size" : "Micro SIM",
"SIM Type" : "Single Sim, GSM",
"Touch Screen" : "Yes, Capacitive",
"Business Features" : "Document Viewer, Pushmail (Mail for Exchange, ActiveSync)",
"Call Features" : "Conference Call, Hands Free, Loudspeaker, Call Divert",
"Product Color" : "Black"
},
"Platform/Software" : {
"Operating Frequency" : "GSM - 850, 900, 1800, 1900; UMTS - 2100",
"Operating System" : "Android v4.1 (Jelly Bean), Upgradable to v4.4 (KitKat)",
"Processor" : "1.5 GHz Qualcomm Snapdragon S4 Pro, Quad Core",
"Graphics" : "Adreno 320"
}
}
The data above is only a sample: the real data set is huge and all of the fields are inserted dynamically. How can I index such fields to get faster results?

It seems to me that you have not fully understood the power of document based databases such as MongoDB.
Below are just a few thoughts:
you have 1 million records
you have 1 million index values for that collection
you need enough RAM available to hold 1 million index values in memory, otherwise the benefits of indexing will hardly show
yes, you can shard, but you need lots of hardware to accommodate even basic needs
What you really need is something that can dynamically link arbitrary text to useful indexes and lets you search vast amounts of text very fast. For that you should use a tool like ElasticSearch.
Note that you can and should store your content in a NoSQL database, and yes, MongoDB is a viable option. For the indexing part, ElasticSearch has plugins available to enhance the communication between the two.
P.S. If I recall correctly the plugin is called MongoDB River
EDIT:
I've also added a more comprehensive definition for ElasticSearch. I won't take credit for it since I've grabbed it from Wikipedia:
Elasticsearch is a search server based on Lucene. It provides a
distributed, multitenant-capable full-text search engine with a
RESTful web interface and schema-free JSON documents
EDIT 2:
I've scaled down a bit on the numbers since it might be far-fetched for most projects. But the main idea remains the same. Indexes are not recommended for the use-case described in the question.

Which fields you index depends on what you want to query. You can also have secondary indexes in MongoDB. But beware: creating many indexes may improve query performance, but they consume additional disk space and make inserts slower, because every index has to be updated on each write.
MongoDB indexes

Short answer: you can't. Use Elastic Search.
Here is a good tutorial to setup MongoDB River on Elastic Search
The reason is simple: MongoDB does not work like that. It helps you store complex, schemaless sets of documents, but you cannot index dozens of different fields and hope to get good performance. Generally a maximum of 5-6 indices per collection is recommended.
Elastic Search is commonly used in the fashion described above in many other use-cases, so it is an established pattern. For example, Titan Graph DB has the built-in option to use ES for this purpose. If I were you, I would just use that and would not try to make MongoDB do something it is not built to do.
If you have the time and your data structure lends itself to it (I think it might, from the JSON above), you could also use an RDBMS to break down these pieces and store them on the fly with an EAV-like pattern. Elastic Search would be easier to start with and would probably get you to good performance sooner.

Well, there are lots of problems with having many indexes, and they have been discussed here. But if you do need to add indexes for dynamic fields, you can create the index from your MongoDB driver.
So, let's say you are using the MongoDB Java driver; then you could create an index like below: http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-java-driver/#creating-an-index
coll.createIndex(new BasicDBObject("i", 1)); // create index on "i", ascending
PYTHON
http://api.mongodb.org/python/current/api/pymongo/collection.html#pymongo.collection.Collection.create_index
So, when you are populating data using any of the drivers and you find a new field that has come through, you could trigger index creation from the driver itself and not have to do it manually.
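The idea above can be sketched in Python. This is a minimal illustration, not a tested production approach: `dotted_paths`, `paths_needing_index` and the in-process `indexed` set are hypothetical names, and the sketch only works out which field paths are new. With PyMongo, each new path could then be passed to `Collection.create_index(path, sparse=True)`.

```python
def dotted_paths(doc, prefix=""):
    """Yield the dotted path of every leaf field in a nested document."""
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Descend into sub-documents such as "General Features".
            yield from dotted_paths(value, path)
        else:
            yield path


def paths_needing_index(doc, already_indexed):
    """Return the paths in doc that have not been seen yet, and remember them."""
    new_paths = [p for p in dotted_paths(doc) if p not in already_indexed]
    already_indexed.update(new_paths)
    return new_paths


indexed = set()  # hypothetical in-process cache of paths that already have an index
doc = {"sel_att": {"General Features": {"Brand": "Sony", "Form": "Bar"}}}

first = paths_needing_index(doc, indexed)   # both paths are new
second = paths_needing_index(doc, indexed)  # nothing new the second time
```

Keeping the set of known paths in the application avoids asking the server to create an index on every insert, but the number of indexes still grows with the number of distinct fields.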
P.S.: I have not tried this and it might not be suitable or advisable.
Hope this helps!

Indexing of dynamic fields is tricky. When this was written there was no such thing as wildcard indexes (they were only added later, in MongoDB 4.2). Without them, your options would be:
Option A: Whenever you insert a new document, do an ensureIndex with the option sparse:true for each of its fields. This does nothing when the index already exists and creates a new one when it's a new field. The drawback will be that you will end up with a very large number of indexes and that inserts could get slow because of all the new and old indexes which need to be created/updated.
Option B: Forget about the field-names and refactor your documents to an array of key/value pairs. So
"General Features" : {
"Brand" : "Sony",
"Form" : "Bar"
},
"Platform/Software" : {
"Processor" : "1.5 GHz Qualcomm",
"Graphics" : "Adreno 320"
}
becomes
properties: [
{ category: "General Features", key: "Brand", value: "Sony" },
{ category: "General Features", key: "Form", value: "Bar" },
{ category: "Platform/Software", key: "Processor", value: "1.5 GHz Qualcomm" },
{ category: "Platform/Software", key: "Graphics", value: "Adreno 320" }
]
This allows you to create a single compound index on properties.category and properties.key to cover all the array entries.
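A minimal Python sketch of the Option B refactoring, assuming the two-level category/field layout shown in the question. The helper name `to_properties` is hypothetical, and the `$elemMatch` filter at the end is only one way such documents might be queried.

```python
def to_properties(sel_att):
    """Flatten {category: {key: value}} into a list of key/value entries."""
    properties = []
    for category, fields in sel_att.items():
        for key, value in fields.items():
            properties.append({"category": category, "key": key, "value": value})
    return properties


sel_att = {
    "General Features": {"Brand": "Sony", "Form": "Bar"},
    "Platform/Software": {"Processor": "1.5 GHz Qualcomm", "Graphics": "Adreno 320"},
}
properties = to_properties(sel_att)

# With PyMongo, the single compound index could be created with
# coll.create_index([("properties.category", 1), ("properties.key", 1)]).
# A filter for "Brand is Sony" that matches within one array entry:
brand_filter = {
    "properties": {
        "$elemMatch": {"category": "General Features", "key": "Brand", "value": "Sony"}
    }
}
```

New field names then become new array entries rather than new indexed paths, so the index definition never has to change.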

Related

Couchbase many to many relationship modeling

I am trying to figure out how to model my data in a many-to-many relationship in Couchbase (I'm using N1QL as well).
I have two entities: Clients and Projects.
Client - each client can create many projects - approximately 2000 projects per year.
Project - each project can belong to many clients (maximum 50 clients).
I thought maybe creating a new document for each site/project, but according to Couchbase documentation on data modeling:
This typically isn’t a good approach in Couchbase Server as
referencing and embedding provides a great deal of flexibility to
avoid creating this redundant document.
How should I store the data ?
Any suggestion/advice would be helpful.
Thanks.
Please refer to the following URL to resolve the above issue:
https://developer.couchbase.com/documentation/server/current/data-modeling/modeling-relationships.html
That quote is referencing "relationship documents". In your case, that would mean you'd have a client document, a project document, and some sort of client-project mapping document. I would agree that a document only for a relationship would not be a useful approach, unless you intend to store a lot of information about that relationship.
Based on the information you've given, I'd recommend storing Client documents and Project documents. Based on the numbers, I'd say the projects should contain a list of Client document IDs.
Something like:
key client::001
{
"name" : "Clienty McClientface",
"address" : "123 main st",
"foo" : "bar",
"type" : "client"
}
key project::001
{
"name" : "Alan Parsons Project",
"startDate" : "2012-09-27",
"clients" : [
"client::001",
"client::007",
"client::123",
// ... etc ...
],
"type" : "project"
}
But in general, it depends on what your use cases are for reads, writes, queries. No data model will fit every use case.
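To illustrate how the two lookups work with this model, here is a small in-memory Python sketch. It does not use the Couchbase SDK. The `documents` dict stands in for a bucket keyed by document ID, and `client::007`'s name, `project::002` and both helper functions are made up for the example.

```python
# Stand-in for a bucket: document key -> document body.
documents = {
    "client::001": {"name": "Clienty McClientface", "type": "client"},
    "client::007": {"name": "Second Client", "type": "client"},  # hypothetical
    "project::001": {
        "name": "Alan Parsons Project",
        "type": "project",
        "clients": ["client::001", "client::007"],
    },
    "project::002": {  # hypothetical second project
        "name": "Another Project",
        "type": "project",
        "clients": ["client::007"],
    },
}


def projects_for_client(docs, client_id):
    """Keys of all project documents whose client list contains client_id."""
    return sorted(
        key
        for key, doc in docs.items()
        if doc.get("type") == "project" and client_id in doc.get("clients", [])
    )


def clients_of_project(docs, project_key):
    """Follow the stored client IDs of one project to the client documents."""
    return [docs[cid] for cid in docs[project_key]["clients"] if cid in docs]
```

Going from a project to its clients is a handful of key lookups, while going from a client to its projects is a search over the `clients` arrays, which is the direction that would need an index in a real deployment.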

efficiency of indexing frequently updated collection in mongodb

I am a newbie with MongoDB, so my questions might be trivial ... I want to allow my users to upload their address book. The document has the following structure:
{
"_id" : "56f29ecc2a00001800dbdf54",
"contacts" : [
{
"name" : "John",
"phoneNumber" : [
"+18144040000"
]
},
{
"name" : "Andrew ",
"phoneNumber" : [
"+14129123456"
]
}
]
}
I would like to search by phone number in order to find users with mutual contacts,
i.e.
{"contacts.phoneNumber":"+14129123456"}
My question is: will it be efficient to add this index
db.addresses.createIndex( { "contacts.phoneNumber": 1 }, { unique: false, background: true } )
considering that the user will frequently update his address book from his phone, which overrides the current data or inserts new data? Each upload is a single document with an array of contacts, each holding an array of phone numbers.
Each upload/update will contain hundreds or thousands of records.
Your index makes sense. Regarding efficiency, there is a trade-off between reads and writes. Typically, a user interface is expected to respond quickly to any search (i.e. a read), so creating an index on the searched field is unavoidable. On that basis, indexing on "phone number" is fine, given that the use case requires querying on phone number directly.
Indexing the document will degrade write performance. However, this particular index shouldn't degrade it drastically. Having said that, if uploads do take longer, you may need to reconsider the UI design and show a progress bar for the upload, which is typical for any large upload.
Also, you can look at the write concern option available in MongoDB. You can configure whether or not you expect an acknowledgement from the driver.
If you go with a write concern without acknowledgement, you will get better write performance. However, most applications expect an acknowledgement on writes to ensure that the write was successful.
https://docs.mongodb.com/manual/reference/write-concern/
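To see where the write cost comes from: an index on an array field is a multikey index, which holds one entry per array element. The pure-Python sketch below only simulates that idea and is not how MongoDB stores indexes internally. The user IDs and the helpers `build_phone_index` and `users_sharing_contacts` are hypothetical.

```python
from collections import defaultdict


def build_phone_index(address_books):
    """Map each phone number to the set of address-book owners that list it."""
    index = defaultdict(set)
    for book in address_books:
        for contact in book["contacts"]:
            for number in contact["phoneNumber"]:
                # One index entry per phone number, per document.
                index[number].add(book["_id"])
    return index


def users_sharing_contacts(index, user_id):
    """Owners who have at least one phone number in common with user_id."""
    shared = set()
    for owners in index.values():
        if user_id in owners:
            shared |= owners
    shared.discard(user_id)
    return shared


books = [
    {"_id": "user-a", "contacts": [
        {"name": "John", "phoneNumber": ["+18144040000"]},
        {"name": "Andrew", "phoneNumber": ["+14129123456"]},
    ]},
    {"_id": "user-b", "contacts": [{"name": "Andy", "phoneNumber": ["+14129123456"]}]},
    {"_id": "user-c", "contacts": [{"name": "Zoe", "phoneNumber": ["+15550000000"]}]},
]
index = build_phone_index(books)
```

A lookup by number is a single index probe, but replacing an address book with thousands of numbers means thousands of index entries have to be removed and re-added, which is the read/write trade-off described above.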

Extract data lists from Mongo Documents

As a mongo/nosql newbie with a RDBMS background I wondered what's the best way to proceed.
Currently I've got a large set of documents in which some fields contain what I consider "reference data".
I need a search interface that summarizes the possible values of those "reference fields", so that users can then filter the document set on them.
Let's take a very simple and stupid example about nourishment.
Here is an extract of some mongo documents:
{ "_id": 1, "name": "apple", "category": "fruit"}
{ "_id": 2, "name": "orange", "category": "fruit"}
{ "_id": 3, "name": "cucumber", "category": "vegetable"}
In the application I'd like to have a select box displaying all the possible values for "category". Here it would display "fruit" and "vegetable".
What's the best way to proceed?
extract the data from the existing documents?
create some reference documents listing unique possible values (as I would do in RDBMS )
store reference data in a rdbms and programatically link mongo and rdbms...
something else ?
The first option is the easiest to implement and should be efficient if you have indexes properly set (see distinct command), so I would go with this.
You could also choose the second option (linking to a reference collection - RDBMS way) which trades performance (you will need more queries for fetching data) for space (you will need less space). Also, this option is preferred if the category is used in other collections as well.
I would advise against using a mixed system (NoSQL + RDBMS) in this case as the other options are better.
You could also store category values directly in application code - depends on your use case. Sometimes it makes sense, although any RDBMS fanatic would burst into tears (or worse) if you tell him that. YMMV. ;)
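As a rough illustration of the first option, this Python sketch computes the same result a distinct query would return, using an in-memory list in place of the collection. The helper `distinct_values` is hypothetical. With PyMongo the server-side equivalent would be `collection.distinct("category")`.

```python
def distinct_values(documents, field):
    """Sorted unique values of `field` across documents that have it."""
    return sorted({doc[field] for doc in documents if field in doc})


foods = [
    {"_id": 1, "name": "apple", "category": "fruit"},
    {"_id": 2, "name": "orange", "category": "fruit"},
    {"_id": 3, "name": "cucumber", "category": "vegetable"},
]

# Values to show in the select box.
categories = distinct_values(foods, "category")
```

With an index on `category`, the database can answer the distinct query from the index rather than by scanning every document, which is why this option stays cheap as the collection grows.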

Can one make a relational database using MongoDB?

I am going to make a student management system using MongoDB. I will have one table for students and another for attendance records. Can I have a key in the attendance table to reach the students table, as pictured below? How?
The idea behind MongoDB is to eliminate (or at least minimize) relational data. Have you considered just embedding the attendance data directly into each student record? This is actually the preferred design pattern for MongoDB and can result in much better performance and scalability.
If you truly need highly relational and normalized data, you might want to reconsider using MongoDB.
The answer depends on how you intend to use the data. You really have 2 options, embed the attendance table, or link it. More on these approaches is detailed here: http://www.mongodb.org/display/DOCS/Schema+Design
For the common use-case, you would probably embed this particular collection, so each student record would have an embedded "attendance" table. This would work because attendance records are unlikely to be shared between students, and retrieving the attendance data is likely to require the student information as well. Retrieving the attendance data would be as simple as:
db.student.find( { login : "sean" } )
{
login : "sean",
first : "Sean",
last : "Hodges",
attendance : [
{ class : "Maths", when : Date("2011-09-19T04:00:10.112Z") },
{ class : "Science", when : Date("2011-09-20T14:36:06.958Z") }
]
}
Yes. There are no hard and fast rules. You have to look at the pros and cons of either embedding or referencing data. This video will definitely help (https://www.youtube.com/watch?v=-o_VGpJP-Q0&t=21s). In your example, the phone number attribute should be on the same table (in a document database), because the phone number of a person rarely changes.

How do document databases deal with changing relationships between objects (or do they at all)?

Say, at the beginning of a project, I want to store a collection of Companies, and within each company, a collection of Employees.
Since I'm using a document database (such as MongoDB), my structure might look something like this:
+ Customers[]
+--Customer
+--Employees[]
+--Employee
+--Employee
+--Customer
+--Employees[]
+--Employee
What happens if, later down the track, a new requirement is to have some Employees work at multiple Companies?
How does one manage this kind of change in a document database?
Doesn't the simplicity of a document database become your worst enemy, since it creates brittle data structures which can't easily be modified?
In the example above, I'd have to run modify scripts to create a new 'Employees' collection, and move every employee into that collection, while maintaining some sort of relationship key (e.g. a CompanyID on each employee).
If I did the above thoroughly enough, I'd end up with many collections, and very little hierarchy, and documents being joined by means of keys.
In that case, am I still using the document database as I should be?
Isn't it becoming more like a relational database?
Speaking about MongoDB specifically...because the database doesn't enforce any relationships like a relational database, you're on the hook for maintaining any sort of data integrity such as this. It's wonderfully helpful in many cases, but you end up writing more application code to handle these sorts of things.
Having said all of that, the key to using a system like MongoDB is modeling your data to fit MongoDB. What you have above makes complete sense if you're using MySQL...using Mongo you'd absolutely get in trouble if you structure your data like it's a relational database.
If you have Employees who can work at one or more Companies, I would structure it as:
// company records
{ _id: 12345, name : 'Apple' }
{ _id: 55555, name : 'Pixar' }
{ _id: 67890, name : 'Microsoft' }
// employees
{ _id : ObjectId('abc123'), name : "Steve Jobs", companies : [ 12345, 55555 ] }
{ _id : ObjectId('abc456'), name : "Steve Ballmer", companies : [ 67890 ] }
You'd add an index on employees.companies, which would make it very fast to get all of the employees who work for a given company...regardless of how many companies they work for. Maintaining a short list of companies per employee will be much easier than maintaining a large list of employees for a company. Getting all of the data for a company and all of its employees would take two (fast) queries.
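The two-query pattern can be sketched in Python with in-memory lists standing in for the two collections. The string `_id` values and the helper `company_with_employees` are made up for the illustration. With PyMongo the two steps would correspond to `find_one({"_id": company_id})` on companies and `find({"companies": company_id})` on employees.

```python
companies = [
    {"_id": 12345, "name": "Apple"},
    {"_id": 55555, "name": "Pixar"},
    {"_id": 67890, "name": "Microsoft"},
]
employees = [
    {"_id": "abc123", "name": "Steve Jobs", "companies": [12345, 55555]},
    {"_id": "abc456", "name": "Steve Ballmer", "companies": [67890]},
]


def company_with_employees(company_id):
    """Fetch one company, then everyone whose companies list contains its id."""
    # Query 1: the company document itself.
    company = next(c for c in companies if c["_id"] == company_id)
    # Query 2: employees referencing that company (the indexed lookup).
    staff = [e["name"] for e in employees if company_id in e["companies"]]
    return {"company": company["name"], "employees": staff}
```

For example, `company_with_employees(55555)` combines the Pixar record with the one employee whose list includes 55555.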
Doesn't the simplicity of a document
database become your worst enemy,
since it creates brittle data
structures which can't easily be
modified?
The simplicity can bite you, but it's very easy to update and change at a later time. You can script changes in JavaScript and run them via the Mongo shell.
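Such a migration script could look roughly like the following, written in Python here rather than as a shell script. It assumes each embedded employee already carries a stable `_id` that identifies the same person across customers, which the question does not state. The function name and the sample data are hypothetical.

```python
def split_out_employees(customers):
    """Turn customers with embedded employees into two flat collections."""
    companies = []
    employees = {}  # employee _id -> employee record with a companies list
    for customer in customers:
        companies.append({"_id": customer["_id"], "name": customer["name"]})
        for emp in customer.get("employees", []):
            record = employees.setdefault(
                emp["_id"], {"_id": emp["_id"], "name": emp["name"], "companies": []}
            )
            # The same person embedded under two customers ends up as one
            # record that references both.
            record["companies"].append(customer["_id"])
    return companies, list(employees.values())


customers = [
    {"_id": 1, "name": "Acme", "employees": [{"_id": "e1", "name": "Ann"}]},
    {"_id": 2, "name": "Globex", "employees": [
        {"_id": "e1", "name": "Ann"},
        {"_id": "e2", "name": "Bob"},
    ]},
]
new_companies, new_employees = split_out_employees(customers)
```

The output would then be written to the new collections, and the embedded arrays dropped once the application reads from the new structure.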
My recent answer for this question covers this in the RavenDb context:
How would I model data that is heirarchal and relational in a document-oriented database system like RavenDB?