I'm trying to figure out how to model my data in a many-to-many relationship in Couchbase (I'm using N1QL as well).
I have two entities: Clients and Projects.
Client - each client can create many projects - approximately 2000 projects per year.
Project - each project can belong to many clients (maximum 50 clients).
I thought about maybe creating a new document for each site/project, but according to the Couchbase documentation on data modeling:
This typically isn’t a good approach in Couchbase Server as referencing and embedding provides a great deal of flexibility to avoid creating this redundant document.
How should I store the data?
Any suggestion/advice would be helpful.
Thanks.
Please refer to the following URL to resolve the above issue:
https://developer.couchbase.com/documentation/server/current/data-modeling/modeling-relationships.html
That quote is referencing "relationship documents". In your case, that would mean you'd have a client document, a project document, and some sort of client-project mapping document. I would agree that a document only for a relationship would not be a useful approach, unless you intend to store a lot of information about that relationship.
Based on the information you've given, I'd recommend storing Client documents and Project documents. Based on the numbers, I'd say the projects should contain a list of Client document IDs.
Something like:
key client::001
{
    "name" : "Clienty McClientface",
    "address" : "123 main st",
    "foo" : "bar",
    "type" : "client"
}
key project::001
{
    "name" : "Alan Parsons Project",
    "startDate" : "2012-09-27",
    "clients" : [
        "client::001",
        "client::007",
        "client::123",
        // ... etc ...
    ],
    "type" : "project"
}
But in general, it depends on what your use cases are for reads, writes, queries. No data model will fit every use case.
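Since you mention you're using N1QL, here is a sketch of how you might fetch all projects for a given client under this model. The bucket name `projects` is an assumption, and the array index syntax requires Couchbase 4.5+:

/* Array index so the ANY clause below can be served by an index */
CREATE INDEX idx_project_clients
ON `projects`(DISTINCT ARRAY c FOR c IN clients END)
WHERE type = "project";

/* All projects a given client participates in */
SELECT p.*
FROM `projects` p
WHERE p.type = "project"
AND ANY c IN p.clients SATISFIES c = "client::001" END;

The reverse direction (all clients for a project) is just a key-value fetch of the project document followed by a multi-get of the IDs in its clients array.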
I am subscribing to Orion Context Broker data using Cygnus. Cygnus stores the data in MongoDB like the following. Is there a possibility to store attrValue as a float rather than a String, to be able to use Mongo's aggregation features?
> db['cygnus_/kurapath_enocean_power_enocean'].find().pretty()
{
"_id" : ObjectId("55e81e9631d7791085668331"),
"recvTime" : ISODate("2015-09-03T10:19:02Z"),
"attrName" : "power",
"attrType" : "string",
"attrValue" : "2085.0"
}
Not currently, mainly because Cygnus does not (always) receive information about the real type of an entity's attribute. The entity "type" Orion sends is just a description of the type; I mean, it could be anything, like "float" or "number_of_potatos". It is true that some reserved words, such as "float", have been chosen in recent versions of Orion in order to describe effective float numbers, and in that case the type could be used to persist effective float numbers in MongoDB (or whatever backend you use), but many other attributes will continue to have an unknown type. Thus, currently everything is persisted as a string.
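That said, if your MongoDB version is recent enough to support $toDouble (4.0+), you can cast the string at query time as a workaround. A minimal sketch against the collection from the question:

// Cast attrValue from string to double on the fly, then aggregate
db['cygnus_/kurapath_enocean_power_enocean'].aggregate([
    { $match : { attrName : "power" } },
    { $group : {
        _id : "$attrName",
        avgPower : { $avg : { $toDouble : "$attrValue" } }
    } }
])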
In addition, it must be said that another feature is under study: the possibility of notifying special entities, the "entity models", which fully describe a class of entities.
Most probably, upcoming releases will implement effective typing along some of the directions above.
Anyway, did you see OrionSTHSink? Despite its name (STH, Short-Term Historic), it is a sink that already creates aggregations of data in MongoDB.
I have a collection containing the following data:
"sel_att" : {
"Technical Specifications" : {
"In Sales Package" : "Charger, Handset, User Manual, Extra Ear Buds, USB Cable, Headset",
"Warranty" : "1 year manufacturer warranty for Phone and 6 months warranty for in the box accessories"
},
"General Features" : {
"Brand" : "Sony",
"Model" : "Xperia Z",
"Form" : "Bar",
"SIM Size" : "Micro SIM",
"SIM Type" : "Single Sim, GSM",
"Touch Screen" : "Yes, Capacitive",
"Business Features" : "Document Viewer, Pushmail (Mail for Exchange, ActiveSync)",
"Call Features" : "Conference Call, Hands Free, Loudspeaker, Call Divert",
"Product Color" : "Black"
},
"Platform/Software" : {
"Operating Frequency" : "GSM - 850, 900, 1800, 1900; UMTS - 2100",
"Operating System" : "Android v4.1 (Jelly Bean), Upgradable to v4.4 (KitKat)",
"Processor" : "1.5 GHz Qualcomm Snapdragon S4 Pro, Quad Core",
"Graphics" : "Adreno 320"
}
}
The data mentioned above is huge and the fields are all inserted dynamically; how can I index such fields to get faster query results?
It seems to me that you have not fully understood the power of document-based databases such as MongoDB.
Below are just a few thoughts:
you have 1 million records
you have 1 million index values for that collection
you need enough RAM to hold 1 million index values in memory, otherwise the benefits of indexing will hardly show up
yes, you can have sharding, but you need lots of hardware to accommodate basic needs
What you need for sure is something that can dynamically link arbitrary text to useful indexes and that allows you to search vast amounts of text very fast. And for that you should use a tool like Elasticsearch.
Note that you can and should store your content in a NoSQL database, and yes, MongoDB is a viable option. For the indexing part, Elasticsearch has plugins available to enhance the communication between the two.
P.S. If I recall correctly, the plugin is called MongoDB River.
EDIT:
I've also added a more comprehensive definition of Elasticsearch. I won't take credit for it, since I've grabbed it from Wikipedia:
Elasticsearch is a search server based on Lucene. It provides a distributed, multitenant-capable full-text search engine with a RESTful web interface and schema-free JSON documents.
EDIT 2:
I've scaled the numbers down a bit, since they might be far-fetched for most projects. But the main idea remains the same: indexes are not recommended for the use case described in the question.
Based on what you want to query, you will end up indexing those fields. You can also have secondary indexes in MongoDB. But beware: while indexes may improve your query performance, too many of them consume additional disk space and make inserts slower, because each index has to be updated on every write.
MongoDB indexes
Short answer: you can't. Use Elastic Search.
Here is a good tutorial to set up MongoDB River on Elastic Search.
The reason is simple: MongoDB does not work like that. It helps you store complex schema-less sets of documents, but you cannot index dozens of different fields and hope to get good performance. Generally a maximum of 5-6 indexes is recommended per collection.
Elastic Search is commonly used in the fashion described above in many other use cases, so it is an established pattern. For example, Titan Graph DB has the built-in option to use ES for this purpose. If I were you, I would just use that and would not try to make MongoDB do something it is not built to do.
If you have the time, and if your data structure lends itself to it (I think it might, judging from the JSON above), then you could also use an RDBMS to break these pieces down and store them on the fly with an EAV-like pattern. Elastic Search would be easier to start with and probably easier to get good performance from quickly.
Well, there are lots of problems with having many indexes, as has been discussed here. But if you really need to add indexes for dynamic fields, you can create the index from your MongoDB driver.
So, let's say you are using the MongoDB Java driver; then you could create an index like below: http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-java-driver/#creating-an-index
coll.createIndex(new BasicDBObject("i", 1)); // create index on "i", ascending
PYTHON
http://api.mongodb.org/python/current/api/pymongo/collection.html#pymongo.collection.Collection.create_index
So, when you are populating data using any of the drivers and you find that a new field has come through, you can fire index creation from the driver itself and not have to do it manually.
P.S.: I have not tried this and it might not be suitable or advisable.
Hope this helps!
Indexing of dynamic fields is tricky. There is no such thing as a wildcard index. Your options would be:
Option A: Whenever you insert a new document, do an ensureIndex with the option sparse:true for each of its fields. This does nothing when the index already exists and creates a new one when the field is new. The drawback is that you will end up with a very large number of indexes, and that inserts could get slow because of all the indexes that need to be created and updated.
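A shell sketch of Option A (the products collection name and the field paths are assumptions based on the question's document):

// One sparse index per field path; ensureIndex is a no-op if it already exists
db.products.ensureIndex({ "sel_att.General Features.Brand" : 1 }, { sparse : true })
db.products.ensureIndex({ "sel_att.General Features.Form" : 1 }, { sparse : true })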
Option B: Forget about the field names and refactor your documents to an array of key/value pairs. So
"General Features" : {
"Brand" : "Sony",
"Form" : "Bar"
},
"Platform/Software" : {,
"Processor" : "1.5 GHz Qualcomm",
"Graphics" : "Adreno 320"
}
becomes
properties: [
    { category: "General Features", key: "Brand", value: "Sony" },
    { category: "General Features", key: "Form", value: "Bar" },
    { category: "Platform/Software", key: "Processor", value: "1.5 GHz Qualcomm" },
    { category: "Platform/Software", key: "Graphics", value: "Adreno 320" }
]
This allows you to create a single compound index on properties.category and properties.key to cover all the array entries.
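A sketch of that compound index and a query that uses it (the products collection name is an assumption):

// One compound multikey index covers every category/key pair in the array
db.products.createIndex({ "properties.category" : 1, "properties.key" : 1 })

// Find products whose "General Features" include Brand = Sony;
// $elemMatch makes all conditions apply to the same array element
db.products.find({
    properties : { $elemMatch : { category : "General Features", key : "Brand", value : "Sony" } }
})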
I am going to make a student management system using MongoDB. I will have one table for students and another for attendance records. Can I have a key in the attendance table to reach the students table, as pictured below? How?
The idea behind MongoDB is to eliminate (or at least minimize) relational data. Have you considered just embedding the attendance data directly into each student record? This is actually the preferred design pattern for MongoDB and can result in much better performance and scalability.
If you truly need highly relational and normalized data, you might want to reconsider using MongoDB.
The answer depends on how you intend to use the data. You really have two options: embed the attendance table, or link it. More on these approaches is detailed here: http://www.mongodb.org/display/DOCS/Schema+Design
For the common use-case, you would probably embed this particular collection, so each student record would have an embedded "attendance" table. This would work because attendance records are unlikely to be shared between students, and retrieving the attendance data is likely to require the student information as well. Retrieving the attendance data would be as simple as:
db.student.find( { login : "sean" } )
{
    login : "sean",
    first : "Sean",
    last : "Hodges",
    attendance : [
        { class : "Maths", when : Date("2011-09-19T04:00:10.112Z") },
        { class : "Science", when : Date("2011-09-20T14:36:06.958Z") }
    ]
}
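And if you only need the attendance records themselves, a projection keeps the result small:

// Return just the embedded attendance array, not the whole student document
db.student.find({ login : "sean" }, { attendance : 1, _id : 0 })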
Yes. There are no hard and fast rules; you have to look at the pros and cons of either embedding or referencing data. This video will definitely help: https://www.youtube.com/watch?v=-o_VGpJP-Q0&t=21s. In your example, the phone number attribute should be in the same document (in a document database), because a person's phone number rarely changes.
Say, at the beginning of a project, I want to store a collection of Companies, and within each company, a collection of Employees.
Since I'm using a document database (such as MongoDB), my structure might look something like this:
+ Customers[]
+--Customer
+--Employees[]
+--Employee
+--Employee
+--Customer
+--Employees[]
+--Employee
What happens if, later down the track, a new requirement is to have some Employees work at multiple Companies?
How does one manage this kind of change in a document database?
Doesn't the simplicity of a document database become your worst enemy, since it creates brittle data structures which can't easily be modified?
In the example above, I'd have to run migration scripts to create a new 'Employees' collection and move every employee into that collection, while maintaining some sort of relationship key (e.g. a CompanyID on each employee).
If I did the above thoroughly enough, I'd end up with many collections, very little hierarchy, and documents joined by means of keys.
In that case, am I still using the document database as I should be?
Isn't it becoming more like a relational database?
Speaking about MongoDB specifically: because the database doesn't enforce relationships the way a relational database does, you're on the hook for maintaining any such data integrity yourself. It's wonderfully helpful in many cases, but you end up writing more application code to handle these sorts of things.
Having said all of that, the key to using a system like MongoDB is modeling your data to fit MongoDB. What you have above makes complete sense if you're using MySQL; using Mongo, you'd absolutely get in trouble if you structure your data like it's a relational database.
If you have Employees who can work at one or more Companies, I would structure it as:
// company records
{ _id: 12345, name : 'Apple' }
{ _id: 55555, name : 'Pixar' }
{ _id: 67890, name : 'Microsoft' }
// employees
{ _id : ObjectId('abc123'), name : "Steve Jobs", companies : [ 12345, 55555 ] }
{ _id : ObjectId('abc456'), name : "Steve Ballmer", companies : [ 67890 ] }
You'd add an index on employees.companies, which would make it very fast to get all of the employees who work for a given company, regardless of how many companies they work for. Maintaining a short list of companies per employee will be much easier than maintaining a large list of employees per company. Getting all of the data for a company and all of its employees would be two (fast) queries.
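A sketch of that index and those queries (the companies and employees collection names are assumptions):

// Multikey index on the array of company ids
db.employees.createIndex({ companies : 1 })

// All employees who work for Apple (company _id 12345 in the example above)
db.employees.find({ companies : 12345 })

// A company plus all of its employees: two fast queries
var company = db.companies.findOne({ _id : 12345 })
var staff = db.employees.find({ companies : company._id }).toArray()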
Doesn't the simplicity of a document database become your worst enemy, since it creates brittle data structures which can't easily be modified?
The simplicity can bite you, but it's very easy to update and change things later. You can script changes in JavaScript and run them via the Mongo shell.
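For instance, the Customers/Employees refactoring from the question could be scripted roughly like this (a sketch only; the customers and employees collection names and the companyIds field are assumptions):

// Pull embedded employees out into their own collection, keeping a reference
db.customers.find().forEach(function (customer) {
    (customer.employees || []).forEach(function (employee) {
        employee.companyIds = [ customer._id ];    // keep the relationship
        db.employees.insert(employee);             // move into the new collection
    });
    db.customers.update({ _id : customer._id }, { $unset : { employees : "" } });
});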
My recent answer for this question covers this in the RavenDb context:
How would I model data that is hierarchical and relational in a document-oriented database system like RavenDB?
My application creates pieces of data that, in xml, would look like this:
<resource url="someurl">
    <term>
        <name>somename</name>
        <frequency>somenumber</frequency>
    </term>
    ...
</resource>
This is how I'm storing these "resources" now: one resource per XML file, with as many "term" elements per "resource" as needed.
The problem is, I'll need to generate about 2 million of these resources.
I've generated almost 500,000 and my Mac isn't very happy about it.
So my question is: how should I store this data?
A database? That would be hard, because the structure of the data isn't fixed...
Maybe merge some resources into larger XML files?
...?
I don't need to change the data once it's created.
Right now I'm accessing a specific resource by the name of that resource's file.
Any suggestions are greatly appreciated!
Not all databases are relational. Have a look at, for example, MongoDB. It stores your data as JSON-like objects, similar to your resources.
An example using the shell:
$ mongo
> db.resources.save({url: "someurl",
terms: [{name: "name1", frequency: 17.0},
{name: "name2", frequency: 42.0}]})
> db.resources.find()
{"_id" : ObjectId( "4b00884b3a77b8b2fa3a8f77"),
"url" : "someurl" ,
"terms" : [{"name" : "name1" , "frequency" : 17},
{"name" : "name2" , "frequency" : 42}]}
If you can't predict how your data is going to be organized, maybe CouchDB (http://couchdb.apache.org/) could be interesting for you. It is a schema-less database.
Anyway, XML is maybe not the best choice for a big amount of data.
Maybe JSON or YAML would work out better? They need less space and are easier to parse (I have, however, no experience using those formats at larger scale; maybe I'm wrong).
You should definitely have several resources per XML file, but only if you expect to need all the resources together at the same time. If you only need to send a handful of resources to anybody, then keep making the individual XML files.
Even in that situation, you could keep the large XML file and generate the smaller ones on demand from the original dataset.
Using a database like SQLite3 would allow you to have faster seek times and easier manipulation of the data, using SQL syntax.