I am developing a small app which will store information on users, accounts and transactions. The users will have many accounts (probably less than 10) and the accounts will have many transactions (perhaps 1000's). Reading the Docs it seems to suggest that embedding as follows is the way to go...
{
"username": "joe",
"accounts": [
{
"name": "account1",
"transactions": [
{
"date": "2013-08-06",
"desc": "transaction1",
"amount": "123.45"
},
{
"date": "2013-08-07",
"desc": "transaction2",
"amount": "123.45"
},
{
"date": "2013-08-08",
"desc": "transaction3",
"amount": "123.45"
}
]
},
{
"name": "account2",
"transactions": [
{
"date": "2013-08-06",
"desc": "transaction1",
"amount": "123.45"
},
{
"date": "2013-08-07",
"desc": "transaction2",
"amount": "123.45"
},
{
"date": "2013-08-08",
"desc": "transaction3",
"amount": "123.45"
}
]
}
]
}
My question is... Since the list of transactions will grow to perhaps 1000's within the document will the data become fragmented and slow the performance. Would I be better to have a document to store the users and the accounts which will not grow as big and then a separate collection to store transactions which are referenced to the accounts. Or is there a better way?
This is not the way to go. You have a lot of transactions, and you don't know how many you will get. Instead of this, you should store them like:
{
"username": "joe",
"name": "account1",
"date": "2013-08-06",
"desc": "transaction1",
"amount": "123.45"
},
{
"username": "joe",
"name": "account1",
"date": "2013-08-07",
"desc": "transaction2",
"amount": "123.45"
},
{
"username": "joe",
"name": "account1",
"date": "2013-08-08",
"desc": "transaction3",
"amount": "123.45"
},
{
"username": "joe",
"name": "account2",
"date": "2013-08-06",
"desc": "transaction1",
"amount": "123.45"
},
{
"username": "joe",
"name": "account2",
"date": "2013-08-07",
"desc": "transaction2",
"amount": "123.45"
},
{
"username": "joe",
"name": "account2",
"date": "2013-08-08",
"desc": "transaction3",
"amount": "123.45"
}
In a NoSQL database like MongoDB you shouldn't be afraid to denormalise. As you noticed, I haven't even bothered with a separate collection for users. If your users have more information that you will have to show with each transaction, you might want to consider including that information as well.
If you need to search on, or select by, any of those fields, then don't forget to create indexes, for example:
// look up all transactions for an account
db.transactions.ensureIndex( { username: 1, name: 1 } );
and:
// look up all transactions for "2013-08-06"
db.transactions.ensureIndex( { date: 1 } );
etc.
There are a lot of advantages to duplicate data. With a schema like above, you can have as many transactions as possible and you will never get any fragmentation as documents never change - you only add to them. This also increases write performance and also makes it a lot easier to do other queries.
Alternative
An alternative might be to store username/name in a collection and only use it's ID with the transactions:
Accounts:
{
"username": "joe",
"name": "account1",
"account_id": 42,
}
Transactions:
{
"account_id": 42,
"date": "2013-08-06",
"desc": "transaction1",
"amount": "123.45"
},
This creates smaller transaction documents, but it does mean you have to do two queries to also get user information.
Since the list of transactions will grow to perhaps 1000's within the document will the data become fragmented and slow the performance.
Almost certainly, infact I would be surprised if over a period of years transactions only reached into the thousands instead of 10's of thousand for a single account.
Added the level of fragmentation you will witness from the consistently growing document over time you could end up with serious problems, if not running out of root document space (with it being 16meg). In fact looking at the fact that you store all accounts for a person under one document I would say you run a high risk of filling up a document in the space of about 2 years.
I would reference this relationship.
I would separate the transactions to a different collections. Seems like the data and update patterns between users and transactions are quite different. If transactions are constantly added to the user and causes it to grow all the time it will be moved a lot in the mongo file. So yes, it brings performance impact (fragmentation, more IO, more work for mongo).
Also, array operation performance sometimes desegregates on big arrays in documents, so holding 1000s of object in an array might not be a good idea (depends on what you do with it).
You should consider creating indexes, using the ensureIndex() function, it should reduce the risk of performance issues.
The earlier you add these, the better you'll understand how the collection should be structured.
I haven't been using mongo too long but I haven't come across any issues(not yet anyway) of data being fragmented
Edit If you intend to use this for multi-object commits, mongo doesn't support rollbacks. You need to use the 64bit version to allow journaling and make transactions durable.
Related
I feel like the MVP of my current database needs some design changes. The number of users is growing quite fast and we are having bad performances in some requests. I also want to get rid of all the DBRef we used.
Our current model can be summarized as follow :
A company can have multiple employees (thousands)
A company can have multiple teams (hundreds)
An employee can be part of a team
A company can have multiple devices (thousands)
An employee is affected to multiple devices
Our application displays in different pages :
The company data
The users
The devices
The teams
I guess I have different options, but I'm not familiar enough with MongoDB to make the best decision.
Option 1
Do not embed and use list of ids for one to many relationships.
// Company document
{
"companyName": "ACME",
"users": [ObjectId(user1), ObjectId(user2)],
"teams": [ObjectId(team1), ObjectId(team2)],
"devices": [ObjectId(device1), ObjectId(device2)]
}
// User Document
{
"userName": "Foo",
"devices": [ObjectId(device2)]
}
// Team Document
{
"teamName": "Foo",
"users": [ObjectId(user1)]
}
// Device Document
{
"deviceName": "Foo"
}
Option 2
Embed data and duplicate informations.
// User Document
{
"companyName": "ACME",
"userName": "Foo",
"team": {
"teamName": "Foo"
},
"device": {
"deviceName": "Foo"
}
}
// Team Document
{
"teamName": "Foo"
"companyName": "ACME",
"users": [
{
"userName": "Foo"
}
]
}
// Device Document
{
"deviceName": "Foo",
"companyName": "ACME",
"user": {
"userName": "Foo"
}
}
Option 3
Do not embed and use id for one to one relationship.
// Company document
{
"companyName": "ACME"
}
// User Document
{
"userName": "Foo",
"company": ObjectId(company),
"team": ObjectId(team1)
}
// Team Document
{
"teamName": "Foo",
"company": ObjectId(company)
}
// Device Document
{
"deviceName": "Foo",
"company": ObjectId(company),
"user": ObjectId(user1)
}
MongoDB recommends to embed data as much as possible but I don't think it can be possible to embed all data in the company document. A company can have multiple devices or users and I believe it can grow too big.
I'm switching from SQL to NoSQL and I think I haven't figured it out by myself yet !
Thanks !
MongodB provides you with a feature which is handling unstructured data.
Every database can contain collection which in turn can contain documents.
Moreover, you cannot use joins in mongodB. So, storing information in one company model is a better choice because you wont be needed join in that scenario.
One more thing, You dont need to embed all the models For example : You can get user and device both from company table, so why embedding users and device as well?
New to MongoDB and databases in general. I'm trying to make a basic property app with Express and MongoDB for practice.
I'm looking for some help on the best way to scheme this out.
Basically, my app would have landlords and tenants. Each landlord would have a bunch of properties that information is stored about. Things like lease terms, tenant name, maintenance requests, images, etc.
The tenants would be able to sign up and be associated with the property they live in. They could submit maintenance forms, etc.
Is this a good approach? Should everything be kept in the same collection? Thanks.
{
"_id": "507f1f77bcf86cd799439011",
"user": "Corey",
"password": "hashed#PASSWORD",
"email": "corey#email.com",
"role": "landlord",
"properties": [
{
"addressId": "1",
"address": "101 Main Street",
"tenant": "John Smith",
"leaseDate": "04/21/2016",
"notes": "These are my notes about this property.",
"images": [ "http://www.imagelink.com/image1", "http://www.imagelink.com/image2", "http://www.imagelink.com/image3"]
},
{
"addressId": "2",
"address": "105 Maple Street",
"tenant": "John Jones",
"leaseDate": "01/01/2018",
"notes": "These are my notes about 105 Maple Ave property.",
"images": ["http://www.imagelink.com/image1", "http://www.imagelink.com/image2", "http://www.imagelink.com/image3"],
"forms": [
{
"formType": "lease",
"leaseTerm": "12 months",
"leaseName": "John Jones",
"leaseDate": "01/01/2018"
},
{
"formtype": "maintenance",
"maintenanceNotes": "Need furnace looked at. Doesn't heat properly.",
"maintenanceName": "John Jones",
"maintenanceDate": "01/04/2018",
"status": "resolved"
},
]
},
{
"addressId": "3",
"address": "110 Chestnut Street",
"tenant": "John Brown",
"leaseDate": "07/28/2014",
"notes": "These are some notes about 110 Chestnut Ave property.",
"images": [ "http://www.imagelink.com/image1", "http://www.imagelink.com/image2", "http://www.imagelink.com/image3"]
}
]
}
{
"_id": "507f1f77bcf86cd799439012",
"user": "John",
"password": "hashed#PASSWORD",
"email": "john#email.com",
"role": "tenant",
"address": "2",
"images": [ "http://www.imagelink.com/image1", "http://www.imagelink.com/image2" ]
}
For this relation I'd suggest three collections (Landlords, Properties, and Tenants), with each tenant having a "landLordId" and "propertyId".
This "landLordId" would simply be the ObjectId of the landLord, and same for the property Id.
This will make your life easier if you plan to do any kind of roll-up reports or if the you have more than one-to-one mappings for landlords to properties or landlords to tenants. (Example, more than one property manager for a given property)
This just makes everything easier/more intuitive as you could simply add things like maintenance requests, lease terms etc in arrays on the tenants with references to whatever need be.
This offers the most flexibility in terms of being able to aggregate easily for any kind of report/query.
Let's assume we have the following collections:
Users
{
"id": MongoId,
"username": "jsloth",
"first_name": "John",
"last_name": "Sloth",
"display_name": "John Sloth"
}
Places
{
"id": MongoId,
"name": "Conference Room",
"description": "Some longer description of this place"
}
Meetings
{
"id": MongoId,
"name": "Very important meeting",
"place": <?>,
"timestamp": "1506493396",
"created_by": <?>
}
Later on, we want to return (e.g. from REST webservice) list of upcoming events like this:
[
{
"id": MongoId(Meetings),
"name": "Very important meeting",
"created_by": {
"id": MongoId(Users),
"display_name": "John Sloth",
},
"place": {
"id": MongoId(Places),
"name": "Conference Room",
}
},
...
]
It's important to return basic information that need to be displayed on the main page in web ui (so no additional calls are needed to render the table). That's why, each entry contains display_name of the user who created it and name of the place. I think that's a pretty common scenario.
Now my question is: how should I store this information in db (question mark values in Metting document)? I see 2 options:
1) Store references to other collections:
place: MongoId(Places)
(+) data is always consistent
(-) additional calls to db have to be made in order to construct the response
2) Denormalize data:
"place": {
"id": MongoId(Places),
"name": "Conference room",
}
(+) no need for additional calls (response can be constructed based on one document)
(-) data must be updated each time related documents are modified
What is the proper way of dealing with such scenario?
If I use option 1), how should I query other documents? Asking about each related document separately seems like an overkill. How about getting last 20 meetings, aggregate the list of related documents and then perform a query like db.users.find({_id: { $in: <id list> }})?
If I go for option 2), how should I keep the data in sync?
Thanks in advance for any advice!
You can keep the DB model you already have and still only do a single query as MongoDB introduced the $lookup aggregation in version 3.2. It is similar to join in RDBMS.
$lookup
Performs a left outer join to an unsharded collection in the same database to filter in documents from the “joined” collection for processing. The $lookup stage does an equality match between a field from the input documents with a field from the documents of the “joined” collection.
So instead of storing a reference to other collections, just store the document ID.
I have a question about designing my application schema. My application handles objects, and status for these objects.
Object and status are quite small object, and in term of volume, it handles few millions of objects, each of these objects can have hundreds to ten of thousands status.
The main write operation of application is adding status (multiple times per day per object).
Read operations are various (list of object, object detail, many aggreagation framework queries to get stats about objects and status).
I try both approaches, it works pretty well for couple of hundred of obejcts with 1000-10000 status but can't figure out which is the more scalable.
Embedded objects :
Objects:
[
{
"id": "0000000001",
"name": "Resource 1",
"description": "Resource 1",
"owner": "John Doe",
"created_at": "2000-01-01T00:00:00.000Z",
"updated_at": "2000-01-01T00:00:00.000Z",
"status": [
{
"id": "0000000001",
"position": [0, 0],
"comment": "comment 1",
"owner": "John Doe",
"created_at": "2000-01-01T00:00:00.000Z",
"updated_at": "2000-01-01T00:00:00.000Z"
},
{
"id": "0000000002",
"position": [0, 0],
"comment": "comment 2",
"owner": "John Doe",
"created_at": "2000-01-01T00:00:00.000Z",
"updated_at": "2000-01-01T00:00:00.000Z"
}
]
}
]
Referenced objects :
Objects:
[
{
"id": "0000000001",
"name": "Resource 1",
"description": "Resource 1",
"owner": "John Doe",
"created_at": "2000-01-01T00:00:00.000Z",
"updated_at": "2000-01-01T00:00:00.000Z",
}
]
Status
[
{
"id": "0000000001",
"object_id": "0000000001",
"position": [0, 0],
"comment": "comment 1",
"owner": "John Doe",
"created_at": "2000-01-01T00:00:00.000Z",
"updated_at": "2000-01-01T00:00:00.000Z"
},
{
"id": "0000000002",
"object_id": "0000000001",
"position": [0, 0],
"comment": "comment 2",
"owner": "John Doe",
"created_at": "2000-01-01T00:00:00.000Z",
"updated_at": "2000-01-01T00:00:00.000Z"
}
]
Best regards,
Mickael
Embedded model is preferable If you want to get maximum performance and your single document is smaller then 16 megabytes (maximum BSON document size)
Referenced model in preferable when you need maximum scalability or some complex relationships.
So, if you sure that you can fit to the document size limit, you can use embedded model. But if you want guaranteed scalability - use referenced.
Here is some quotes from Mongo documentation:
In general, use embedded data models when:
you have “contains” relationships between entities
you have one-to-many relationships between entities. In these relationships the “many” or child documents always appear with or are
viewed in the context of the “one” or parent documents.
Reference to Model One-to-Many Relationships with Embedded Documents
In general, embedding provides better performance for read operations,
as well as the ability to request and retrieve related data in a
single database operation. Embedded data models make it possible to
update related data in a single atomic write operation.
In general, use normalized data models:
when embedding would result in duplication of data but would not provide sufficient read performance advantages to outweigh the
implications of the duplication.
to represent more complex many-to-many relationships.
to model large hierarchical data sets.
References provides more flexibility than embedding. However,
client-side applications must issue follow-up queries to resolve the
references. In other words, normalized data models can require more
roundtrips to the server.
Reference to Model One-to-Many Relationships with Document References
Is this example of nesting generally accepted as good or bad practice (and why)?
A collection called users:
user
basic
name : value
url : value
contact
email
primary : value
secondary : value
address
en-gb
address : value
city : value
state : value
postalcode : value
country : value
es
address : value
city : value
state : value
postalcode : value
country : value
Edit: From the answers in this post I've updated the schema applying the following rules (the data is slightly different from above):
Nest, but only one level deep
Remove unneccesary keys
Make use of arrays to make objects more flexible
{
"_id": ObjectId("4d67965255541fa164000001"),
"name": {
"0": {
"name": "Joe Bloggs",
"il8n": "en"
}
},
"type": "musician",
"url": {
"0": {
"name": "joebloggs",
"il8n": "en"
}
},
"tags": {
"0": {
"name": "guitar",
"points": 3,
"il8n": "en"
}
},
"email": {
"0": {
"address": "joe.bloggs#example.com",
"name": "default",
"primary": 1,
"il8n": "en"
}
},
"updates": {
"0": {
"type": "news",
"il8n": "en"
}
},
"address": {
"0": {
"address": "1 Some street",
"city": "Somecity",
"state": "Somestate",
"postalcode": "SOM STR",
"country": "UK",
"lat": 49.4257641,
"lng": -0.0698241,
"primary": 1,
"il8n": "en"
}
},
"phone": {
"0": {
"number": "+44 (0)123 4567 890",
"name": "Home",
"primary": 1,
"il8n": "en"
},
"1": {
"number": "+44 (0)098 7654 321",
"name": "Mobile",
"il8n": "en"
}
}
}
Thanks!
In my opinion above schema not 'generally accepted', but looks like great. But i suggest some improvements thats will help you to query on your document in future:
User
Name
Url
Emails {email, emailType(primary, secondary)}
Addresses{address, city, state, postalcode, country, language}
Nesting is always good, but two or three level nesting deep can create additional troubles in quering/updating.
Hope my suggestions will help you make right choice of schema design.
You may want to take a look at schema design in MongoDB, and specifically the advice on embedding vs. references.
Embedding is preferred as "Data is then colocated on disk; client-server turnarounds to the database are eliminated". If the parent object is in RAM, then access to the nested objects will always be fast.
In my experience, I've never found any "best practices" for what a MongoDB record actually looks like. The question to really answer is, "Does this MongoDB schema allow me to do what I need to do?"
For example, if you had a list of addresses and needed to update one of them, it'd be a pain since you'd need to iterate through all of them or know which position a particular address was located. You're safe from that since there is a key-value for each address.
However, I'd say nix the basic and contact keys. What do these really give you? If you index name, it'd be basic.name rather than just name. AFAIK, there are some performance impacts to long vs. short key names.
Keep it simple enough to do what you need to do. Try something out and iterate on it...you won't get it right the first time, but the nice thing about mongo is that it's relatively easy to rework your schema as you go.
That is acceptable practice. There are some problems with nesting an array inside of an array. See SERVER-831 for one example. However, you don't seem to be using arrays in your collection at all.
Conversely, if you were to break this up into multiple collections, you would have to deal with a lack of transactions and the resulting race conditions in your data access code.