I have a MySQL database to support a multilingual website where the data is represented as follows:
table1
    id
    is_active
    created

table1_lang
    table1_id
    name
    surname
    address
What's the best way to achieve the same in MongoDB?
You can design a schema where you either reference or embed documents. Let's look at the first option: embedded documents. With your application above, you might store the information in a document as follows:
// db.table1 schema
{
    "_id": 3, // table1_id
    "is_active": true,
    "created": ISODate("2015-04-07T16:00:30.798Z"),
    "lang": [
        {
            "name": "foo",
            "surname": "bar",
            "address": "xxx"
        },
        {
            "name": "abc",
            "surname": "def",
            "address": "xyz"
        }
    ]
}
In the example schema above, you have essentially embedded the table1_lang information within the main table1 document. This design has its merits, one of them being data locality: since MongoDB stores data contiguously on disk, putting all the data you need in one document means spinning disks will take less time to seek to a particular location. If your application frequently accesses table1 information along with the table1_lang data, then you'll almost certainly want to go the embedded route. The other advantage of embedded documents is atomicity and isolation when writing data. To illustrate, say you want to remove a document whose lang array contains a "name" with value "foo"; this can be done with a single (atomic) operation:
db.table1.remove({"lang.name": "foo"});
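The same single-operation guarantee applies to updates. As a minimal sketch using the schema above, the positional $ operator modifies one embedded translation atomically:

db.table1.update(
    { "_id": 3, "lang.name": "foo" },
    { $set: { "lang.$.address": "yyy" } } // $ targets the matched array element
);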
For more details on data modelling in MongoDB, please read the docs: Data Modeling Introduction, specifically Model One-to-Many Relationships with Embedded Documents.
The other design option is referencing documents, where you follow a normalized schema. For example:
// db.table1 schema
{
    "_id": 3,
    "is_active": true,
    "created": ISODate("2015-04-07T16:00:30.798Z")
}

// db.table1_lang schema
/* 1 */
{
    "_id": 1,
    "table1_id": 3,
    "name": "foo",
    "surname": "bar",
    "address": "xxx"
}

/* 2 */
{
    "_id": 2,
    "table1_id": 3,
    "name": "abc",
    "surname": "def",
    "address": "xyz"
}
The above approach gives increased flexibility in performing queries. For instance, retrieving all child table1_lang documents for the parent table1 entity with id 3 is straightforward: simply query the table1_lang collection:
db.table1_lang.find({"table1_id": 3});
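Note that with referencing, reading a parent together with its translations takes two round trips, i.e. an application-level join. A minimal sketch from the mongo shell, using the collections above:

var parent = db.table1.findOne({ "_id": 3 });                              // fetch the parent
var langs  = db.table1_lang.find({ "table1_id": parent._id }).toArray();   // second query for the children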
The above normalized schema using the document-reference approach also has an advantage when you have one-to-many relationships with very unpredictable arity. If you have hundreds or thousands of table1_lang documents per table1 entity, embedding has serious drawbacks as far as space constraints are concerned: the larger the document, the more RAM it uses, and MongoDB documents have a hard size limit of 16MB.
The general rule of thumb is that if your application's query pattern is well known and data tends to be accessed in only one way, an embedded approach works well. If your application queries data in many ways, or you are unable to anticipate the query patterns, a more normalized document-referencing model will be appropriate.
Ref:
MongoDB Applied Design Patterns: Practical Use Cases with the Leading NoSQL Database, by Rick Copeland
I am undertaking NoSQL document design for the tables below, where I am trying to do data modelling:
TaxType
LocationType
Location
TaxAssign
I have the above tables in a SQL Server relational database, where we maintain tax assignments for items and scan codes. Below is the sample data for the TaxType and Location tables (Location is a self-referential table with FK_LocationId):
Below is the table for TaxAssignments:
I am trying to convert the above SQL tables into a NoSQL document DB. There is a one-to-many relationship between TaxType and TaxAssign, and between Location and TaxAssign. Most of the time, queries against the above table are based on (FK_RetailItemCode or FK_TaxTypeCode), or on (ScanCode or FK_TaxTypeCode).
I want to design the JSON document, but it's been very hard for me to pick the partition key. ItemCode and ScanCode are queried a lot, but they are optional fields, so I cannot include them as part of the partition key. I therefore picked UniqueIdentifier as the partition key, to spread the data across multiple logical partitions.
Did I pick the right key? When I query, I don't query by UniqueIdentifier but by ItemCode or ScanCode, with TaxType optional.
Below are my JSON documents. Are there any modifications or changes required in the design, or should I take a different approach?
{
    "UniqueIdentifier": "1999-10-20-07.55.05.090087",
    "EffectiveDate": "1999-10-20",
    "TerminationDate": "9999-12-31",
    "LocationId": 1,
    "FK_RetailItemCode": 852874,
    "FK_TaxTypeCode": 1,
    "TaxType": [
        {
            "TaxTypeCode": 1,
            "TaxArea": "STATE ",
            "Description": "SALES TAX ",
            "IsPOSTaxEnabled": "Y",
            "POSTaxField": 1,
            "TaxOrder": 0,
            "IsCityLimit": " ",
            "IsTaxFree": false,
            "Location": [
                {
                    "LocationId": 1,
                    "LocationType": "ST",
                    "City": " ",
                    "County": " ",
                    "Country": "USA ",
                    "CountyCode": 0,
                    "State": "ALABAMA ",
                    "StateShortName": "AL",
                    "SortSequence": 40
                }
            ]
        }
    ]
},
{
    "UniqueIdentifier": "2019-06-13-08.51.48.004124",
    "EffectiveDate": "2019-06-13",
    "TerminationDate": "2019-08-05",
    "LocationId": 13531,
    "FK_RetailItemCode": 852784,
    "FK_TaxTypeCode": 16,
    "TaxType": [
        {
            "TaxTypeCode": 16,
            "TaxArea": "CITY ",
            "Description": "HOSPTLY TAX OUT CITY LIM ",
            "IsPOSTaxEnabled": "Y",
            "POSTaxField": 2,
            "TaxOrder": 1,
            "IsCityLimit": "N",
            "IsTaxFree": false,
            "Location": [
                {
                    "LocationId": 13531,
                    "LocationType": "CI",
                    "City": "FOLEY ",
                    "County": "BALDWIN ",
                    "Country": "USA ",
                    "CountyCode": 2,
                    "State": "ALABAMA ",
                    "StateShortName": "AL",
                    "FK_LocationId": 13510
                }
            ]
        }
    ]
}
TaxAssignment is a huge table with 6 million rows, so I want to spread the data as much as I can. I picked UniqueIdentifier as the partition key because the columns that are queried most often, ItemCode and ScanCode, are optional (nullable), so I couldn't pick them.
Questions:
1. As I have one-to-many relationships, can I embed Location and TaxType inside each TaxAssignment?
2. Is it OK to pick UniqueIdentifier as the partition key even though the partition key is never used to query the collection?
3. Should I denormalize the whole JSON with TaxType and Location instead of embedding them inside each tax assignment?
4. For any changes to TaxType and Location metadata, I might need to make changes in a lot of places. What design approaches can I use here?
TaxType --> 19 records
Location --> 38,000 records
TaxAssign --> 6 million records
It's rather difficult to answer questions like this, and I don't think I can fully answer yours here, but I can provide some advice and point you to resources to help you figure this out yourself.
For 1:few relationships, you will typically want to embed. TaxType and LocationType sound like good candidates to embed. However, you will want to maintain them in their own container and use Change Feed to update the copies in your TaxAssignments container.
For 1:many relationships, especially if they are unbounded, you will typically want to reference. I'm not sure whether Locations is unbounded in your app; if it isn't, then embed it too.
You typically never want to pick a partition key that is never used in a query. If you do, then EVERY query will be cross-partition, so the larger your container gets, the worse your app will perform. Basically, it will not scale.
If some queries filter on one property and other queries filter on another, one option is to use Change Feed to keep duplicate copies of the data, each optimized for its queries. However, you will want to measure the cost of these queries, multiply it by the number of times a month each query is run, then calculate the cost of using Change Feed. Change Feed consumes 2 RU/s each second to poll your container, then 1 RU for each 1 KB or less of data read, and ~8 RUs for each insert into the target container, depending on your index policy. Multiply that by the number of new/updated records per month and you have a decent estimate.
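To make that concrete, here's a rough worked version of the estimate as shell arithmetic (all volumes are hypothetical, not taken from your workload):

// Hypothetical: 1 million new/updated records (<= 1 KB each) per month.
var pollRUs  = 2 * 60 * 60 * 24 * 30;  // 2 RU/s polling, ~5.18M RUs per month
var readRUs  = 1 * 1000000;            // 1 RU per <= 1 KB read
var writeRUs = 8 * 1000000;            // ~8 RUs per insert into the target container
var totalRUs = pollRUs + readRUs + writeRUs;  // ~14.2M RUs per month for this scenario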
If you want more details on how to design this type of database, feel free to check out these links:
Modeling and partitioning session from Ignite 2019
GitHub Repo that shows the demos run for that session but in .NET
Doc on modeling and partitioning
I have a mongo database with a few collections, such as users in the system (id, name, email) and a list of projects (id, name, list of users who have access):
User
{
    "_id": 1,
    "name": "John",
    "email": "john@domain.com"
}
{
    "_id": 2,
    "name": "Sam",
    "email": "sam@domain.com"
}

Project
{
    "_id": 1,
    "name": "My Project1",
    "users": [1, 2]
}
{
    "_id": 2,
    "name": "My Project2",
    "users": [2]
}
In my dashboard, I display a list of projects and the names of their users. To support names, I've changed the "users" field to also include the name:
{
    "_id": 2,
    "name": "My Project2",
    "users": [{ "_id": 2, "name": "Sam" }]
}
But on several pages, I now need to also print their email address and later on - maybe also display their image.
Since I don't want to start embedding the entire User document in each project, I'm looking for a way to do a LEFT JOIN and pick the values I need from the User collection.
Performance is NOT so important on those pages, and I'd rather have an easy way to manage my data. So basically, I'm looking for a way to query for a list of all projects and their associated users, with selected fields from the original User document.
I've read about mongo's map-reduce and aggregation options, and to be honest I'm not sure which to use or how to achieve what I'm looking for.
MongoDB doesn't support joins in any form, even using MapReduce or the Aggregation Framework. The only way to implement a join between collections is in your application code. So just implement the LEFT JOIN logic in your code.
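For example, a minimal sketch of that join-in-code approach from the mongo shell, assuming the User and Project collections shown above (with "users" as an array of ids) and picking only the fields the dashboard needs:

db.Project.find().forEach(function (project) {
    // Replace each user id with a trimmed-down User document (a LEFT JOIN by hand).
    project.users = project.users.map(function (userId) {
        return db.User.findOne({ "_id": userId }, { "name": 1, "email": 1 });
    });
    printjson(project);
});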
We want to design a scalable database. If we have N users with 1 billion user responses, which of the 2 options below is the better design? We want to query by userID as well as by response ID.
Having 2 collections: one for the user information and another to store the responses along with the user ID. Each response is stored as its own document, so we will have 1 billion documents.
User Collection
{
    "userid": "userid1",
    "password": "xyz",
    ...
    "City": "New York"
},
{
    "userid": "userid2",
    "password": "abc",
    ...
    "City": "New York"
}
responses Collection
{
    "userid": "userid1",
    "responseID": "responseID1",
    "response": "xyz"
},
{
    "userid": "userid1",
    "responseID": "responseID2",
    "response": "abc"
},
{
    "userid": "userid2",
    "responseID": "responseID3",
    "response": "mno"
}
Having 1 collection to store both kinds of information, as below. Each response is represented by a new key (responseIDX).
{
    "userid": "userid1",
    "responseID1": "xyz",
    "responseID2": "abc",
    ...
    "responseIDN": "mno",
    "city": "New York"
}
If you use your first option, I'd use a relational database (like MySQL) as opposed to MongoDB. If you're set on MongoDB, use it to your advantage:
{
    "userId": n,
    "city": "foo",
    "responses": {
        "responseId1": "response message 1",
        "responseId2": "response message 2"
    }
}
As for which would render a better performance, run a few benchmark tests.
Between the two options you've listed, I would think using a separate collection would scale better - or possibly a combination of a separate collection and still using embedded documents.
Embedded documents can be a boon to your schema design - but they do not work as well when you have an endlessly growing set of embedded documents (responses, in your case). This is because of document growth: as a document grows and outgrows the space allocated for it on disk, MongoDB must move it to a new location to accommodate the new size. That can be expensive and carry severe performance penalties when it happens often or in high-concurrency environments.
Also, querying on those embedded documents can become troublesome when you want to selectively return only a subset of responses, especially across users, since you cannot return only the matching embedded documents. Using the positional operator, it is possible to get the first matching embedded document, however.
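To illustrate that limitation, a hedged sketch, assuming the responses were embedded as an array of subdocuments (the positional operator only works on arrays, not the keyed-object layout shown earlier):

db.users.find(
    { "responses.responseID": "responseID2" },
    { "responses.$": 1 }  // positional projection: only the FIRST matching element is returned
);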
So, I would recommend using a separate collection for the responses.
Though, as mentioned above, I would also suggest experimenting with other ways to group those responses in that collection. A document per day, per user, per ...whatever other dimensions you might have, etc.
Group them in ways that allow multiple embedded documents and complement how you will query for them. If you can find the sweet spot between still using embedded documents in that collection and minimizing document growth, you'll have fewer overall documents and smaller index sizes. Obviously this requires benchmarking and testing, as the same caveats listed above can apply.
Lastly (and optionally), with that type of data set, consider using increment counters where you can on the front end to supply any aggregated reporting you might need down the road. Though the Aggregation Framework in MongoDB is great, having, say, the total response count for a user pre-aggregated is far more convenient than trying to get the count by running an aggregate query on the full dataset.
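A minimal sketch of such a pre-aggregated counter (the userStats collection name is an assumption): bump it whenever a response is written, and reads become a single document lookup:

db.userStats.update(
    { "_id": "userid1" },
    { $inc: { "responseCount": 1 } },  // atomic increment alongside each new response
    { upsert: true }                   // create the stats document on the user's first response
);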
I am designing my first MongoDB (and first NoSQL) database and would like to store information about files in a collection. As part of each file document, I would like to store a log of file accesses (both reads and writes).
I was considering creating an array of log messages as part of the document:
{
    "filename": "some_file_name",
    "logs": [
        { "timestamp": "2012-08-27 11:40:45", "user": "joe", "access": "read" },
        { "timestamp": "2012-08-27 11:41:01", "user": "mary", "access": "write" },
        { "timestamp": "2012-08-27 11:43:23", "user": "joe", "access": "read" }
    ]
}
Each log message will contain a timestamp, the type of access, and the username of the person accessing the file. I figured that this would allow very quick access to the logs for a particular file, probably the most common operation that will be performed with the logs.
I know that MongoDB has a 16 MB document size limit. I imagine that files that are accessed very frequently could push against this limit.
Is there a better way to design the NoSQL schema for this type of logging?
Let's first try to calculate the average size of one log record: the field name "timestamp" takes ~18 bytes and its value ~8 bytes; "user" takes ~8 bytes with a value of ~20 bytes (assuming ~10 characters as a maximum, or at least the average); "access" takes ~12 bytes with a value of ~10 bytes. The total is ~76 bytes per record, so you can fit ~220,000 log records in one document.
Note that half of the physical space is used by field names. If you rename timestamp to t, user to u, and access to a, you will be able to store ~440,000 log items.
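For example, one log entry with the shortened names suggested above would look like this (same data as the first entry in the question's schema):

{ "t": "2012-08-27 11:40:45", "u": "joe", "a": "read" }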
So I think this is enough for most systems. In my projects I always try to embed rather than create a separate collection, because it is a way to achieve good performance with MongoDB.
In the future you can move your log records into a separate collection. Also, for performance, you can keep the last ~30 log records (simply denormalize them) in the file document for fast retrieval, in addition to the logs collection.
Also, if you go with one collection, make sure that you are not loading the logs when you don't need them (you can include/exclude fields in MongoDB), and use $slice to do paging.
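As a hedged sketch of both tricks (the collection name "files" is an assumption), exclude the logs when you don't need them, and page with $slice when you do:

db.files.find({ "filename": "some_file_name" }, { "logs": 0 });                     // skip logs entirely
db.files.find({ "filename": "some_file_name" }, { "logs": { $slice: [20, 10] } });  // skip 20, return 10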
And one last thing: Enjoy mongo!
If you think the document limit will become an issue, there are a few alternatives.
The obvious one is to simply create a new document for each log.
So you will have a collection "logs" with this schema:
{
    "filename": "some_file_name",
    "timestamp": "2012-08-27 11:40:45",
    "user": "joe",
    "access": "read"
}
A query to find which files "joe" read would be something like the following:
db.logs.find({user: "joe", access: "read"})
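If that query is frequent, a compound index on those fields will keep it from scanning the whole collection (a sketch; the key order is a judgment call based on your query mix):

db.logs.createIndex({ "user": 1, "access": 1, "timestamp": -1 });  // supports the query above, most recent first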
I want to implement Spring Security with MongoDB.
How can I define a custom User schema?
One of the greatest strengths of MongoDB is that it is schemaless, i.e. you are not forced to use some predefined set of columns. Another MongoDB characteristic is its lack of JOINs.
These two statements mean that you may construct any schema you want, but you should try to keep all required info in one collection. For example, I used a schema like this in one of my applications:
{
    "_id": "student_001",
    "password": "65c20e5a89d6b13df450b50576e2edfb",
    "firstName": "A",
    "lastName": "B",
    "secondName": "C",
    "email": "a@b.c",
    "role": "STUDENT",
    "active": true,
    "paid": 0.6
}
You can use _id for any unique field of any type (not only ObjectId); I use it for logins. You just need to back the basic org.springframework.security.core.userdetails.UserDetails getters with data from this schema, and you can also add extra fields to the implementing class.