NoSQL database design with one-to-many relationships and picking a partition key

I am working on NoSQL document design for the following tables, which I am trying to model:
TaxType
LocationType
Location
TaxAssign
These tables live in a SQL Server relational database where we maintain tax assignments for items and scan codes. Below is sample data for the TaxType and Location tables (Location is a self-referential table via Fk_LocationId):
Below is the table for TaxAssignments:
I am trying to convert the SQL tables above into a NoSQL document DB. There are one-to-many relationships between TaxType and TaxAssign, and between Location and TaxAssign. Most of the time the queries against this table are based on (FK_RetailItemCode or FK_TaxTypeCode), or on (ScanCode or FK_TaxTypeCode).
I want to design the JSON document, but it has been very hard for me to pick the partition key. ItemCode and ScanCode are queried a lot, but they are optional fields, so I cannot include them in the partition key. I therefore picked UniqueIdentifier as the partition key to spread the data across many logical partitions.
Did I pick the right key? When I query, I never query by UniqueIdentifier but by ItemCode or ScanCode, with TaxType optional.
Below are my JSON documents. Are any modifications required, or should I take a different approach to the design?
{
  "UniqueIdentifier": "1999-10-20-07.55.05.090087",
  "EffectiveDate": "1999-10-20",
  "TerminationDate": "9999-12-31",
  "LocationId": 1,
  "FK_RetailItemCode": 852874,
  "FK_TaxTypeCode": 1,
  "TaxType": [
    {
      "TaxTypeCode": 1,
      "TaxArea": "STATE ",
      "Description": "SALES TAX ",
      "IsPOSTaxEnabled": "Y",
      "POSTaxField": 1,
      "TaxOrder": 0,
      "IsCityLimit": " ",
      "IsTaxFree": false,
      "Location": [
        {
          "LocationId": 1,
          "LocationType": "ST",
          "City": " ",
          "County": " ",
          "Country": "USA ",
          "CountyCode": 0,
          "State": "ALABAMA ",
          "StateShortName": "AL",
          "SortSequence": 40
        }
      ]
    }
  ]
},
{
  "UniqueIdentifier": "2019-06-13-08.51.48.004124",
  "EffectiveDate": "2019-06-13",
  "TerminationDate": "2019-08-05",
  "LocationId": 13531,
  "FK_RetailItemCode": 852784,
  "FK_TaxTypeCode": 16,
  "TaxType": [
    {
      "TaxTypeCode": 16,
      "TaxArea": "CITY ",
      "Description": "HOSPTLY TAX OUT CITY LIM ",
      "IsPOSTaxEnabled": "Y",
      "POSTaxField": 2,
      "TaxOrder": 1,
      "IsCityLimit": "N",
      "IsTaxFree": false,
      "Location": [
        {
          "LocationId": 13531,
          "LocationType": "CI",
          "City": "FOLEY ",
          "County": "BALDWIN ",
          "Country": "USA ",
          "CountyCode": 2,
          "State": "ALABAMA ",
          "StateShortName": "AL",
          "FK_LocationId": 13510
        }
      ]
    }
  ]
}
TaxAssignment is a huge table with 6 million rows, so I want to spread the data out as much as I can, which is why I picked UniqueIdentifier as the partition key. I couldn't pick the columns that are queried most often, ItemCode and ScanCode, because they are optional (nullable).
Questions:
As I have one-to-many relationships, can I embed Location and TaxType inside each TaxAssignment?
Is it OK to pick UniqueIdentifier as the partition key even though the partition key is never used to query against the collection?
Should I denormalize the whole JSON with TaxType and Location instead of embedding them inside each tax assignment?
For any change to TaxType and Location metadata, I might need to make changes in a lot of places. What design approaches can I use here?
TaxType --> number of records is 19
Location --> number of records is 38000
TaxAssign --> number of records is 6 million.

It's rather difficult to answer questions like this and I don't think I can fully answer yours here but I can provide some advice and also point you to resources to help you figure this out yourself.
For a 1:few relationship you will typically want to embed. TaxType and LocationType sound like good candidates to embed. However, you will want to maintain these in their own container and use Change Feed to update them in your TaxAssignments container.
For a 1:many relationship, especially if it is unbounded, you will typically want to reference instead. I'm not sure whether Locations is unbounded in your app; if it isn't, then embed it too.
You typically never want to pick a partition key that is never used in a query. If you do, then EVERY query will be cross-partition, so the larger your container gets, the worse your app will perform. Basically, it will not scale.
If some of your queries filter on one property and others filter on another, one option is to use Change Feed and keep duplicate copies of the data, each optimized for one of those query patterns. However, you will want to measure the cost of those queries, multiply that by the number of times a month each query is run, then calculate the cost of using Change Feed. Change Feed consumes 2 RUs each second to poll your container, then 1 RU for each 1 KB or less of data read, and ~8 RUs for each insert into the target container, depending on your index policy. Multiply that by the number of new/updated records per month and you have a decent estimate.
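As a rough sketch of that duplicate-copy idea (the container names and the ScanCode value here are hypothetical, not taken from your schema), the same assignment is written into two containers, each partitioned on the property its queries filter by, so both lookups stay single-partition:
// hypothetical container "TaxAssignByItem", partition key /FK_RetailItemCode
{
  "id": "1999-10-20-07.55.05.090087",
  "FK_RetailItemCode": 852874,
  "FK_TaxTypeCode": 1,
  "EffectiveDate": "1999-10-20",
  "TerminationDate": "9999-12-31",
  "LocationId": 1
}
// hypothetical container "TaxAssignByScanCode", partition key /ScanCode
{
  "id": "1999-10-20-07.55.05.090087",
  "ScanCode": "0001234567890",
  "FK_TaxTypeCode": 1,
  "EffectiveDate": "1999-10-20",
  "TerminationDate": "9999-12-31",
  "LocationId": 1
}
Change Feed on whichever container you write to would then keep the second copy in sync, as described above.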
If you want more details on how to design this type of database feel free to check out these links.
Modeling and partitioning session from Ignite 2019
GitHub Repo that shows the demos run for that session but in .NET
Doc on modeling and partitioning

Related

Best way to represent multilingual database on mongodb

I have a MySQL database supporting a multilingual website where the data is represented as follows:
table1
  id
  is_active
  created
table1_lang
  table1_id
  name
  surname
  address
What's the best way to achieve the same on a MongoDB database?
You can design a schema where you either reference or embed documents. Let's look at the first option, embedded documents. With your application above, you might store the information in a document as follows:
// db.table1 schema
{
  "_id": 3, // table1_id
  "is_active": true,
  "created": ISODate("2015-04-07T16:00:30.798Z"),
  "lang": [
    {
      "name": "foo",
      "surname": "bar",
      "address": "xxx"
    },
    {
      "name": "abc",
      "surname": "def",
      "address": "xyz"
    }
  ]
}
In the example schema above, you have essentially embedded the table1_lang information within the main table1 document. This design has its merits, one of them being data locality. Since MongoDB stores data contiguously on disk, putting all the data you need in one document ensures that spinning disks take less time to seek to a particular location on disk. If your application frequently accesses table1 information along with the table1_lang data, then you'll almost certainly want to go the embedded route. The other advantage of embedded documents is atomicity and isolation when writing data. To illustrate this, say you want to remove a document which has a lang entry whose "name" is "foo"; this can be done with one single (atomic) operation:
db.table1.remove({"lang.name": "foo"});
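Reads against the embedded array are just as direct. For example, a minimal sketch (using the schema above) that finds the parent documents containing a lang entry named "foo":
db.table1.find({ "lang.name": "foo" });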
For more details on data modelling in MongoDB, please read the docs Data Modeling Introduction, specifically Model One-to-Many Relationships with Embedded Documents
The other design option is referencing documents where you follow a normalized schema. For example:
// db.table1 schema
{
  "_id": 3,
  "is_active": true,
  "created": ISODate("2015-04-07T16:00:30.798Z")
}
// db.table1_lang schema
/* 1 */
{
  "_id": 1,
  "table1_id": 3,
  "name": "foo",
  "surname": "bar",
  "address": "xxx"
}
/* 2 */
{
  "_id": 2,
  "table1_id": 3,
  "name": "abc",
  "surname": "def",
  "address": "xyz"
}
The above approach gives increased flexibility in performing queries. For instance, retrieving all child table1_lang documents for the main parent table1 entity with id 3 is straightforward; simply run a query against the table1_lang collection:
db.table1_lang.find({"table1_id": 3});
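If you go this route, an index on the foreign-key field keeps that lookup from scanning the whole collection; a minimal sketch, assuming the schema above:
db.table1_lang.createIndex({ "table1_id": 1 });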
The above normalized schema using the document-reference approach also has an advantage when you have one-to-many relationships with very unpredictable arity. If you have hundreds or thousands of table1_lang documents per table1 entity, embedding runs into space constraints: the larger the document, the more RAM it uses, and MongoDB documents have a hard size limit of 16MB.
The general rule of thumb is that if your application's query pattern is well known and the data tends to be accessed in only one way, an embedded approach works well. If your application queries the data in many ways, or you are unable to anticipate the query patterns, a more normalized document-referencing model is appropriate.
Ref:
MongoDB Applied Design Patterns: Practical Use Cases with the Leading NoSQL Database By Rick Copeland

MongoDB design for scalability

We want to design a scalable database. If we have N users with 1 billion user responses, which of the two options below would be a good design? We want to query by userID as well as responseID.
Having 2 collections: one for the user information and another to store the responses along with the user ID. Each response is stored as its own document, so we will have 1 billion documents.
User Collection
{
  "userid" : "userid1",
  "password" : "xyz",
  ...
  "City" : "New York",
},
{
  "userid" : "userid2",
  "password" : "abc",
  ...
  "City" : "New York",
}
responses Collection
{
  "userid": "userid1",
  "responseID": "responseID1",
  "response" : "xyz"
},
{
  "userid": "userid1",
  "responseID": "responseID2",
  "response" : "abc"
},
{
  "userid": "userid2",
  "responseID": "responseID3",
  "response" : "mno"
}
Having 1 collection that stores both kinds of information, as below. Each response is represented by a new key (responseIDX).
{
  "userid" : "userid1",
  "responseID1" : "xyz",
  "responseID2" : "abc",
  ...
  "responseN" : "mno",
  "city" : "New York"
}
If you use your first option, I'd use a relational database (like MySQL) as opposed to MongoDB. If you're set on MongoDB, use it to your advantage:
{
  "userId": n,
  "city": "foo",
  "responses": {
    "responseId1": "response message 1",
    "responseId2": "response message 2"
  }
}
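With that shape, adding a new response is a single update using dot notation. A minimal sketch (the collection name "users" and the response values are assumptions for illustration):
db.users.update(
  { "userId": "userid1" },
  { "$set": { "responses.responseId3": "response message 3" } }
);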
As for which would deliver better performance, run a few benchmark tests.
Between the two options you've listed, I would think using a separate collection would scale better, or possibly a combination of a separate collection while still using embedded documents.
Embedded documents can be a boon to your schema design, but they do not work as well when you have an endlessly growing set of embedded documents (responses, in your case). This is because of document growth: as a document grows and outgrows the space allocated for it on disk, MongoDB must move that document to a new location to accommodate the new size. That can be expensive and carry severe performance penalties when it happens often or in high-concurrency environments.
Also, querying those embedded documents can become troublesome when you want to selectively return only a subset of responses, especially across users; you cannot return only the matching embedded documents. Using the positional operator, it is possible to get the first matching embedded document, however.
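For example, assuming the responses were embedded as an array of subdocuments (not the exact schema from the question), the positional projection returns only the first matching response per user document; a rough sketch:
db.users.find(
  { "responses.responseID": "responseID2" },
  { "responses.$": 1 }
);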
So, I would recommend using a separate collection for the responses.
Though, as mentioned above, I would also suggest experimenting with other ways to group those responses in that collection. A document per day, per user, per ...whatever other dimensions you might have, etc.
Group them in ways that allow multiple embedded documents and complement how you would query for them. If you can find the sweet spot between still using embedded documents in that collection and minimizing document growth, you'll have fewer documents overall and smaller index sizes. Obviously this requires benchmarking and testing, as the same caveats listed above can apply.
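One hypothetical way to bucket them, say one document per user per day (the field names and values here are made up for illustration):
{
  "userid": "userid1",
  "day": "2013-03-25",
  "count": 2,
  "responses": [
    { "responseID": "responseID1", "response": "xyz" },
    { "responseID": "responseID2", "response": "abc" }
  ]
}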
Lastly (and optionally), with that type of data set, consider using increment counters where you can on the front end to supply any aggregated reporting you might need down the road. Though the Aggregation Framework in MongoDB is great, having, say, the total response count for a user pre-aggregated is far more convenient than trying to get a count by running an aggregation query over the full dataset.
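For instance, a pre-aggregated per-user counter could be bumped from the front end with a single $inc; a sketch, with the collection and field names assumed for illustration:
db.users.update(
  { "userid": "userid1" },
  { "$inc": { "responseCount": 1 } }
);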

DynamoDB Model/Keys Advice

I was hoping someone could help me understand how to best design my table(s) for DynamoDb. I'm building an application which is used to track the visits a certain user makes to another user's profile.
Currently I have a MongoDB where one entry contains the following fields:
userId
visitedProfileId
date
status
isMobile
How would this translate to DynamoDB in a way that would not be too slow? I would need to do search queries to select all items that have a certain userId, taking status and isMobile into account. What would my keys be? Can I use the limit functionality to request only the latest x entries (sorted by date)?
I really like the way DynamoDB can be used, but it seems kind of complicated to make the mental switch between a regular NoSQL database and a key-value NoSQL database.
There are a couple of ways you could do this, and it probably depends on any other querying you may want to do on this table.
Make the table's HashKey the userId, and the RangeKey can be <status>:<isMobile>:<date> (e.g. active:true:2013-03-25T04:05:06.789Z). Then you can query using BEGINS_WITH in the RangeKeyCondition (with ScanIndexForward set to false to return results in descending order, i.e. most recent first).
So let's say you wanted to find the 20 most recent rows for user ID 1234abcd that have a status of active and an isMobile of true (I'm guessing that's what you mean by "taking [them] into account"); then your query would look like:
{
  "TableName": "Users",
  "Limit": 20,
  "HashKeyValue": { "S": "1234abcd" },
  "RangeKeyCondition": {
    "ComparisonOperator": "BEGINS_WITH",
    "AttributeValueList": [{ "S": "active:true:" }]
  },
  "ScanIndexForward": false
}
Another way would be to make the HashKey <userId>:<status>:<isMobile>, and the RangeKey would just be the date. You wouldn't need a RangeKeyCondition in this case (and in the example, the HashKeyValue would be { "S": "1234abcd:active:true" }).
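Under that second layout, the same "20 most recent" query might look something like the following sketch (same legacy API shape as the example above):
{
  "TableName": "Users",
  "Limit": 20,
  "HashKeyValue": { "S": "1234abcd:active:true" },
  "ScanIndexForward": false
}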

Logging file access with MongoDB

I am designing my first MongoDB (and first NoSQL) database and would like to store information about files in a collection. As part of each file document, I would like to store a log of file accesses (both reads and writes).
I was considering creating an array of log messages as part of the document:
{
  "filename": "some_file_name",
  "logs" : [
    { "timestamp": "2012-08-27 11:40:45", "user": "joe", "access": "read" },
    { "timestamp": "2012-08-27 11:41:01", "user": "mary", "access": "write" },
    { "timestamp": "2012-08-27 11:43:23", "user": "joe", "access": "read" }
  ]
}
Each log message will contain a timestamp, the type of access, and the username of the person accessing the file. I figured that this would allow very quick access to the logs for a particular file, probably the most common operation that will be performed with the logs.
I know that MongoDB has a 16Mbyte document size limit. I imagine that files that are accessed very frequently could push against this limit.
Is there a better way to design the NoSQL schema for this type of logging?
Let's first try to calculate the average size of one log record:
timestamp field name = 18, timestamp value = 8, user field name = 8, user value = 20 (10 characters is the maximum, or at least the average, I guess), access field name = 12, access value = 10. So the total is 76 bytes, which means you can fit ~220,000 log records.
About half of the physical space is used by the field names. If you rename them timestamp = t, user = u, access = a, you will be able to store ~440,000 log items.
So I think that is enough for most systems. In my projects I always try to embed rather than create a separate collection, because it is a way to achieve good performance with MongoDB.
In the future you can move your log records into a separate collection. Also, for performance, you can keep the last 30 or so log records (simply denormalize them) in the file document for fast retrieval, in addition to the logs collection.
Also, if you go with one collection, make sure you are not loading the logs when you don't need them (you can include/exclude fields in MongoDB). Also use $slice to do paging.
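For example, to fetch a file's document with only its 30 most recent log entries, or to page through them, a minimal sketch (assuming the collection is called "files"):
// only the last 30 log entries
db.files.find({ "filename": "some_file_name" }, { "logs": { "$slice": -30 } });
// page 2, with 20 entries per page (skip 20, take 20)
db.files.find({ "filename": "some_file_name" }, { "logs": { "$slice": [20, 20] } });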
And one last thing: Enjoy mongo!
If you think the document size limit will become an issue, there are a few alternatives.
The obvious one is to simply create a new document for each log entry.
So you will have a collection "logs" with this schema:
{
  "filename": "some_file_name",
  "timestamp": "2012-08-27 11:40:45",
  "user": "joe",
  "access": "read"
}
A query to find which files "joe" read will be something like the following:
db.logs.find({user: "joe", access: "read"})
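A compound index covering that query keeps it fast as the logs collection grows; a minimal sketch:
db.logs.createIndex({ "user": 1, "access": 1 });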

Unique vote, disable revote

I'm building simple Web App where users can vote.
What is the fastest way to check whether a user has already voted? I'm interested in both relational databases and document-based databases (MongoDB, ...).
I have a few ideas, but I am sure they can be improved:
Relational databases
Create a separate table for voting:
|userid|articleid|
Before incrementing an article's vote count, check if there is a row containing both the userid and articleid. That's two queries. Is it possible to improve this with triggers? For example:
|useridarticleid| unique column
Before the vote, generate useridarticleid on the application side and try to insert it. A trigger will fire if the value is new and will increment our vote column in the article table.
Document based
This is a bit trickier. Say we have a document structured like so:
{
  "id": "123",
  "content": "something",
  "num_votes": 2,
  "votes" : [
    "userid1",
    "userid2"
  ]
}
First "query" - check if userid is in votes array. Second "query" - Increment num_votes if not.
Again two queries. So I thought we can change this but I don't know really if it will increase performance:
Insert userid in votes array. When user want to check article "count" votes in array. But I think it possible that performance will drop because if traffic is high counting every article is a bit of waste. Imagine Reddit here.
Actually, it's a lot simpler in a document database. Your document structure is perfect for it.
{
  "id": "123",
  "content": "something",
  "num_votes": 2,
  "votes" : [
    "userid1",
    "userid2"
  ]
}
db.collection.update(
  { id: "123", votes: { $ne: "userid" } },
  { $push: { "votes": "userid" }, $inc: { "num_votes": 1 } }
);
This will atomically update the record with id "123", adding userid to the list of voters and incrementing num_votes by one, but only if userid is not already in the votes list on this document.
So there is only one query and one update, and they are actually the same operation.
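If you also need to let a user withdraw a vote, the reverse works the same way in one atomic operation (a sketch, not part of the original answer): pull the userid and decrement the counter only when the userid is actually present.
db.collection.update(
  { id: "123", votes: "userid" },
  { $pull: { "votes": "userid" }, $inc: { "num_votes": -1 } }
);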
In a relational database, |userid|articleid| would be the best approach, using both fields together as a composite primary key.
In the second case, you can also consider whether to put the votes in the user document or in the article document.
Anyway, I'd suggest you really focus on creating a design where changing all these decisions later is easy.
The different ways of designing this favor things like "a lot of users on the same article at the same time" or "a lot of users on different articles", etc. Until you can see the real usage, you won't have enough information to decide which approach will work best and be fastest, so create something that you can easily adapt to whatever you learn later.
BTW: you might also consider not counting the votes synchronously. I remember an article (which I can't find) that mentioned YouTube vote counts weren't actually "accurate": they showed an estimate of the current votes and calculated the real number in a background worker thread.