How to create a Table-Like Reliable Collection - azure-service-fabric

What is the best way to create a table-like reliable collection? Can we roll out our own?
I am looking for something to store simple lists or bags for indexes and to track keys and other simple details since the cost of enumerating multi-partition dictionaries is so high. Preferably sequential rather than random access.
The obvious options:
IDictionary<Guid, List> has concurrency issues and poor performance
Try to enumerate a queue, but I doubt it will be better than a dictionary
Use an external data store
None of these seem particularly good.

The partitioning is actually there to gain performance. The trick is to shard your data in such a way that cross-partition queries aren't needed. You can also create multiple dictionaries with different aggregates of the same data. (Use transactions.)
Read more in the chapter 'plan for partitioning' here.

Of course you can roll out your own reliable collection. After all, a reliable collection is just an in-memory data structure backed by an Azure Storage object. If you want a reliable list of strings, you can implement IList<string>, and in the various methods (Add, Remove, GetEnumerator, etc.) insert code to track and persist the data structure.
Based on your content, it can be a table (if you can generate a good partition/row key), or just a blob (and you serialize/deserialize the content each time, or at checkpoints, or... it's your policy!)
I did not get why IReliableDictionary<K, V> is not good for you. Do you need to store key/value pairs, but you don't want the keys distributed across partitions for performance reasons (because a "getAll" would span multiple machines)?
Or do you just need a list of keys (like colors, or what you would keep in a HashSet)?
In any case, depending on the size of the data, you can partition it differently, using something like an IReliableDictionary keyed by int, where the int can be just one value (like "42", giving you a single partition) or a few (hash the key and then mod (%) by the number of partitions you want), so you can fetch a whole bunch of keys (from every key down to one of N sections of keys) at once.
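The hash-and-mod idea above can be sketched language-agnostically. This is a minimal Python illustration; the key format and partition counts are assumptions for the example, not Service Fabric APIs:

```python
import zlib

# Pick a stable partition number for a key via hash-then-mod.
# zlib.crc32 is used because Python's built-in hash() is salted
# per-process and therefore unsuitable for stable partitioning.
def partition_for(key: str, partition_count: int) -> int:
    return zlib.crc32(key.encode("utf-8")) % partition_count

# With a single partition every key lands in partition 0 (the
# "just one, like 42" case from the answer).
assert partition_for("customer-123", 1) == 0

# With several partitions, keys spread deterministically:
p = partition_for("customer-123", 8)
assert 0 <= p < 8
assert p == partition_for("customer-123", 8)  # stable across calls
```

The same partition number can then be used as the dictionary key, so all keys in one section can be fetched in a single read.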

Related

NoSQL Database Design with multiple indexes

I have a DynamoDB/NoSQL/MongoDB question. I am from an RDBMS background and struggling to get a design right for a NoSQL DB. If anyone can help!
I have the following objects (tables in my terms):
Organisation
Users
Courses
Units
I want the following access points, of which most are achievable:
Get/Create/Update and Delete Organisation
Get/Create/Update and Delete Users
Get/Create/Update and Delete Courses
Which I can achieve.
The issue is that the Users and Courses objects have many ways to retrieve data:
email
username
For example: List Users on course.
List users for Org.
List courses for Org.
List users in org
list users in unit
All these use secondary indexes, which I semi-understand, but I also have tertiary-ish indexes, though that is probably down to my design.
Coming from a relational methodology, I am not sure about reporting: how would it work if I wanted to search for all users across the courses who have not completed them (call it a status flag)?
From what I understand, I need indexes for everything I want to search by?
AWS DynamoDB is my preference, but another NoSQL I'll happily consider. I realise that I need more education regarding NoSQL, so please if anyone can provide good documentation and examples which help the learning process, that will be awesome.
Regards
Richard
I have watched a few Udemy videos and been Googling for many weeks (oh, and checked here, obviously).
Things to keep in mind
Partitions
In DynamoDB everything is organized into partitions that give you hash-based access to elements. This is very powerful in terms of performance, but each partition has limits, so, much like the hash function in a hash map, the partition key should try to distribute elements equally.
Single Table Design
Don't split the data into multiple tables. This makes everything harder and actually limits the capabilities of the DB. Store everything in a single table.
Keys
Keys in dynamo have to be designed around your access patterns. This is the hardest part.
You have the Partition Key (Hash Key) -> this key has to be specified exactly every time. You can't perform a query without knowing the PK. This is why placing things like timestamps into the PK is a really bad idea.
Sort (Range) keys -> these are used for querying as specified in the AWS docs.
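As a sketch of what that means in practice, here are Query parameters in the shape the low-level DynamoDB API expects (the table and key values are hypothetical): the partition key must be an exact equality, and only the sort key may carry a range condition such as begins_with.

```python
# Hypothetical single-table design: fetch all user items under one org.
# The PK is an exact match (required); the SK uses a range condition.
query_params = {
    "TableName": "AppTable",  # hypothetical table name
    "KeyConditionExpression": "PK = :pk AND begins_with(SK, :prefix)",
    "ExpressionAttributeValues": {
        ":pk": {"S": "Org#acme"},   # exact partition key, mandatory
        ":prefix": {"S": "User#"},  # sort key condition, optional
    },
}
# A real call would be: client.query(**query_params)
assert "PK = :pk" in query_params["KeyConditionExpression"]
```

Note there is no way to express "PK begins_with …" here; that is exactly why the PK must always be fully known.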
Attribute names
DB migrations are really hard in NoSQL so you have to use generic names for the attributes. They should not have any meaning.
For example "UserID" is a bad name for partition key, "PK" is a good name for partition key, same goes for all keys.
Indexes
You have two types of indexes, local and global.
Local indexes are created once when you create the table and can't (easily) be changed afterward. You can only have a few of them. They give you an extra sort key to work with. The main benefit is that they are strongly consistent.
Global indexes can be created at any time. They give you both a new partition key and a new sort key to work with, but they are eventually consistent. Go with global indexes unless you have a good reason to use local ones.
Going back to your problem, if we focus on one of the tables as an example - Users.
The user can be inserted like this (for example):
PK                 SK                    GSI1PK                GSI1SK            Attributes
Username#john123   Email#john@gmail.com  Email#john@gmail.com  Username#john123  <User Data>
This way you can query users by email and username. Keep in mind that (PK, SK) pairs have to be unique. The SK in this case is free and can be used for other access patterns (which you didn't provide).
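The layout above can be expressed as a small helper (the GSI1* names follow the answer's convention; the extra attributes are made up):

```python
# Build a user item whose base-table key is the username and whose
# GSI1 key is the email, so the same item is queryable both ways.
def user_item(username: str, email: str, **attrs) -> dict:
    return {
        "PK": f"Username#{username}",
        "SK": f"Email#{email}",
        "GSI1PK": f"Email#{email}",        # the GSI swaps the two keys...
        "GSI1SK": f"Username#{username}",  # ...enabling lookup by email
        **attrs,
    }

item = user_item("john123", "john@gmail.com", plan="free")
assert item["PK"] == "Username#john123"          # base-table lookup
assert item["GSI1PK"] == "Email#john@gmail.com"  # GSI lookup
```

Writing the item once populates both the table and the GSI; DynamoDB maintains the index copy for you.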
Another way might be to copy the data
PK                    SK                    Attributes
Username#john123      Email#john@gmail.com  <user data>
Email#john@gmail.com  Username#john123      <user data>
This way you avoid having to deal with indexes (which can sometimes be expensive), but you have to keep the user data consistent manually.
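A minimal sketch of the copy-the-data variant; because both items carry the full payload, real code would write them together (e.g. via a TransactWriteItems call) to keep them consistent:

```python
# Produce the two mirrored items: one keyed by username, one by email.
def mirrored_user_items(username: str, email: str, data: dict) -> list:
    username_key = f"Username#{username}"
    email_key = f"Email#{email}"
    return [
        {"PK": username_key, "SK": email_key, **data},
        {"PK": email_key, "SK": username_key, **data},
    ]

by_username, by_email = mirrored_user_items(
    "john123", "john@gmail.com", {"plan": "free"})
assert by_username["plan"] == by_email["plan"]  # duplicated payload
assert by_username["PK"] == by_email["SK"]      # keys are swapped
```

Both lookups are now plain GetItem/Query calls on the base table, with no index needed, at the cost of a dual write.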
Further reading
-> https://www.alexdebrie.com/posts/dynamodb-single-table/
-> my medium post

MongoDB Realm - Is there a way to share documents between partitions to avoid duplication?

Is there a way to share documents between partitions to avoid duplication?
If I have enum data Message Status (sent, sending, failed, undeliverable, etc.), I have to make a copy for each partition, instead of having all share the same statuses.
Example for clarity:
User A has access to partition 1, user B has access to partition 2. For user A to have access to the enum data, it needs to exist with partition key 1. Same thing with user B, he needs a copy of the enum documents with partition key 2, because he cannot access the existing one with partition 1. So the data ends up being duplicated.
Got an answer from another place:
"No, at this time a Realm Object (which translates into a document when stored in MongoDB) can only belong to a single partition. If you need to refer to the same piece of data from multiple partitions, these are two of the patterns you can follow:
If the data is small, you can use an EmbeddedObject (or a list of them): while this uses more space because of duplicates, it allows quicker access to the data itself without a roundtrip to the DB. If you're talking about enums, this looks like a good choice: an ObjectId would take some amount of data anyway.
Define another partition with the common data, and open a separate, specific realm on the client for it: you'll have to manually handle the relationships, though, i.e. you can store the ObjectIds in your main realm, but will need to query the secondary one for the actual objects. Probably worth only if the common data is complex enough to justify the overhead."

NoSQL best practice : should I save derivative (calculated data) as it is used in app?

I have nested objects multiple levels deep, including some original data; after some user input, calculations are made on the original dataset, and the results are kept along with the original data.
Angular also stores some other redundant data in the objects. All this extra data would be easy to reconstruct programmatically by storing only the original data set, the user inputs and some ids.
The easiest (but least economical) version is to store the data as is. This would mean approximately 2-3x larger objects, more storage and bandwidth used, etc.
The other version is to store the minimum required data and reconstruct the objects on each query.
The app is not huge (but can grow in the future) and objects aren't either (around 200 keys).
So I'm curious what is best practice to follow in general?
It's a balance between over-denormalizing and having an efficient structure in terms of space and complexity (re keeping everything in sync if you do denormalise).
Start with your user stories and the query patterns; these will dictate what information is needed in a single document.
It sounds like this is how you've done it anyway: embedded sub-docs which you make calculations on during entry. Keep the calculated values in the parent doc and make sure they're updated with the child records. Using sub-docs means you can update both the calculated values and the embedded sub-docs atomically too.
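As a sketch of that atomic parent-plus-children write, here is the shape of a single MongoDB update document (the field names are hypothetical; amounts are in integer cents to avoid float drift) that would be passed to pymongo's update_one:

```python
# One update document that both appends an embedded sub-doc and
# refreshes the parent's calculated total; a single update_one on one
# document is applied atomically by MongoDB.
new_line_item = {"sku": "A-1", "qty": 2, "priceCents": 999}

update = {
    "$push": {"lineItems": new_line_item},  # embedded child record
    "$inc": {"orderTotalCents":
             new_line_item["qty"] * new_line_item["priceCents"]},
}
# Real call (hypothetical collection and filter):
# db.orders.update_one({"_id": order_id}, update)
assert update["$inc"]["orderTotalCents"] == 1998
```

Because both operators sit in one update on one document, a reader can never observe the new line item without the refreshed total.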

Are there any advantages to using a custom _id for documents in MongoDB?

Let's say I have a collection called Articles. If I were to insert a new document into that collection without providing a value for the _id field, MongoDB will generate one for me that is specific to the machine and the time of the operation (e.g. sdf4sd89fds78hj).
However, I do have the ability to pass a value for MongoDB to use as the value of the _id key (e.g. 1).
My question is, are there any advantages to using my own custom _ids, or is it best to just let Mongo do its thing? In what scenarios would I need to assign a custom _id?
Update
For anyone else that may find this. The general idea (as I understand it) is that there's nothing wrong with assigning your own _ids, but it forces you to maintain unique values within your application layer, which is a PITA, and requires an extra query before every insert to make sure you don't accidentally duplicate a value.
Sammaye provides an excellent answer here:
Is it bad to change _id type in MongoDB to integer?
Advantages with generating your own _ids:
You can make them more human-friendly, by assigning incrementing numbers: 1, 2, 3, ...
Or you can make them more human-friendly, using random strings: t3oSKd9q
(That doesn't take up too much space on screen, could be picked out from a list, and could potentially be copied manually if needed. However you do need to make it long enough to prevent collisions.)
If you use randomly generated strings they will have an approximately even sharding distribution, unlike the standard mongo ObjectIds, which tend to group records created around the same time onto the same shard. (Whether that is helpful or not really depends on your sharding strategy.)
Or you may like to generate your own custom _ids that will group related objects onto one shard, e.g. by owner, or geographical region, or a combination. (Again, whether that is desirable or not depends on how you intend to query the data, and/or how rapidly you are producing and storing it. You can also do this by specifying a shard key, rather than the _id itself. See the discussion below.)
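A short random id like the t3oSKd9q example above can be generated from the standard library alone; a sketch (8 characters over 62 symbols is only about 48 bits, so size the length to your collision budget):

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # 62 symbols

# Generate a short, URL-friendly random id. The randomness is
# cryptographic, but collision handling is still the application's job.
def short_id(length: int = 8) -> str:
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

sample = short_id()
assert len(sample) == 8
assert all(c in ALPHABET for c in sample)
```

Passing such a string as `_id` on insert works directly; MongoDB's unique index on `_id` will reject the (rare) collision, which your code must then retry with a fresh id.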
Advantages to using ObjectIds:
ObjectIds are very good at avoiding collisions. If you generate your own _ids randomly or concurrently, then you need to manage the collision risk yourself.
ObjectIds contain their creation time within them. That can be a cheap and easy way to retain the creation date of a document, and to sort documents chronologically. (On the other hand, if you don't want to expose/leak the creation date of a document, then you must not expose its ObjectId!)
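The creation time lives in the first 4 bytes of the ObjectId (a big-endian Unix timestamp), so it can be recovered even without a driver; a sketch:

```python
from datetime import datetime, timezone

# An ObjectId's first 4 bytes (8 hex chars) are the creation time in
# seconds since the Unix epoch; drivers expose this too, e.g. as
# ObjectId.generation_time in pymongo.
def objectid_created_at(oid_hex: str) -> datetime:
    seconds = int(oid_hex[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

# A commonly used sample ObjectId from the MongoDB documentation:
created = objectid_created_at("507f1f77bcf86cd799439011")
assert created.year == 2012
```

This is also why sorting on `_id` approximates sorting by creation time for ObjectIds, and why exposing an ObjectId leaks the creation date.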
The nanoid module can help you to generate short random ids. They also provide a calculator which can help you choose a good id length, depending on how many documents/ids you are generating each hour.
Alternatively, I wrote mongoose-generate-unique-key for generating very short random ids (provided you are using the mongoose library).
Sharding strategies
Note: Sharding is only needed if you have a huge number of documents (or very heavy documents) that cannot be managed by one server. It takes quite a bit of effort to set up, so I would not recommend worrying about it until you are sure you actually need it.
I won't claim to be an expert on how best to shard data, but here are some situations we might consider:
An astronomical observatory or particle accelerator handles gigabytes of data per second. When an interesting event is detected, they may want to store a huge amount of data in only a few seconds. In this case, they probably want an even distribution of documents across the shards, so that each shard will be working equally hard to store the data, and no one shard will be overwhelmed.
You have a huge amount of data and you sometimes need to process all of it at once. In this case (but depending on the algorithm) an even distribution might again be desirable, so that all shards can work equally hard on processing their chunk of the data, before combining the results at the end. (Although in this scenario, we may be able to rely on MongoDB's balancer, rather than our shard key, for the even distribution. The balancer runs in the background after data has been stored. After collecting a lot of data, you may need to leave it to redistribute the chunks overnight.)
You have a social media app with a large amount of data, but this time many different users are making many light queries related mainly to their own data, or their specific friends or topics. In this case, it doesn't make sense to involve every shard whenever a user makes a little query. It might make sense to shard by userId (or by topic or by geographical region) so that all documents belonging to one user will be stored on one shard, and when that user makes a query, only one shard needs to do work. This should leave the other shards free to process queries for other users, so many users can be served at once.
Sharding documents by creation time (which the default ObjectIds will give you) might be desirable if you have lots of light queries looking at data for similar time periods. For example many different users querying different historical charts.
But it might not be so desirable if most of your users are querying only the most recent documents (a common situation on social media platforms) because that would mean one or two shards would be getting most of the work. Distributing by topic or perhaps by region might provide a flatter overall distribution, whilst also allowing related documents to clump together on a single shard.
You may like to read the official docs on this subject:
https://docs.mongodb.com/manual/sharding/#shard-key-strategy
https://docs.mongodb.com/manual/core/sharding-choose-a-shard-key/
I can think of one good reason to generate your own ID up front. That is for idempotency. For example so that it is possible to tell if something worked or not after a crash. This method works well when using re-try logic.
Let me explain. The reason people might consider re-try logic:
Inter-app communication can sometimes fail for different reasons (especially in a microservice architecture). The app would be more resilient and self-healing if codified to re-try rather than give up right away. This rides over odd blips that might occur without the consumer ever being affected.
For example, when dealing with Mongo, a request is sent to the DB to store some object; the DB saves it, but just as it is trying to respond to the client to say everything worked fine, there is a network blip for whatever reason and the "OK" is never received. The app assumes it didn't work, so it may end up re-trying and storing the same data twice, or worse, it just blows up.
Creating the ID up front is an easy, low overhead way to help deal with re-try logic. Of course one could think of other schemes too.
Although this sort of resiliency may be overkill in some types of projects, it really just depends.
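A sketch of the retry pattern above, with a plain dict standing in for a collection that enforces a unique _id (in MongoDB the second insert would raise DuplicateKeyError rather than return False):

```python
import uuid

store = {}  # stands in for a collection with a unique _id index

# Insert a document only if its id is new; a retry of the same write
# becomes a detectable duplicate instead of a second document.
def insert_once(doc_id: str, doc: dict) -> bool:
    if doc_id in store:
        return False  # in MongoDB: DuplicateKeyError on _id
    store[doc_id] = doc
    return True

doc_id = str(uuid.uuid4())  # generated up front, reused on every retry
assert insert_once(doc_id, {"amount": 10}) is True
assert insert_once(doc_id, {"amount": 10}) is False  # lost-OK retry: no-op
assert len(store) == 1
```

The key point is that the id is fixed before the first attempt, so the retry after a lost "OK" targets the same id instead of minting a new one.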
I have used custom ids a couple of times and it was quite useful.
In particular I had a collection where I would store stats by date, so the _id was actually a date in a specific format. I did that mostly because I would always query by date. Keep in mind that using this approach can simplify your indexes as no extra index is needed, the basic cursor is sufficient.
Sometimes the ID is something more meaningful than a randomly generated one. For example, a user collection may use the email address as the _id instead. In my project I generate IDs that are much shorter than the ones Mongodb uses so that the ID shown in the URL is much shorter.
I'll use an example: I created a property management tool and it had multiple collections. For simplicity, some fields would be duplicated, for example the payment. When I needed to update those records, it had to happen simultaneously across all the collections they appeared in, so I would assign them a custom payment id; when a delete/update action is performed, it changes all instances of it database-wide.

How to get distinct count on dynamodb on billion objects?

What is the most efficient way to get a count of how many distinct objects are stored in my DynamoDB table?
For example, my objects have ten properties and I want to get a distinct count based on 3 of those properties.
In case you need counters, it's better to use AtomicCounters (http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithDDItems.html). In your case, DynamoDB doesn't support keys composed of 3 attributes out of the box unless you concatenate them, so one option would be to create a redundant table where the key is the concatenation of those 3 attributes, and each time you add or delete one of those objects, also update the AtomicCounter (updates don't actually need to touch it).
Then you just query the counter, avoiding scans. So you trade space complexity for speed of data retrieval.
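The scheme can be simulated in a few lines (the attribute names a/b/c are placeholders; in DynamoDB the counter bump would be an update_item with an ADD expression, ideally in the same transaction as the redundant-table write):

```python
from collections import defaultdict

refcount = defaultdict(int)  # the redundant table: composite key -> count
counters = {"distinct": 0}   # the AtomicCounter item

# On every put, upsert the concatenated key and bump the distinct
# counter only when this combination is seen for the first time.
def on_put(obj: dict) -> None:
    key = f"{obj['a']}#{obj['b']}#{obj['c']}"
    refcount[key] += 1
    if refcount[key] == 1:
        counters["distinct"] += 1

for obj in [{"a": 1, "b": "x", "c": True},
            {"a": 1, "b": "x", "c": True},  # same 3-attribute combination
            {"a": 2, "b": "x", "c": True}]:
    on_put(obj)

assert counters["distinct"] == 2  # read the counter, no scan needed
```

Deletes would mirror this: decrement the refcount and, when it reaches zero, decrement the distinct counter.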
Perform a Scan with the appropriate ScanFilter (in this case, that the three properties are not_null), and use withCount(true) to return only the number of matching records instead of the records themselves.
See the documentation for some example code.
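For reference, here is the equivalent of that Java withCount(true) Scan in the low-level API shape (the property names are hypothetical). Note that Select=COUNT still reads and filters every item server-side, so on a billion objects it remains slow and expensive:

```python
# Scan parameters that count items having all three properties,
# returning only Count/ScannedCount instead of the items themselves.
scan_params = {
    "TableName": "Objects",  # hypothetical table name
    "FilterExpression": (
        "attribute_exists(propA) AND attribute_exists(propB) "
        "AND attribute_exists(propC)"
    ),
    "Select": "COUNT",
}
# Real call: total = client.scan(**scan_params)["Count"]
# (paginate via LastEvaluatedKey and sum Count across pages)
assert scan_params["Select"] == "COUNT"
```

attribute_exists is the modern expression equivalent of the legacy NOT_NULL ScanFilter operator mentioned above.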