How to organize data in document based stores? - rdbms

I think about using a NoSQL (document based) database instead a RDBMS (PostgreSQL). I'm used to RDBMS and it feels weird for me to arrange the data in a correct way. Usually I would split up the (simplified data) data like:
Some notes:
each player can have as many friends as he wants
each player can have 3-10 troops
each player can have 30-100 units
each player can have #troops * 5 troop units (same unit in different troops)
fight history stores the last x fights
player can have ~5 chests
Now I would like to know how to structure the data in a document based database system. I would split it up like:
Document Player: Player, Friend, Chest, Troop, Troop_Unit, Unit
Document Meta: Fight_History
So it will a really big document containing everything which relies to the core of the game. In addition to that there will be a document storing meta data like history data (battle history, message history, ...). Does it make sense?
I would like to have some tips organizing data in document stored systems.
Please no "Stick to PostgreSQL" or "Use what you know" tips.

Related

How to handle static data in ES/CQRS?

After reading dozens of articles and watching hours of videos, I don't seem to get an answer to a simple question:
Should static data be included in the events of the write/read models?
Let's take the oh-so-common "orders" example.
In all examples you'll likely see something like:
class OrderCreated(Event):
....
class LineAdded(Event):
itemID
itemCount
itemPrice
But in practice, you will also have lots of "static" data (products, locations, categories, vendors, etc).
For example, we have a STATIC products table, with their SKUs, description, etc. But in all examples, the STATIC data is never part of the event.
What I don't understand is this:
Command side: should the STATIC data be included in the event? If so, which data? Should the entire "product" record be included? But a product also has a category and a vendor. Should their data be in the event as well?
Query side: should the STATIC data be included in the model/view? Or can/should it be JOINED with the static table when an actual query is executed.
If static data is NOT part of the event, then the projector cannot add it to the read model, which implies that the query MUST use joins.
If static data IS part of the event, then let's say we change something in the products table (e.g. typo in the item description), this change will not be reflected in the read model.
So, what's the right approach to using static data with ES/CQRS?
Should static data be included in the events of the write/read models?
"It depends".
First thing to note is that ES/CQRS are a distraction from this question.
CQRS is simply the creation of two objects where there was previously only one. -- Greg Young
In other words, CQRS is a response to the idea that we want to make different trade offs when reading information out of a system than when writing information into the system.
Similarly, ES just means that the data model should be an append only sequence of immutable documents that describe changes of information.
Storing snapshots of your domain entities (be that a single document in a document store, or rows in a relational database, or whatever) has to solve the same problems with "static" data.
For data that is truly immutable (the ratio of a circle's circumference and diameter is the same today as it was a billion years ago), pretty much anything works.
When you are dealing with information that changes over time, you need to be aware of the fact that that the answer changes depending on when you ask it.
Consider:
Monday: we accept an order from a customer
Tuesday: we update the prices in the product catalog
Wednesday: we invoice the customer
Thursday: we update the prices in the product catalog
Friday: we print a report for this order
What price should appear in the report? Does the answer change if the revised prices went down rather than up?
Recommended reading: Helland 2015
Roughly, if you are going to need now's information later, then you need to either (a) write the information down now or (b) write down the information you'll need later to look up now's information (ex: id + timestamp).
Furthermore, in a distributed system, you'll need to think about the implications when part of the system is unavailable (ex: what happens if we are trying to invoice, but the product catalog is unavailable? can we cache the data ahead of time?)
Sometimes, this sort of thing can turn into a complete tangle until you discover that you are missing some domain concept (the invoice depends on a price from a quote, not the catalog price) or that you have your service boundaries drawn incorrectly (Udi Dahan talks about this often).
So the "easy" part of the answer is that you should expect time to be a concept you model in your solution. After that, it gets context sensitive very quickly, and discovering the "right" answer may involve investigating subtle questions.

Firestore Geopoint in an Arrays

I am working with an interesting scenario that I am not sure can work, or will work well. With my current project I am trying to find an efficient way of working with geopoints in firestore. The straight forward approach where a document can contain a geopoints field is pretty self explanatory and easy to query. However, I am having to work with a varying amount of geopoints for a single document (Article). The reason for this is because a specific piece of content may need to be available in more than one geographic area.
For example, An article may need to be available only in NYC, Denver and Seattle. Using a geopoint for each location and searching by radius, in general, is a pretty standard task if I only wanted the article to be available in Seattle, but now it needs to be available in two more places.
The solution as I see it currently is to use an array and fill it with geopoints. The structure would look something like this:
articleText (String),
sortTime (Timestamp),
tags (Array)
- ['tagA','tagB','tagC','tagD'],
availableLocations (Array)
- [(Geopoint), (Geopoint), (Geopoint), (Geopoint)]
Then performing a query to get all content within 10 miles of a specific Geopoint starting at a specific postTime.
What I don't know is if putting the geopoints in an array works well or should be avoided in favor or another data structure.
I have considered replicating an article document for each geopoint, but that does not scale very well if more than a handful of locations needs defining. I've also considered creating a "reference" collection where each point is a document that contains the documentID of an article, but this leads to reading each reference document then reading the actual document. Essentially two document reads for 1 piece of content, which can get expensive based on the Firestore pricing model, and may slow things down unnecessarily.
Am I approaching this in an acceptable way? And are there other methods that can work more efficiently?

How to handle application death and other mid-operation faults with Mongo DB

Since Mongo doesn't have transactions that can be used to ensure that nothing is committed to the database unless its consistent (non corrupt) data, if my application dies between making a write to one document, and making a related write to another document, what techniques can I use to remove the corrupt data and/or recover in some way?
The greater idea behind NoSQL was to use a carefully modeled data structure for a specific problem, instead of hitting every problem with a hammer. That is also true for transactions, which should be referred to as 'short-lived transactions', because the typical RDBMS transaction hardly helps with 'real', long-lived transactions.
The kind of transaction supported by RDBMSs is often required only because the limited data model forces you to store the data across several tables, instead of using embedded arrays (think of the typical invoice / invoice items examples).
In MongoDB, try to use write-heavy, de-normalized data structures and keep data in a single document which improves read speed, data locality and ensures consistency. Such a data model is also easier to scale, because a single read only hits a single server, instead of having to collect data from multiple sources.
However, there are cases where the data must be read in a variety of contexts and de-normalization becomes unfeasible. In that case, you might want to take a look at Two-Phase Commits or choose a completely different concurrency approach, such as MVCC (in a sentence, that's what the likes of svn, git, etc. do). The latter, however, is hardly a drop-in replacement for RDBMs, but exposes a completely different kind of concurrency to a higher level of the application, if not the user.
Thinking about this myself, I want to identify some categories of affects:
Your operation has only one database save (saving data into one document)
Your operation has two database saves (updates, inserts, or deletions), A and B
They are independent
B is required for A to be valid
They are interdependent (A is required for B to be valid, and B is required for A to be valid)
Your operation has more than two database saves
I think this is a full list of the general possibilities. In case 1, you have no problem - one database save is atomic. In case 2.1, same thing, if they're independent, they might as well be two separate operations.
For case 2.2, if you do A first then B, at worst you will have some extra data (B data) that will take up space in your system, but otherwise be harmless. In case 2.3, you'll likely have some corrupt data in the event of a catastrophic failure. And case 3 is just a composition of case 2s.
Some examples for the different cases:
1.0. You change a car document's color to 'blue'
2.1. You change the car document's color to 'red' and the driver's hair color to 'red'
2.2. You create a new engine document and add its ID to the car document
2.3.a. You change your car's 'gasType' to 'diesel', which requires changing your engine to a 'diesel' type engine.
2.3.b. Another example of 2.3: You hitch car document A to another car document B, A getting the "towedBy" property set to B's ID, and B getting the "towing" property set to A's ID
3.0. I'll leave examples of this to your imagination
In many cases, its possible to turn a 2.3 scenario into a 2.2 scenario. In the 2.3.a example, the car document and engine are separate documents. Lets ignore the possibility of putting the engine inside the car document for this example. Its both invalid to have a diesel engine and non-diesel gas and to have a non-diesel engine and diesel gas. So they both have to change. But it may be valid to have no engine at all and have diesel gas. So you could add a step that makes the whole thing valid at all points. First, remove the engine, then replace the gas, then change the type of the engine, and lastly add the engine back onto the car.
If you will get corrupt data from a 2.3 scenario, you'll want a way to detect the corruption. In example 2.3.b, things might break if one document has the "towing" property, but the other document doesn't have a corresponding "towedBy" property. So this might be something to check after a catastrophic failure. Find all documents that have "towing" but the document with the id in that property doesn't have its "towedBy" set to the right ID. The choices there would be to delete the "towing" property or set the appropriate "towedBy" property. They both seem equally valid, but it might depend on your application.
In some situations, you might be able to find corrupt data like this, but you won't know what the data was before those things were set. In those cases, setting a default is probably better than nothing. Some types of corruption are better than others (particularly the kind that will cause errors in your application rather than simply incorrect display data).
If the above kind of code analysis or corruption repair becomes unfeasible, or if you want to avoid any data corruption at all, your last resort would be to take mnemosyn's suggestion and implement Two-Phase Commits, MVCC, or something similar that allows you to identify and roll back changes in an indeterminate state.

How to store data in MongoDB that needs to be accessed from both ends (bidirectional)

I have a bunch of players and teams. Teams are made up of lots of players. Players can belong to more than one team. I need to find which players are on a given team or which teams a given player is on.
Would it be best to use DBRefs, a player collection with embedded teams, a team collection with embedded players, both collections, something else, or is MongoDB just not a good choice at all here?
I hope you've read this page as well as multiple other discussions of schema design in MongoDB - it can be very helpful to see similar examples and what other people are taking into account.
Having said that, it doesn't sound like your data set is going to be very large - is it? If that's the case it probably doesn't matter hugely for performance and you should probably strive to design first for clarity and ease of access from your application.
I would think that it's reasonable to have a collection of teams and a collection of players and then decide if you want to have an array of players embedded in each team representing who is on that team, or an array of teams embedded into each player document indicating which teams they are on. The array could be an array of IDs or IDs and names, etc.
Whichever of the above you pick, querying is easy - if you have team arrays embedded in players, to find out all the players on a particular team you can just query for all players who have that team in their array. The query might look something like:
db.players.find({"teams.name":"TeamRocky"})
This will return all the player documents for players who have that team name in their "teams" array (this is just an example, your actual implementation may be quite different). It would be just as easy if you embedded player array in each team document instead.
Things to consider when deciding this should probably include not just how you will be querying the data but also how and how often you expect to be updating the data - will players be moving from team to team? Will teams dissolve and if so, will you have to track their historical information, etc.
I would encourage you to try some schema and build a small prototype then see what seems problematic and change/evolve the schema as you get more familiar with your application requirements.

How to store tree base data into Data base?

I am developing an application of iphone which is navigation based, and i am showing data on first View Items then on second View Sub Items and so on. So my question is that what will good approach to save this on Data base (sqlite).
Keep this simple.
Each object/View has it's own ID and at least one parent ID.
This will ensure your data can represent trees of any depth and any complexity.
i am not an expert in this field but you can do it like this ... now that you said all you data can be represented something like tree...
Find all the objects that will be leaf of your tree make those objects as a table in DB, (please keep in mind that you make table only for objects that have different structure not because they have different values)
Repeat above step for one level above until you reach top
Eventually you will find that you just got your DB.
To be more precise you need to study DBMS