Does Firebase Realtime Database always read the complete node if you reference it? - unity3d

Given a hypothetical node structure like:
NodeA:
-Subnode1: 000000001
-Subnode2: "thisIsAVeeeeeeeeeeeryLoooongString"
I would like to update NodeA every X minutes. I only want to write it, not read it. Subnode1 would be a timestamp which I set with Server.TimeStamp, and Subnode2 would be a changing string.
I would like to know whether just referencing 'NodeA' makes Firebase read the contents of the whole node and, if it does, whether there is a way to avoid it. Subnode2 can be quite heavy, and I would like to control when I read it.
Clarifications:
I'm not reading the node using any querying function. My question arises because I wonder whether the referenced nodes (using dbReference = fbbase.GetReference(path)) are read automatically when the app starts.
I know I could use a different reference for each node, but then I would incur different upload costs, since it would mean 2 different connections (yes, uploads also have costs depending on the frequency).
I'm using Firebase SDK for Unity.
Thanks in advance.

If you query NodeA, it will pull down the entire contents of that node, including all of its children.
If you want just a specific child, query it instead. You can certainly build a path to Subnode1 if you want.
There is no way to exclude a certain child from a query, while getting all others. If you don't want all children, you must query each desired child individually.

Firebase RTDB charges for storage volume and data downloads. If you are simply updating the record in the node, you should not incur costs other than minor network costs.

A reference does not incur any fees.
That being said, reads and writes do.
The reason is that a reference is a hypothetical location for a document or a query, and it does not necessarily exist until its contents have been populated by an update snapshot.
When you read or write to a node, your data plus overhead is calculated based on the current cost model per KB.

Related

Updating Redundant data/denormalized data in NoSQL(Aerospike)

I have a problem where I need to update data that has been denormalized because it lives in NoSQL: a single change to one piece of data has to be applied to all of its redundant copies.
For example, consider an e-commerce database with one table "Products" that contains all the details about a product, let's say name, imageName, LogoImage.
In this case the LogoImage of various "Products" entries can be the same. When I need to update the LogoImage, I have to update every entry that contains the given LogoImage, which seems like a very poor solution.
So is there any better way to do that?
P.S.: If we separate logos and Products into 2 different tables, then when I need to get 1000 products at a time, I have to fetch the related logos by implementing a client-level join, which is also not a good solution.
You're suggesting using the database as your CDN and storing the binary image in it? That's not a great approach, in my opinion. You should be storing that image in an actual CDN like Amazon Cloudfront, or a simple one like Amazon S3, or your own webserver as a file. Whichever, the point is that you should be referring to it by URI. In Aerospike you would store the metadata about that image, not the image itself.
Next, you can have two sets - prod for products and prodimg for product images. The various products store a list of IDs referring to the product image set. The product image set has metadata about each image as a separate record { uri, name, title, width, length, ... } . If anything changes about this image, you just update the one record with the metadata for that image in prodimg. No need to change anything about the products.
And you don't really need JOIN functionality in this case. Your application can get the prod record first, and use the bin (images) that has all the IDs of the images for the product (each referring to a key of a record in prodimg). You can then issue either a few get operations (reads) or a single batch-read for all of them if there are many. The latencies for Aerospike are such that this will return faster and scale better than an equivalent JOIN in an RDBMS. A batch-read is a multi-node, multi-core, multi-threaded operation. A cluster of 3 multi-core nodes has plenty of parallel computing power.
Again, if you "need 1000 products at a time" use batch-read. In the Java client that's an AerospikeClient.get() with a list of Key objects. In the Python client that's an aerospike.Client.get_many. Every Aerospike client has batch-read functionality.
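The two-set layout and the batch read described above can be sketched as follows. This is a minimal illustration that uses plain Python dicts in place of the prod and prodimg sets; batch_get is a hypothetical stand-in for a client's batch-read call, not the Aerospike API, and the record contents and URIs are made up.

```python
# In-memory stand-ins for the two sets (assumption: a real system would
# read these through the database client instead of plain dicts).
prod = {
    "p1": {"name": "Kettle", "images": ["img1"]},
    "p2": {"name": "Toaster", "images": ["img1", "img2"]},
}
prodimg = {
    "img1": {"uri": "https://cdn.example.com/logo.png", "title": "Logo"},
    "img2": {"uri": "https://cdn.example.com/toaster.png", "title": "Toaster"},
}


def batch_get(store, keys):
    """Hypothetical batch read: one call returning the records for many keys."""
    return {k: store[k] for k in keys if k in store}


def products_with_images(product_ids):
    """Client-side 'join': read the products, then batch-read their image metadata."""
    products = batch_get(prod, product_ids)
    # Collect each distinct image ID once, then fetch all metadata in one batch.
    image_ids = sorted({i for p in products.values() for i in p["images"]})
    images = batch_get(prodimg, image_ids)
    return {
        pid: {"name": p["name"], "image_uris": [images[i]["uri"] for i in p["images"]]}
        for pid, p in products.items()
    }


# Changing the shared logo touches exactly one prodimg record;
# no product record has to be rewritten.
prodimg["img1"]["uri"] = "https://cdn.example.com/logo-v2.png"
result = products_with_images(["p1", "p2"])
```

Both products pick up the new logo URI on the next read, because they only store the image ID.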

Snapshot taking and restore strategies

I've been reading about CQRS+EventSourcing patterns (which I wish to apply in the near future), and one point common to all the decks and presentations I found is to take snapshots of your model state in order to restore it, but none of them share patterns/strategies for doing that.
I wonder if you could share your thoughts and experience in this matter particularly in terms of:
When to snapshot
How to model a snapshot store
Application/cache cold start
TL;DR: How have you implemented Snapshotting in your CQRS+EventSourcing application? Pros and Cons?
Rule #1: Don't.
Rule #2: Don't.
Snapshotting an event sourced model is a performance optimization. The first rule of performance optimization? Don't.
Specifically, snapshotting reduces the amount of time you lose in your repository trying to reload the history of your model from your event store.
If your repository can keep the model in memory, then you aren't going to be reloading it very often. So the win from snapshotting will be small. Therefore: don't.
If you can decompose your model into aggregates, which is to say that you can decompose the history of your model into a number of entities that have non-overlapping histories, then your one long model history becomes many short histories that each describe the changes to a single entity. Each entity history that you need to load will be pretty short, so the win from a snapshot will be small. Therefore: don't.
The kind of systems I'm working on today require high performance but not 24x7 availability. So in a situation where I shut down my system for maintenance and restart it, I'd have to load and reprocess my whole event store, as my fresh system doesn't know which aggregate ids to process the events for. I need a better starting point so that my systems restart more efficiently.
You are worried about missing a write SLA when the repository memory caches are cold, and you have long model histories with lots of events to reload. Bolting on snapshotting might be a lot more reasonable than trying to refactor your model history into smaller streams. OK....
The snapshot store is a read model -- at any point in time, you should be able to blow away the model and rebuild it from the persisted history in the event store.
From the perspective of the repository, the snapshot store is a cache; if no snapshot is available, or if the store itself doesn't respond within the SLA, you want to fall back to reprocessing the entire event history, starting from the initial seed state.
The service provider interface is going to look something like
interface SnapshotClient {
    // Returns the latest snapshot for the given id, if one exists.
    SnapshotRecord getSnapshot(Identifier id);
}
SnapshotRecord is going to provide to the repository the information it needs to consume the snapshot. That's going to include at a minimum
a memento that allows the repository to rehydrate the snapshotted state
a description of the last event processed by the snapshot projector when building the snapshot.
The model will then re-hydrate the snapshotted state from the memento, load the history from the event store, scanning backwards (ie, starting from the most recent event) looking for the event documented in the SnapshotRecord, then apply the subsequent events in order.
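The fallback behaviour and the rehydrate-then-replay step can be sketched as below. This is only an illustration under stated assumptions: state is a running integer balance, events are integers, and the SnapshotRecord identifies the last processed event by its position in the stream (a simplification of scanning backwards for that event). The names load, apply and get_snapshot are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SnapshotRecord:
    memento: int           # serialized state; here simply a running balance
    last_event_index: int  # position of the last event folded into the memento


def apply(state, event):
    """Fold one event into the state (assumption: events are signed amounts)."""
    return state + event


def load(events, get_snapshot, seed=0):
    """Rehydrate from a snapshot if possible, else replay the whole history."""
    try:
        snap = get_snapshot()
    except Exception:
        # The snapshot store is only a cache: any failure means full replay.
        snap = None
    if snap is None:
        state, start = seed, 0
    else:
        state, start = snap.memento, snap.last_event_index + 1
    # Apply only the events recorded after the snapshot, in order.
    for event in events[start:]:
        state = apply(state, event)
    return state


def failing_store():
    raise TimeoutError("snapshot store did not respond in time")


events = [10, -3, 5, 7]
from_snapshot = load(events, lambda: SnapshotRecord(memento=7, last_event_index=1))
from_scratch = load(events, lambda: None)
after_failure = load(events, failing_store)
```

All three calls arrive at the same state; the snapshot only changes how many events have to be replayed.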
The SnapshotRepository itself could be a key-value store (at most one record for any given id), but a relational database with blob support will work fine too:
select *
from snapshots s
where id = ?
order by s.total_events desc
limit 1
The snapshot projector and the repository are tightly coupled -- they need to agree on what the state of the entity should be for all possible histories, they need to agree how to de/re-hydrate the memento, and they need to agree which id will be used to locate the snapshot.
The tight coupling also means that you don't need to worry particularly about the schema for the memento; a byte array will be fine.
They don't, however, need to agree with previous incarnations of themselves. Snapshot Projector 2.0 discards/ignores any snapshots left behind by Snapshot Projector 1.0 -- the snapshot store is just a cache after all.
I'm designing an application that will probably generate millions of events a day. What can we do if we need to rebuild a view 6 months later?
One of the more compelling answers here is to model time explicitly. Do you have one entity that lives for six months, or do you have 180+ entities that each live for one day? Accounting is a good domain to reference here: at the end of the fiscal year, the books are closed, and the next year's books are opened with the carryover.
Yves Reynhout frequently talks about modeling time and scheduling; Evolving a Model may be a good starting point.
There are few instances you need to snapshot for sure. But there are a couple - a common example is an account in a ledger. You'll have thousands maybe millions of credit/debit events producing the final BALANCE state of the account - it would be insane not to snapshot that every so often.
My approach to snapshotting when I designed Aggregates.NET was that it's off by default. To enable it, your aggregates or entities must inherit from AggregateWithMemento or EntityWithMemento, which in turn requires your entity to define a RestoreSnapshot, a TakeSnapshot and a ShouldTakeSnapshot.
The decision whether to take a snapshot or not is left up to the entity itself. A common pattern is
Boolean ShouldTakeSnapshot() {
    return this.Version % 50 == 0;
}
Which of course would take a snapshot every 50 events.
When reading the entity stream, the first thing we do is check for a snapshot, then read the rest of the entity's stream from the moment the snapshot was taken. I.e., don't ask for the entire stream, just the part we have not snapshotted.
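A rough sketch of that pattern in Python, not the Aggregates.NET API: the Account class, the record and load helpers, and the in-memory stream and snapshots containers are all hypothetical, and a single list stands in for one entity's event stream.

```python
SNAPSHOT_EVERY = 50  # same interval as the ShouldTakeSnapshot example


class Account:
    def __init__(self):
        self.version = 0  # number of events applied so far
        self.balance = 0

    def apply(self, amount):
        self.balance += amount
        self.version += 1

    def should_take_snapshot(self):
        return self.version % SNAPSHOT_EVERY == 0

    def take_snapshot(self):
        return {"version": self.version, "balance": self.balance}

    def restore_snapshot(self, snap):
        self.version = snap["version"]
        self.balance = snap["balance"]


stream = []     # stand-in for this entity's stream in the event store
snapshots = {}  # stand-in for a key-value snapshot store


def record(account, account_id, amount):
    """Append an event, apply it, and snapshot when the entity asks for it."""
    stream.append(amount)
    account.apply(amount)
    if account.should_take_snapshot():
        snapshots[account_id] = account.take_snapshot()


def load(account_id):
    """Restore from the snapshot (if any), then replay only the later events."""
    account = Account()
    snap = snapshots.get(account_id)
    if snap is not None:
        account.restore_snapshot(snap)
    replayed = 0
    for amount in stream[account.version:]:
        account.apply(amount)
        replayed += 1
    return account, replayed


live = Account()
for _ in range(120):
    record(live, "acct-1", 1)
loaded, replayed = load("acct-1")
```

After 120 events the latest snapshot is at version 100, so loading replays only the last 20 events.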
As for the store, you can use literally anything. VOU is right, though: a key-value store is best, because you only need to 1. check if a snapshot exists and 2. load the entire thing, which is ideal for a key-value store.
For system restarts, I'm not really following what your described problem is. There's no reason for your domain server to be stateful in the sense that it's doing something different at different points in time. It should do just one thing: process the next command. In the process of handling a command, it loads data from the event store, including a snapshot, and runs the command against the entity, which either produces a business exception or domain events that are recorded to the store.
I think you may be trying to optimize too much with this talk of caching and cold starts.

Detecting concurrent data modification of document between read and write

I'm interested in a scenario where a document is fetched from the database, some computations are run based on some external conditions, one of the fields of the document gets updated and then the document gets saved, all in a system that might have concurrent threads accessing the DB.
To make it easier to understand, here's a very simplistic example. Suppose I have the following document:
{
...
items_average: 1234,
last_10_items: [10,2187,2133, ...]
...
}
Suppose a new item (X) comes in, five things will need to be done:
read the document from the DB
remove the first (oldest) item in the last_10_items
add X to the end of the array
re-compute the average* and save it in items_average.
write the document to the DB
* NOTE: the average computation was chosen as a very simple example, but the question should take into account more complex operations based on data existing in the document and on new data (i.e. not something solvable with the $inc operator)
This certainly is something easy to implement in a single-threaded system, but in a concurrent system, if 2 threads would like to follow the above steps, inconsistencies might occur since both will update the last_10_items and items_average values without considering and/or overwriting the concurrent changes.
So, my question is how can such a scenario be handled? Is there a way to check or react-upon the fact that the underlying document was changed between steps 1 and 5? Is there such a thing as WATCH from redis or 'Concurrent Modification Error' from relational DBs?
Thanks
A database system uses a memory inspection and rollback scheme that is similar to transactional memory.
Briefly, it monitors the shared memory regions you specify and does something like compare-and-swap, load-link, or test-and-set.
Therefore, if any of that memory content changes during the transaction, it aborts and tries again until there is no conflicting operation on that shared memory.
For example, GCC implements the following:
https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
type __sync_lock_test_and_set (type *ptr, type value, ...)
type __sync_val_compare_and_swap (type *ptr, type oldval, type newval, ...)
For more info about transactional memory,
http://en.wikipedia.org/wiki/Software_transactional_memory
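Applied to the document example from the question, the same "abort and retry on conflict" idea is usually expressed as optimistic concurrency with a version field. The sketch below is a minimal illustration and not any particular database's API: VersionedStore is a hypothetical in-memory store, the version field is an added assumption, and the lock only stands in for the atomic conditional write a real database would provide.

```python
import threading


class VersionedStore:
    """Hypothetical in-memory document store with a conditional write."""

    def __init__(self, doc):
        self._doc = dict(doc, version=0)
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            copy = dict(self._doc)
            copy["last_10_items"] = list(copy["last_10_items"])
            return copy

    def compare_and_swap(self, expected_version, new_doc):
        """Write only if nobody else has written since expected_version."""
        with self._lock:
            if self._doc["version"] != expected_version:
                return False  # concurrent modification detected
            self._doc = dict(new_doc, version=expected_version + 1)
            return True


def add_item(store, x):
    """Steps 1-5 from the question, retried until the write goes through."""
    while True:
        doc = store.read()                        # 1. read the document
        items = doc["last_10_items"][1:] + [x]    # 2-3. drop oldest, append X
        doc["last_10_items"] = items
        doc["items_average"] = sum(items) / len(items)  # 4. recompute
        if store.compare_and_swap(doc["version"], doc):  # 5. conditional write
            return
```

A writer that loses the race sees compare_and_swap return False, re-reads the document and recomputes, so no update is silently overwritten.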

How to handle application death and other mid-operation faults with Mongo DB

Since Mongo doesn't have transactions that can be used to ensure that nothing is committed to the database unless it's consistent (non-corrupt) data, if my application dies between making a write to one document and making a related write to another document, what techniques can I use to remove the corrupt data and/or recover in some way?
The greater idea behind NoSQL was to use a carefully modeled data structure for a specific problem, instead of hitting every problem with a hammer. That is also true for transactions, which should be referred to as 'short-lived transactions', because the typical RDBMS transaction hardly helps with 'real', long-lived transactions.
The kind of transaction supported by RDBMSs is often required only because the limited data model forces you to store the data across several tables, instead of using embedded arrays (think of the typical invoice / invoice items examples).
In MongoDB, try to use write-heavy, de-normalized data structures and keep data in a single document which improves read speed, data locality and ensures consistency. Such a data model is also easier to scale, because a single read only hits a single server, instead of having to collect data from multiple sources.
However, there are cases where the data must be read in a variety of contexts and de-normalization becomes unfeasible. In that case, you might want to take a look at Two-Phase Commits or choose a completely different concurrency approach, such as MVCC (in a sentence, that's what the likes of svn, git, etc. do). The latter, however, is hardly a drop-in replacement for RDBMSs, but exposes a completely different kind of concurrency to a higher level of the application, if not the user.
Thinking about this myself, I want to identify some categories of effects:
1. Your operation has only one database save (saving data into one document)
2. Your operation has two database saves (updates, inserts, or deletions), A and B
2.1. They are independent
2.2. B is required for A to be valid
2.3. They are interdependent (A is required for B to be valid, and B is required for A to be valid)
3. Your operation has more than two database saves
I think this is a full list of the general possibilities. In case 1, you have no problem: one database save is atomic. In case 2.1, same thing; if they're independent, they might as well be two separate operations.
For case 2.2, if you do B first and then A, at worst you will have some extra data (B data) that will take up space in your system, but is otherwise harmless. In case 2.3, you'll likely have some corrupt data in the event of a catastrophic failure. And case 3 is just a composition of case 2s.
Some examples for the different cases:
1.0. You change a car document's color to 'blue'
2.1. You change the car document's color to 'red' and the driver's hair color to 'red'
2.2. You create a new engine document and add its ID to the car document
2.3.a. You change your car's 'gasType' to 'diesel', which requires changing your engine to a 'diesel' type engine.
2.3.b. Another example of 2.3: You hitch car document A to another car document B, A getting the "towedBy" property set to B's ID, and B getting the "towing" property set to A's ID
3.0. I'll leave examples of this to your imagination
In many cases, it's possible to turn a 2.3 scenario into a 2.2 scenario. In the 2.3.a example, the car document and engine are separate documents. Let's ignore the possibility of putting the engine inside the car document for this example. It's invalid both to have a diesel engine with non-diesel gas and to have a non-diesel engine with diesel gas. So they both have to change. But it may be valid to have no engine at all and have diesel gas. So you could add a step that makes the whole thing valid at all points. First, remove the engine, then replace the gas, then change the type of the engine, and lastly add the engine back onto the car.
If you will get corrupt data from a 2.3 scenario, you'll want a way to detect the corruption. In example 2.3.b, things might break if one document has the "towing" property, but the other document doesn't have a corresponding "towedBy" property. So this might be something to check after a catastrophic failure. Find all documents that have "towing" where the document with the ID in that property doesn't have its "towedBy" set to the right ID. The choices there would be to delete the "towing" property or set the appropriate "towedBy" property. They both seem equally valid, but it might depend on your application.
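The consistency check for example 2.3.b could look roughly like the sketch below. It works on a plain Python dict of car documents keyed by ID, which is an assumption for illustration; against a real database this would be a query plus a lookup per hit. The function name find_broken_tow_links is hypothetical.

```python
def find_broken_tow_links(cars):
    """Return (tower_id, towed_id) pairs whose "towedBy" back-reference is missing or wrong."""
    broken = []
    for car_id, car in cars.items():
        towed_id = car.get("towing")
        if towed_id is None:
            continue  # this car is not towing anything
        towed = cars.get(towed_id)
        if towed is None or towed.get("towedBy") != car_id:
            broken.append((car_id, towed_id))
    return broken


cars = {
    "A": {"towedBy": "B"},
    "B": {"towing": "A"},  # consistent pair: B tows A
    "C": {"towing": "D"},  # interrupted write: D never got "towedBy"
    "D": {},
}
broken = find_broken_tow_links(cars)
```

Each reported pair can then be repaired either by deleting "towing" on the first document or by setting "towedBy" on the second, as discussed above.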
In some situations, you might be able to find corrupt data like this, but you won't know what the data was before those things were set. In those cases, setting a default is probably better than nothing. Some types of corruption are better than others (particularly the kind that will cause errors in your application rather than simply incorrect display data).
If the above kind of code analysis or corruption repair becomes unfeasible, or if you want to avoid any data corruption at all, your last resort would be to take mnemosyn's suggestion and implement Two-Phase Commits, MVCC, or something similar that allows you to identify and roll back changes in an indeterminate state.

How do I model a queue on top of a key-value store efficiently?

Suppose I have a key-value database, and I need to build a queue on top of it. How could I achieve this without getting bad performance?
One idea might be to store the queue inside an array, and simply store the array using a fixed key. This is a quite simple implementation, but is very slow, as for every read or write access the complete array must be loaded / saved.
I could also implement a linked list with random keys, plus one fixed key that acts as the starting point to element 1. Depending on whether I prefer fast read or fast write access, I could let the fixed element point to the first or the last entry in the queue (so I have to traverse it forward / backward).
Or, to take that further, I could also have two fixed pointers: one for the first and one for the last item.
Any other suggestions on how to do this effectively?
A key-value structure is very similar to raw memory storage, where the physical address in computer memory acts as the key. So any type of data structure, including a linked list, can be modeled on top of key-value storage.
A linked list is a list of nodes, each holding the index of the previous or the following node. The node itself can also be viewed as a small key-value structure. By adding a prefix to the key, the information in each node can be stored separately in a flat table of key-value pairs.
Going further, a special suffix on the key also makes it possible to get rid of the redundant pointer information. Such a list might look something like this:
pilot-last-index: 5
pilot-0: Rei Ayanami
pilot-1: Shinji Ikari
pilot-2: Soryu Asuka Langley
pilot-3: Touji Suzuhara
pilot-5: Makinami Mari
The corresponding algorithm is easy to imagine, I think. If you can have a daemon thread for manipulating these keys, pilot-5 could be renamed to pilot-4 in the above example. Even if an additional thread is not allowed in some situations, the content of the queue itself is not affected; there is just some overhead at the gap in the sequence.
Which of the two approaches above to apply is a matter of balancing the cost of storage space against the overhead of CPU time.
Thread safety is indeed a problem, but an old one. Just like the classes implementing the ConcurrentMap interface in the JDK, atomic operations on key-value data are also available. Some key-value middleware, like memcached, has similar methods that let you update a key or a value separately and thread-safely. However, these implementations are a matter of the algorithm rather than of the key-value structure itself.
I think it depends on the kind of queue you want to implement, and no solution will be perfect because a key-value store is not the right data structure for this kind of task. There will be always some kind of hack involved.
For a simple first-in, first-out queue you could use a few key-value entries like the following:
{
oldestIndex:5,
newestIndex:10
}
In this example there would be 6 items in the queue (5, 6, 7, 8, 9, 10). Items 0 to 4 are already done, and there is no item 11 yet. The producer worker would increment newestIndex and save its item under the key 11. The consumer takes the item under the key 5 and increments oldestIndex.
Note that this approach can lead to problems if you have multiple consumers/producers, or if the queue is never empty so that you can't reset the index.
But the multithreading problem is also true for linked lists etc.
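The oldestIndex/newestIndex scheme can be sketched as follows, using a Python dict in place of the key-value store. The KVQueue class and the "item-N" key naming are assumptions made for illustration, and the sketch is single-threaded: with several producers or consumers, the index updates would need an atomic increment or a compare-and-swap from the store.

```python
class KVQueue:
    """FIFO queue built on a flat key-value mapping."""

    def __init__(self, store):
        self.store = store
        # An empty queue is represented by oldestIndex == newestIndex + 1.
        self.store.setdefault("oldestIndex", 0)
        self.store.setdefault("newestIndex", -1)

    def enqueue(self, item):
        index = self.store["newestIndex"] + 1
        self.store[f"item-{index}"] = item  # save the item under the next key
        self.store["newestIndex"] = index   # then publish the new index

    def dequeue(self):
        index = self.store["oldestIndex"]
        if index > self.store["newestIndex"]:
            return None  # queue is empty
        item = self.store.pop(f"item-{index}")
        self.store["oldestIndex"] = index + 1
        return item

    def __len__(self):
        return self.store["newestIndex"] - self.store["oldestIndex"] + 1


backing = {}
queue = KVQueue(backing)
for name in ["a", "b", "c"]:
    queue.enqueue(name)
first = queue.dequeue()
```

Each enqueue or dequeue touches only two or three keys, regardless of how long the queue is, which avoids loading and saving the whole array.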