RocksDB newbie here.
At runtime, I only use RocksDB to read data. Sometimes I need to merge in session-specific records from other sources.
I don't want them to be merged into the main database.
I want them to exist only during the session lifetime for that specific session.
I could, of course, use a regular std::vector or something and merge the RocksDB data with the other sources, but that would duplicate the data.
I see a bunch of concepts like the memtable and merge, which sound like they might be usable (or exploitable) here. For example, if I could tell the memtable to never commit and just abandon the changes, that would work. Is this doable?
The easiest way is probably to separate them into different column families and drop the one you don't want to persist when you shut down your application. If you need per-entry lifetimes, you will probably have to build something custom, like a RAII holder class (if you are using C++) that inserts on construction and deletes on destruction. I would still go with a separate column family so the data stays cleanly separated in case of a crash.
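A minimal sketch of the column-family approach; the path and the column-family name are made up, and error handling is reduced to asserts:

```cpp
#include <cassert>
#include <string>

#include <rocksdb/db.h>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/appdb", &db);
  assert(s.ok());

  // Create a column family that holds only this session's records.
  rocksdb::ColumnFamilyHandle* session_cf = nullptr;
  s = db->CreateColumnFamily(rocksdb::ColumnFamilyOptions(), "session", &session_cf);
  assert(s.ok());

  // Session-specific writes go to the session column family; the default
  // column family (the main database) is never touched.
  s = db->Put(rocksdb::WriteOptions(), session_cf, "key1", "session-value");
  assert(s.ok());

  std::string value;
  s = db->Get(rocksdb::ReadOptions(), session_cf, "key1", &value);
  assert(s.ok() && value == "session-value");

  // On shutdown, drop the column family so nothing session-specific persists.
  s = db->DropColumnFamily(session_cf);
  assert(s.ok());
  db->DestroyColumnFamilyHandle(session_cf);
  delete db;
}
```

If the process crashes before the drop, the "session" column family will still exist on disk. At the next startup you can enumerate what's there with rocksdb::DB::ListColumnFamilies and drop any leftovers; note that once extra column families exist, DB::Open must be given the full list of column-family descriptors.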
We would like to be able to read state inside a command use case.
We could get the state from the event store for the specific aggregate, but what about querying aggregates by a field (not the id), or performing more complicated queries that are not a good fit for the event store?
The approach we were considering was to use our read model for those cases as well, and not only for query use cases.
This might be inconsistent, so a solution could be to store the latest version of the aggregate in both the write and read models, in order to be able to tell whether the state is correct or stale.
Does this make sense? And if so, when we need to get state by id, should we use the event store or the read model?
If you want the absolute latest state of an event-sourced aggregate, you're going to have to read the latest snapshot (assuming that you are snapshotting) and then replay events since that snapshot from the event store. You can be aggressive about snapshotting (conceivably even saving a snapshot after every command), but you're giving away some write performance to make the read faster.
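For illustration, that read path looks something like this; the event store and the Apply function are in-memory stand-ins I've invented, not a real library API:

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <vector>

// In-memory stand-ins for a real event store; everything here is invented.
struct Event { uint64_t sequence; std::string payload; };
struct Snapshot { uint64_t last_sequence; std::string state; };

struct EventStore {
  std::map<std::string, Snapshot> snapshots;
  std::map<std::string, std::vector<Event>> streams;

  std::optional<Snapshot> LoadSnapshot(const std::string& id) const {
    auto it = snapshots.find(id);
    if (it == snapshots.end()) return std::nullopt;
    return it->second;
  }
  std::vector<Event> LoadEventsAfter(const std::string& id, uint64_t after) const {
    std::vector<Event> out;
    auto it = streams.find(id);
    if (it == streams.end()) return out;
    for (const Event& e : it->second)
      if (e.sequence > after) out.push_back(e);
    return out;
  }
};

// Domain-specific fold; here the "state" is just concatenated payloads.
void Apply(std::string& state, const Event& e) { state += e.payload; }

// Latest state = latest snapshot (if any) + replay of everything newer.
std::string LoadLatestState(const EventStore& store, const std::string& id) {
  std::string state;
  uint64_t replay_from = 0;
  if (auto snap = store.LoadSnapshot(id)) {
    state = snap->state;
    replay_from = snap->last_sequence;
  }
  for (const Event& e : store.LoadEventsAfter(id, replay_from))
    Apply(state, e);
  return state;
}
```

The trade-off mentioned above lives in how often you write to `snapshots`: snapshot after every command and the replay loop does almost nothing, at the cost of an extra write per command.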
Updating the read model directly is conceivably possible, though that level of coupling is something that should be considered very carefully. Note also that you will very likely need some sort of two-phase commit to ensure that the read model is only updated when the write model is updated and vice versa. I strongly suggest considering why you're using CQRS/ES in this project, because you are quite possibly undermining that reason by doing this sort of thing.
In general, if you need a query for processing a particular command, it's likely that query will generally be the same, i.e. you don't need free-form query support. In that case, you can often have a read model that's tuned for exactly that query and which only cares about events which could affect that query: often a fairly small subset of the events. The finer-grained the read model, the easier it is to keep in sync (if it ignores 99% of events, for instance, it can't really fall that far behind).
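As a toy illustration of such a fine-grained read model (every name here is invented), a projection answering "units sold per item" can ignore every event type except the one that affects its query:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <variant>

// Invented domain events; the projection below cares about only one of them.
struct ItemRegistered { std::string item_id; std::string name; };
struct ItemSold { std::string item_id; uint64_t quantity; };
using DomainEvent = std::variant<ItemRegistered, ItemSold>;

// A read model tuned for exactly one query: "units sold per item".
// Because it ignores every event except ItemSold, it is cheap to keep
// in sync and cheap to rebuild.
class UnitsSoldModel {
 public:
  void Handle(const DomainEvent& event) {
    if (const auto* sold = std::get_if<ItemSold>(&event)) {
      units_sold_[sold->item_id] += sold->quantity;
    }
    // Every other event type is deliberately ignored.
  }

  uint64_t UnitsSold(const std::string& item_id) const {
    auto it = units_sold_.find(item_id);
    return it == units_sold_.end() ? 0 : it->second;
  }

 private:
  std::unordered_map<std::string, uint64_t> units_sold_;
};
```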
Needing to make complex queries as part of command processing could also be a sign that your aggregate boundaries aren't right and could do with a re-examination.
"Does this make sense?"
Maybe. Let's start with:
"This might be inconsistent."
Yup, it might be. So what?
We typically respond to a query by sending an unlocked copy of the answer. In other words, it's possible that the actual information in the write model will change after this response is dispatched but before the response arrives at its destination. The client will be looking at a copy of the answer taken from the past.
So we might reasonably ask how much better it is to get information no more than one minute old compared to information no more than five minutes old. If the difference in value is pennies, then you should probably deploy the five minute version. If the difference is millions of dollars, then you're in a good position to negotiate a real budget to solve the problem.
For processing a command in our own write model, that kind of inconsistency isn't usually acceptable or wise. But neither of the two common answers requires keeping the read and write models synchronized. The most common answer is to just work with the write model alone. The less common answer is to grab a snapshot out of a cache and then apply any additional events to bring it up to date. The latter approach is "just" a performance optimization (first rule: don't).
The variation that trips everyone up is trying to process a command somewhere else while enforcing a consistency rule on the data here. Once again, you need a really clear picture of how valuable the consistency is to the business. If it's really important, that may be a signal that the information in question shouldn't be split into two different piles; you may be working with the wrong underlying data model.
Possibly useful references:
Pat Helland, "Data on the Outside Versus Data on the Inside"
Udi Dahan, "Race Conditions Don't Exist"
I've got an application with a lot of stateless microservices which pass their variable context from one to another. I've got a case where I start a few chains of services with the same context in parallel and then wait for them to finish. Each service can modify its variable context, but after all of the chains have finished I have to merge their variable contexts and ensure there are no conflicts.
It's possible to solve this problem by storing the whole history of variable modifications, but it's a huge data overhead which I'd like to avoid.
Another solution I see is to find some hashing function which lets me calculate the hash of a modification history from the existing hash and the new data, and also lets me check whether one history is a prefix of another knowing only their hashes. But I'm unable to find such a function.
I'm looking for any applicable algorithm with as little data overhead as possible.
What you need are version clocks (also known as vector clocks), an old idea that can be used to merge parallel data modifications and to detect conflicts.
"It's possible to solve this problem by storing the whole history of variable modifications, but it's a huge data overhead which I'd like to avoid."
With vector clocks you don't keep the entire history, but a counter for each variable and node (so each variable has a vector of counters).
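A minimal sketch of a per-variable vector clock, assuming each parallel chain has a stable node id. The prefix check you asked about falls out naturally: one history precedes another exactly when its counters are all less than or equal to the other's, and a conflict is exactly the case where neither precedes the other.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>

// One vector clock per variable: node id -> number of modifications
// that node has made to the variable.
using VectorClock = std::map<std::string, uint64_t>;

// Called whenever a node modifies its copy of the variable.
void Tick(VectorClock& clock, const std::string& node) { ++clock[node]; }

// a "happened before or equals" b: every counter in a is <= the one in b.
bool HappenedBeforeOrEqual(const VectorClock& a, const VectorClock& b) {
  for (const auto& [node, count] : a) {
    auto it = b.find(node);
    if (it == b.end() || it->second < count) return false;
  }
  return true;
}

// Two histories conflict iff neither precedes the other.
bool Conflicts(const VectorClock& a, const VectorClock& b) {
  return !HappenedBeforeOrEqual(a, b) && !HappenedBeforeOrEqual(b, a);
}

// Merging compatible histories keeps the element-wise maximum.
VectorClock Merge(const VectorClock& a, const VectorClock& b) {
  VectorClock out = a;
  for (const auto& [node, count] : b)
    out[node] = std::max(out[node], count);
  return out;
}
```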
Storing the whole history of variable modifications doesn't sound too awful, actually. For example, you can put modification information onto a queue, then have a service that processes that queue in batches and puts the results into one single place.
This is a common approach, for example, in situations where there is a huge parallel workload and you can't synchronize access to a single place with a lock.
Later you can even scale out workers that process the queue.
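Roughly, the pattern looks like this (all names invented): a single worker drains a shared queue in batches, so the producers never touch the merged result directly.

```cpp
#include <condition_variable>
#include <map>
#include <mutex>
#include <queue>
#include <string>
#include <vector>

// One recorded modification: which variable changed, and its new value.
struct Modification { std::string variable; std::string value; };

class ModificationQueue {
 public:
  void Push(Modification m) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      queue_.push(std::move(m));
    }
    cv_.notify_one();
  }

  // Blocks until at least one item is available, then drains up to
  // max_batch items so the worker can apply them in one go.
  std::vector<Modification> PopBatch(size_t max_batch) {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return !queue_.empty(); });
    std::vector<Modification> batch;
    while (!queue_.empty() && batch.size() < max_batch) {
      batch.push_back(std::move(queue_.front()));
      queue_.pop();
    }
    return batch;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<Modification> queue_;
};

// The single consumer: the only code that touches the merged context,
// so the producers never contend on it.
void Worker(ModificationQueue& q, std::map<std::string, std::string>& merged) {
  for (;;) {
    for (Modification& m : q.PopBatch(/*max_batch=*/100))
      merged[m.variable] = std::move(m.value);
  }
}
```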
I'm starting a new application and I want to use CQRS and event sourcing. I get the idea of replaying events to recreate aggregates and snapshotting to speed that up if needed, using in-memory models, caching, etc.
My question is regarding large read models I don't want to hold in memory. Suppose I have an application where I sell products, and I want to listen to a stream of events like "ProductRegistered" and "ProductSold" and build a table in a relational database that will be used for reporting or integration with another system. Suppose there are lots of records, this table may take from a few seconds to minutes to truncate/rebuild, and the application maintains dozens of these projections for multiple purposes.
How does one handle the consistency of the projections in this scenario?
With in-memory data, it's quite simple and fast to replay the events. But I feel that external projections kept on disk will be much slower to rebuild.
1. Should I always start my application with a TRUNCATE TABLE + rebuild for every external projection? This seems impractical to me over time, but I may be worried about a problem I don't have yet.
2. Since the table is itself like a snapshot, I could keep a "control table" recording the last event I handled for that projection, so I can replay only what's needed. But I'm worried about inconsistencies if the application or database crashes. It seems that checking the consistency of the table and rebuilding it would amount to the same work, which points back to solution 1.
How would you handle that in a way that is maintainable over time? Are there better solutions?
Thank you very much.
One way to handle this is the concept of checkpointing. Essentially either your event stream or your whole system has a version number (checkpoint) that increments with each event.
For each projection, you store the last committed checkpoint that was applied. At startup, you pull events greater than the last checkpoint number that was applied to the projection, and continue building your projection from there. If you need to rebuild your projection, you delete the data AND the checkpoint and rerun the whole stream (or set of streams).
Caution: the last applied checkpoint and the projection's read models need to be persisted in a single transaction to ensure they do not get out of sync.
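A sketch of that transactional step, using the SQLite C API just for concreteness; the "checkpoints" table and the SQL are invented, and real code should use prepared statements rather than string concatenation:

```cpp
#include <cstdio>
#include <string>

#include <sqlite3.h>

// Apply a batch of events to a projection and advance its checkpoint in
// ONE transaction, so the read model and the checkpoint cannot disagree.
bool ApplyBatch(sqlite3* db, const std::string& projection,
                long long new_checkpoint,
                const std::string& projection_update_sql) {
  char* err = nullptr;
  if (sqlite3_exec(db, "BEGIN", nullptr, nullptr, &err) != SQLITE_OK) {
    std::fprintf(stderr, "begin failed: %s\n", err);
    sqlite3_free(err);
    return false;
  }

  // 1. Update the projection's table (e.g. the reporting table).
  // 2. Record the checkpoint of the last event applied to it.
  // (Concatenated SQL for brevity only; use prepared statements.)
  const std::string sql =
      projection_update_sql + ";" +
      "UPDATE checkpoints SET last_event = " + std::to_string(new_checkpoint) +
      " WHERE projection = '" + projection + "';";

  if (sqlite3_exec(db, sql.c_str(), nullptr, nullptr, &err) != SQLITE_OK) {
    std::fprintf(stderr, "apply failed: %s\n", err);
    sqlite3_free(err);
    sqlite3_exec(db, "ROLLBACK", nullptr, nullptr, nullptr);
    return false;
  }
  return sqlite3_exec(db, "COMMIT", nullptr, nullptr, nullptr) == SQLITE_OK;
}
```

If the process crashes mid-batch, the transaction rolls back and the checkpoint still points at the last fully applied event, so startup recovery is just "pull events greater than the checkpoint" with no consistency check needed.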
Background
I have a couple of projects that use a SQLite DB for data. The data is, naturally, stored across several tables, linked by key/foreign-key values.
The thing is that in these databases, if one record changes, I have to update several other tables. The best example off the top of my head is deleting a record: I have to make sure all other records related to the one being deleted are deleted as well. Now, this example can be solved using key/foreign-key values, I believe, but what about more complicated updates?
Now I'm no pro DB admin, but I know that there needs to be data integrity in the DB or things get ugly.
The Question
So, my question. I know that I have greater control when updating related tables programmatically, but at the cost of human error and time. I may miss something or not implement the table updates correctly, and it takes a lot longer to code the updates. On the other hand, I can put in triggers and let the DB handle the updates to other tables, but then I lose a lot of control.
So, which one is better? Is each better in different situations?
"On the other hand, I can put in triggers and let the DB handle the updates to other tables, but then I lose a lot of control."
What control do you think you're losing? If data integrity requires that "such-and-such an update here requires additional updates there and there", you're not losing control by coding that in a trigger. You're centralizing control, and delegating it to the dbms, which is the only piece of software that can guarantee every application follows those requirements.
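For the delete example from the question, that kind of centralized rule might look like this in SQLite (the tables are invented): a foreign key handles the simple cascade, and a trigger covers what foreign keys can't express.

```cpp
#include <cassert>

#include <sqlite3.h>

int main() {
  sqlite3* db = nullptr;
  assert(sqlite3_open(":memory:", &db) == SQLITE_OK);

  // Foreign keys are off by default in SQLite; turn them on per connection.
  const char* schema =
      "PRAGMA foreign_keys = ON;"
      // Invented example tables: deleting an order must remove its items.
      "CREATE TABLE orders (id INTEGER PRIMARY KEY);"
      "CREATE TABLE order_items ("
      "  id INTEGER PRIMARY KEY,"
      "  order_id INTEGER NOT NULL"
      "    REFERENCES orders(id) ON DELETE CASCADE);"
      // For updates that foreign keys can't express, a trigger keeps the
      // rule inside the database, where every application inherits it.
      "CREATE TABLE order_totals (order_id INTEGER, total REAL);"
      "CREATE TRIGGER drop_total AFTER DELETE ON orders "
      "BEGIN "
      "  DELETE FROM order_totals WHERE order_id = OLD.id; "
      "END;";

  char* err = nullptr;
  int rc = sqlite3_exec(db, schema, nullptr, nullptr, &err);
  assert(rc == SQLITE_OK);

  // Any client that deletes from orders, written in any language, now
  // gets the related rows cleaned up automatically.
  sqlite3_exec(db, "DELETE FROM orders WHERE id = 1;", nullptr, nullptr, nullptr);
  sqlite3_close(db);
}
```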
"I know that I have greater control when updating related tables programmatically, but at the cost of human error and time. I may miss something or not implement the table updates correctly, and it takes a lot longer to code the updates."
You're thinking like a programmer, not a database designer. (That's an observation, not a criticism.) Don't think, "I might miss something". That way of thinking really misses the mark.
Instead, when you're tempted to delegate data integrity to application code, think "Every programmer and every new or changed application that hits this database from now until the end of time has to get it perfectly right."
Now, honestly, does that really sound like a good idea to you?
(The last Fortune 500 company I worked in had programs written in at least two dozen different languages hitting their OLTP database.)
I am working on syncing two business objects between an iPhone and a Web site using an XML-based payload and would love to solicit some ideas for an optimal routine.
The nature of this question is fairly generic, though, and I can see it being applicable to a variety of different systems that need to sync business objects between a web entity and a client (desktop, mobile phone, etc.).
The business objects can be edited, deleted, and updated on both sides. Both sides can store the object locally, but the sync is only initiated on the iPhone side, for disconnected viewing. All objects have updated_at and created_at timestamps and are backed by an RDBMS on both sides (SQLite on the iPhone side and MySQL on the web; again, I don't think this matters much), and the phone does record the last time a sync was attempted. Otherwise, no other data is stored (at the moment).
What algorithm would you use to minimize network chatter between the systems when syncing? How would you handle deletes if "soft deletes" are not an option? What data model changes would you add to facilitate this?
The simplest approach: when syncing, transfer all records where updated_at >= #last_sync_at. Downside: this approach doesn't tolerate clock skew very well at all.
It is probably safer to keep a version number column that is incremented each time a row is updated (so that clock skew doesn't foul your sync process) and a last-synced version number (so that potentially conflicting changes can be identified). To make this bandwidth-efficient, keep a cache in each database of the last version sent to each replication peer so that only modified rows need to be transmitted. If this is going to be a star topology, the leaves can use a simplified schema where the last synced version is stored in each table.
Some form of soft delete is required in order to support syncing deletes; however, this can be in the form of a "tombstone" record which contains only the key of the deleted row. Tombstones can only be safely deleted once you are sure that all replicas have processed them; otherwise it is possible for a straggling replica to resurrect a record you thought was deleted.
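Putting the version column and tombstones together, a toy model of one replica might look like this; conflict detection against the last-synced version is omitted for brevity, and all names are invented:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// A per-database version counter is stamped on every insert/update, and
// deletes leave a tombstone carrying only the key.
struct Row {
  std::string key;
  std::string value;
  uint64_t version;  // version at which this row last changed
  bool tombstone;    // true if this entry records a delete
};

struct Replica {
  uint64_t current_version = 0;
  std::map<std::string, Row> rows;

  void Upsert(const std::string& key, const std::string& value) {
    rows[key] = Row{key, value, ++current_version, false};
  }

  void Delete(const std::string& key) {
    // Keep a tombstone instead of physically removing the row, so the
    // delete itself can be shipped to peers on the next sync.
    rows[key] = Row{key, "", ++current_version, true};
  }

  // Only rows changed since the peer's last-synced version go on the wire.
  std::vector<Row> ChangesSince(uint64_t last_synced_version) const {
    std::vector<Row> out;
    for (const auto& [key, row] : rows)
      if (row.version > last_synced_version) out.push_back(row);
    return out;
  }

  // A leaf in a star topology can apply incoming changes directly; the hub
  // would keep tombstones until every peer has seen them.
  void ApplyChanges(const std::vector<Row>& changes) {
    for (const Row& r : changes) {
      if (r.tombstone) {
        rows.erase(r.key);
      } else {
        Upsert(r.key, r.value);
      }
    }
  }
};
```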
So, in summary, I think your questions relate to disconnected synchronization.
So here is what I think should happen:
Initial sync: You retrieve the data and any information associated with it (row versions, file checksums, etc.). It is important that you store this information and leave it pristine until the next successful sync. Changes should be made on a COPY of this data.
Tracking changes: If you are dealing with database rows, the idea is that you basically have to track insert, update, and delete operations. If you are dealing with text files like XML, it's slightly more complicated. If it is likely that multiple users will edit the file at the same time, you will need a diff tool so that conflicts can be detected at a more granular level (instead of at the level of the whole file).
Checking for conflicts: Again, if you are just dealing with database rows, conflicts are easy to detect. You can have an extra column that increments whenever the row is updated (I think MS SQL has this built in; not sure about MySQL). If the copy you have carries a different number than what's on the server, you have a conflict. For files or strings, a checksum will do the job. I suppose you could also use the modified date, but make sure it is very precise and accurate, to prevent misses. For example: say I retrieve a file and you save it as soon as I retrieved it, with a time difference of one millisecond. I then make changes to the file and try to save it. If the recorded last-modified time is accurate only to ten milliseconds, there is a good chance that the file I retrieved has the same modified date as the one you saved, so the program thinks there's no conflict and overwrites your changes. So I generally don't use this method, just to be on the safe side. On the other hand, the chance of a checksum/hash collision after a minor modification is close to none.
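As a small illustration of the checksum idea; FNV-1a here is just a convenient stand-in for whatever hash you prefer:

```cpp
#include <cstdint>
#include <string>

// FNV-1a (64-bit): a simple stand-in checksum; any collision-resistant
// hash works as well or better.
uint64_t Fnv1a(const std::string& data) {
  uint64_t hash = 14695981039346656037ull;  // FNV offset basis
  for (unsigned char c : data) {
    hash ^= c;
    hash *= 1099511628211ull;  // FNV prime
  }
  return hash;
}

// The client remembers the checksum of the content as it was retrieved.
// Before saving, compare that against the server's CURRENT checksum: if
// they differ, someone else changed it in the meantime, i.e. a conflict.
bool HasConflict(uint64_t checksum_at_retrieval,
                 const std::string& current_server_content) {
  return Fnv1a(current_server_content) != checksum_at_retrieval;
}
```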
Resolving conflicts: Now this is the tricky part. If this is an automated process, you have to assess the situation and decide whether to overwrite the changes, lose your changes, or retrieve the data from the server again and attempt to redo the changes. Luckily for you, it seems there will be human interaction, but it's still a lot of pain to code. If you are dealing with database rows, you can check each individual column, compare it against the data on the server, and present the differences to the user. The idea is to present conflicts to the user in a very granular way so as not to overwhelm them. Most conflicts have very small differences in many different places, so present them to the user one small difference at a time. For text files it's almost the same, but a hundred times more complicated. You would have to create or use a diff tool (text comparison is a whole different subject and too broad to cover here) that identifies the small changes in the file and where they are, in a similar fashion as with a database: where text was inserted, deleted, or edited. Then present that to the user in the same way. So basically, for each small conflict, the user chooses whether to discard their changes, overwrite the changes on the server, or perform a manual edit before sending to the server.
If you have done things right, the user should be given a list of conflicts, if there are any, and these conflicts should be granular enough for the user to decide quickly. For example, if the conflict is a spelling change, it is easier for the user to choose between two spellings of a word than to be given the whole paragraph, told that something changed, and left to hunt for the small misspelling.
Other considerations:
Data validation: keep in mind that you have to perform validation after resolving conflicts, since the data might have changed.
Text comparison: like I said, this is a big subject, so google it!
Disconnected synchronization: I think there are a few articles out there.
Source: https://softwareengineering.stackexchange.com/questions/94634/synchronization-web-service-methodologies-or-papers