Syncing objects between two disparate systems, best approach? - iphone

I am working on syncing two business objects between an iPhone and a Web site using an XML-based payload and would love to solicit some ideas for an optimal routine.
The nature of this question is fairly generic though and I can see it being applicable to a variety of different systems that need to sync business objects between a web entity and a client (desktop, mobile phone, etc.)
The business objects can be edited, deleted, and updated on both sides. Both sides can store the object locally but the sync is only initiated on the iPhone side for disconnected viewing. All objects have an updated_at and created_at timestamp and are backed by an RDBMS on both sides (SQLite on the iPhone side and MySQL on the web... again I don't think this matters much) and the phone does record the last time a sync was attempted. Otherwise, no other data is stored (at the moment).
What algorithm would you use to minimize network chatter between the systems for syncing? How would you handle deletes if "soft-deletes" are not an option? What data model changes would you add to facilitate this?

The simplest approach: when syncing, transfer all records where updated_at >= #last_sync_at. Down side: this approach doesn't tolerate clock skew very well at all.
It is probably safer to keep a version number column that is incremented each time a row is updated (so that clock skew doesn't foul your sync process) and a last-synced version number (so that potentially conflicting changes can be identified). To make this bandwidth-efficient, keep a cache in each database of the last version sent to each replication peer so that only modified rows need to be transmitted. If this is going to be a star topology, the leaves can use a simplified schema where the last synced version is stored in each table.
Some form of soft-delete is required in order to support sync of deletes. However, this can be in the form of a "tombstone" record which contains only the key of the deleted row. Tombstones can only be safely deleted once you are sure that all replicas have processed them; otherwise it is possible for a straggling replica to resurrect a record you thought was deleted.
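The version-number and tombstone ideas above can be sketched as follows. This is only a minimal illustration, assuming a single per-database counter and a one-way pull from the server to a leaf; the table names (items, tombstones, meta) and the helper functions are made up for the example, and the leaf is modelled as a plain dict.

```python
import sqlite3

def make_db():
    # Hypothetical schema: a per-database change counter, live rows, and tombstones.
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE items (id TEXT PRIMARY KEY, body TEXT, version INTEGER NOT NULL);
        CREATE TABLE tombstones (id TEXT PRIMARY KEY, version INTEGER NOT NULL);
        CREATE TABLE meta (key TEXT PRIMARY KEY, value INTEGER NOT NULL);
        INSERT INTO meta VALUES ('clock', 0);
    """)
    return db

def next_version(db):
    # Increment the counter on every change; no wall-clock time is involved.
    db.execute("UPDATE meta SET value = value + 1 WHERE key = 'clock'")
    return db.execute("SELECT value FROM meta WHERE key = 'clock'").fetchone()[0]

def upsert(db, item_id, body):
    version = next_version(db)
    db.execute("INSERT OR REPLACE INTO items VALUES (?, ?, ?)", (item_id, body, version))
    db.execute("DELETE FROM tombstones WHERE id = ?", (item_id,))

def delete(db, item_id):
    # Remove the row but leave a tombstone holding only its key and version.
    version = next_version(db)
    db.execute("DELETE FROM items WHERE id = ?", (item_id,))
    db.execute("INSERT OR REPLACE INTO tombstones VALUES (?, ?)", (item_id, version))

def changes_since(db, last_seen):
    # Return only what changed after the version the peer last received.
    rows = db.execute(
        "SELECT id, body, version FROM items WHERE version > ? ORDER BY version",
        (last_seen,)).fetchall()
    dead = db.execute(
        "SELECT id, version FROM tombstones WHERE version > ? ORDER BY version",
        (last_seen,)).fetchall()
    high = max([last_seen] + [r[2] for r in rows] + [d[1] for d in dead])
    return rows, dead, high

def apply_changes(replica, rows, dead):
    # The leaf replica is a plain dict here to keep the sketch short.
    for item_id, body, _version in rows:
        replica[item_id] = body
    for item_id, _version in dead:
        replica.pop(item_id, None)

server, phone = make_db(), {}
upsert(server, "a", "one")
upsert(server, "b", "two")
rows, dead, last_seen = changes_since(server, 0)
apply_changes(phone, rows, dead)
delete(server, "a")
rows2, dead2, last_seen2 = changes_since(server, last_seen)
apply_changes(phone, rows2, dead2)
```

The second sync transfers a single tombstone rather than the whole table. Purging tombstones is left out on purpose, since it is only safe once every replica has reported a last-seen version past them.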

So I think in summary your questions relate to disconnected synchronization.
So here is what I think should happen:
Initial Sync: You retrieve the data and any information associated with it (row versions, file checksums, etc.). It is important that you store this information and leave it pristine until the next successful sync. Changes should be made on a COPY of this data.
Tracking Changes: If you are dealing with database rows, the idea is that you basically have to track insert, update and delete operations. If you are dealing with text files like XML, then it's slightly more complicated. If it is likely that multiple users will edit this file at the same time, then you would need a diff tool, so that conflicts can be detected at a more granular level (instead of for the whole file).
Checking for conflicts: Again, if you are just dealing with database rows, conflicts are easy to detect. You can have another column that increments whenever the row is updated (I think MSSQL has this built in; not sure about MySQL). So if the copy you have has a different number than what's on the server, then you have a conflict. For files or strings, a checksum will do the job. I suppose you could also use the modified date, but make sure that you have a very precise and accurate measurement to prevent misses. For example: let's say I retrieve a file and you save it as soon as I have retrieved it. Let's say the time difference is 1 millisecond. I then make changes to the file and try to save it. If the recorded last-modified time is accurate only to 10 milliseconds, there is a good chance that the file I retrieved will have the same modified date as the one you saved, so the program thinks there's no conflict and overwrites your changes. So I generally don't use this method, just to be on the safe side. On the other hand, the chance of a checksum/hash collision after a minor modification is close to none.
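As a rough illustration of the checksum approach for files or strings, the sketch below stores a hash of the content at retrieval time and compares it with the server's current content before saving. The function names are invented for the example, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib

def checksum(text: str) -> str:
    # Hash of the content as it looked when it was retrieved.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def has_conflict(base_checksum: str, server_text: str) -> bool:
    # A conflict exists if the server copy no longer matches what we started from.
    return checksum(server_text) != base_checksum

base = checksum("Dear customer, your order has shipped.")
unchanged = has_conflict(base, "Dear customer, your order has shipped.")
changed = has_conflict(base, "Dear customer, your order has been shipped.")
```

Unlike a last-modified timestamp, this check does not depend on clock resolution: any edit to the server copy, however soon after retrieval, changes the hash.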
Resolving conflicts: Now this is the tricky part. If this is an automated process, then you would have to assess the situation and decide whether you want to overwrite the server's changes, lose your own changes, or retrieve the data from the server again and attempt to redo the changes. Luckily for you, it seems that there will be human interaction. But it's still a lot of pain to code. If you are dealing with database rows, you can check each individual column, compare it against the data on the server, and present the differences to the user. The idea is to present conflicts to the user in a very granular way so as not to overwhelm them. Most conflicts consist of very small differences in many different places, so present them to the user one small difference at a time. For text files, it's almost the same but a hundred times more complicated. Basically you would have to create or use a diff tool (text comparison is a whole different subject and is too broad to cover here) that tells you about the small changes in the file and where they are, in a similar fashion as in a database: where text was inserted, deleted or edited. Then present that to the user in the same way. So for each small conflict, the user would have to choose whether to discard their changes, overwrite the changes on the server, or perform a manual edit before sending to the server.
So if you have done things right, the user should be given a list of conflicts, if there are any. These conflicts should be granular enough for the user to decide quickly. For example, if the conflict is a spelling change, it is easier for the user to choose between the two spellings of the word than to be given the whole paragraph, told that there was a change, and left to hunt for the small misspelling before deciding what to do.
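For database rows, the column-by-column comparison described above might look like the following sketch. It assumes the pristine copy from the last sync is available as `base`, and that all three versions of the row are dicts with the same keys; the function name and the sample data are hypothetical.

```python
def field_conflicts(base, local, remote):
    # base: pristine copy from the last sync; local/remote: the two edited versions.
    merged, conflicts = dict(base), {}
    for field in base:
        local_changed = local[field] != base[field]
        remote_changed = remote[field] != base[field]
        if local_changed and remote_changed and local[field] != remote[field]:
            # Both sides edited the same column differently: ask the user.
            conflicts[field] = (local[field], remote[field])
        elif local_changed:
            merged[field] = local[field]
        elif remote_changed:
            merged[field] = remote[field]
    return merged, conflicts

base = {"name": "Bob", "city": "Oslo", "phone": "1"}
local = {"name": "Rob", "city": "Oslo", "phone": "2"}
remote = {"name": "Bobby", "city": "Bergen", "phone": "1"}
merged, conflicts = field_conflicts(base, local, remote)
```

Only the name column needs a decision from the user here; the city and phone edits do not overlap, so they can be merged without asking.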
Other considerations:
Data Validation - keep in mind that you have to perform validation after resolving conflicts, since the data might have changed.
Text Comparison - like I said, this is a big subject, so google it!
Disconnected Synchronization - I think there are a few articles out there.
Source: https://softwareengineering.stackexchange.com/questions/94634/synchronization-web-service-methodologies-or-papers

Related

Event Sourcing - How to query inside a command?

We would like to be able to read state inside a command use case.
We could get the state from event store for the specific aggregate, but what about querying aggregates by field(not id) or performing more complicated queries, that are not fitted for the event store?
The approach we were thinking was to use our read model for those cases as well and not only for query use cases.
This might be inconsistent, so a solution could be to have the latest version of the aggregate stored in both write/read models, in order to be able to tell if the state is correct or stale.
Does this make sense and if yes, if we need to get state by Id should we use event store or the read model?
If you want the absolute latest state of an event-sourced aggregate, you're going to have to read the latest snapshot (assuming that you are snapshotting) and then replay events since that snapshot from the event store. You can be aggressive about snapshotting (conceivably even saving a snapshot after every command), but you're giving away some write performance to make the read faster.
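A minimal sketch of "load the latest snapshot, then replay the events since that snapshot", assuming a toy account aggregate. The event names and the convention that the n-th event in the stream has version n are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Snapshot:
    version: int   # number of events already folded into this state
    balance: int

def apply_event(balance, event):
    # Hypothetical event shapes: ("deposited", amount) and ("withdrawn", amount).
    kind, amount = event
    if kind == "deposited":
        return balance + amount
    if kind == "withdrawn":
        return balance - amount
    raise ValueError(f"unknown event type: {kind}")

def load_current(snapshot, events):
    # Replay only the events recorded after the snapshot was taken.
    balance = snapshot.balance
    for event in events[snapshot.version:]:
        balance = apply_event(balance, event)
    return Snapshot(version=len(events), balance=balance)

stream = [("deposited", 100), ("withdrawn", 30), ("deposited", 5)]
from_snapshot = load_current(Snapshot(version=2, balance=70), stream)
from_scratch = load_current(Snapshot(version=0, balance=0), stream)
```

Both calls arrive at the same state; the snapshot only shortens the replay. Saving the returned value as the new snapshot after every command is the aggressive variant mentioned above.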
Updating the read model directly is conceivably possible, though that level of coupling is something that should be considered very carefully. Note also that you will very likely need some sort of two-phase commit to ensure that the read model is only updated when the write model is updated and vice versa. I strongly suggest considering why you're using CQRS/ES in this project, because you are quite possibly undermining that reason by doing this sort of thing.
In general, if you need a query for processing a particular command, it's likely that query will generally be the same, i.e. you don't need free-form query support. In that case, you can often have a read model that's tuned for exactly that query and which only cares about events which could affect that query: often a fairly small subset of the events. The finer-grained the read model, the easier it is to keep in sync (if it ignores 99% of events, for instance, it can't really fall that far behind).
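As an illustration of a read model tuned for exactly one query, the sketch below answers "is this email already taken?" and ignores every event that cannot affect the answer. The class name, event types and field names are all hypothetical.

```python
class EmailIndex:
    # Only these (hypothetical) event types can change the answer to the query.
    RELEVANT = {"UserRegistered", "EmailChanged"}

    def __init__(self):
        self.owner_by_email = {}
        self.email_by_owner = {}

    def handle(self, event):
        if event["type"] not in self.RELEVANT:
            return  # the vast majority of events are skipped
        user, email = event["user_id"], event["email"]
        old = self.email_by_owner.pop(user, None)
        if old is not None:
            self.owner_by_email.pop(old, None)
        self.owner_by_email[email] = user
        self.email_by_owner[user] = email

    def is_taken(self, email):
        return email in self.owner_by_email

index = EmailIndex()
for event in [
    {"type": "UserRegistered", "user_id": 1, "email": "a@example.com"},
    {"type": "OrderPlaced", "user_id": 1, "order_id": 7},
    {"type": "EmailChanged", "user_id": 1, "email": "b@example.com"},
]:
    index.handle(event)
```

Because the projection does so little work per event, it is cheap to keep it close to the head of the event stream.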
Needing to make complex queries as part of command processing could also be a sign that your aggregate boundaries aren't right and could do with a re-examination.
Does this make sense
Maybe. Let's start with
This might be inconsistent
Yup, they might be. So what?
We typically respond to a query by sending an unlocked copy of the answer. In other words, it's possible that the actual information in the write model will change after this response is dispatched but before the response arrives at its destination. The client will be looking at a copy of the answer taken from the past.
So we might reasonably ask how much better it is to get information no more than one minute old compared to information no more than five minutes old. If the difference in value is pennies, then you should probably deploy the five minute version. If the difference is millions of dollars, then you're in a good position to negotiate a real budget to solve the problem.
For processing a command in our own write model, that kind of inconsistency isn't usually acceptable or wise. But neither of the two common answers require keeping the read and write models synchronized. The most common answer is to just work with the write model alone. The less common answer is to grab a snapshot out of a cache, and then apply any additional events to it to bring it up to date. The latter approach is "just" a performance optimization (first rule: don't.)
The variation that trips everyone up is trying to process a command somewhere else, enforcing a consistency rule on our data here. Once again, you need a really clear picture of how valuable the consistency is to the business. If it's really important, that may be a signal that the information in question shouldn't be split into two different piles - you may be working with the wrong underlying data model.
Possibly useful references
Pat Helland Data on the Outside Versus Data on the Inside
Udi Dahan Race Conditions Don't Exist

Simulating an Oracle sequence with MongoDB

Our domain model deals with sales invoices, each of which has a unique, automatically generated number. When creating an invoice, our SalesInvoiceService retrieves a number from a SalesInvoiceNumberGenerator, creates a SalesInvoice using this number and a few other objects (seller, buyer, issue date, etc.) and stores it through the SalesInvoiceRepository. Since we are using MongoDB as our database, our MongoDbSalesInvoiceNumberGenerator uses a findAndModify command with $inc 1 on a given InvoicePolicies.nextSalesInvoiceNumber to generate this unique number, similar to what we would do using an Oracle sequence.
This works in normal situations. However, when invoice creation fails because of a broken business rule (e.g. invalid issue date), an exception is thrown and our InvoicePolicies.nextSalesInvoiceNumber has already been incremented. Obviously, since there is no transaction managing this unit of work, this increment is not rolled back, so we end up with lost invoice numbers. We do offer a manual compensation mechanism to the user, but we would like to avoid this sort of situation in the first place.
How would you deal with this situation? And no, switching to another database is not an option :)
Thanks!
TL;DR: What you want is strict serializability, but you probably won't get it, unless you give up concurrency completely (then you even get linearizability, theoretically). Gap-free is easy, but making sure that today's invoice doesn't get a lower number than yesterday's is practically impossible.
This is tricky, or at least, very expensive. That is also true for any other data store, because you'll have to limit the concurrency of the application to guarantee it. Think of an auto-increasing stamp that is passed around in an office, but some office workers lose letters. Tricky... But you can reduce the likelihood.
Generating sequences without gaps is hard when contention is high, and very hard in a distributed system. Keeping a lock for the entire time the invoice is generated is usually not an option, though that would be easy. So let's try that:
Easiest way out: Use a singleton background worker, i.e. a single-threaded process that runs on a single machine. Have it explicitly check whether the current number is really present in the invoice collection. Because it's single-threaded on a single machine, it can't have race conditions. Done, via limiting concurrency.
When allowing concurrency, things get messy:
It might be best to use something like a two-phase commit protocol. Essentially, make the entire invoice creation process a long-running transaction, and store the pending transactions explicitly, i.e. store all numbers that haven't been used yet, but reserved.
Then track the completion status of each and every transaction. If a transaction hasn't finished after some timeout, consider that number available again. It's hard enough to add that to the counter code, but it's possible (check if a timed out transaction is present, otherwise get a new counter value).
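The reserve, complete and time-out cycle described above can be sketched in memory as follows. This is not MongoDB code: the class and its methods are invented for illustration, it is single-threaded, and a real implementation would need each step to be atomic in the database. The reservation token is an addition of this sketch, used to reject a late commit from a worker whose reservation has already been handed to someone else.

```python
class InvoiceNumberAllocator:
    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.next_number = 1
        self.next_token = 1
        self.pending = {}  # number -> (reserved_at, token)

    def reserve(self, now):
        # Prefer a number whose reservation timed out over a fresh counter value.
        expired = sorted(n for n, (t, _) in self.pending.items()
                         if now - t >= self.timeout)
        if expired:
            number = expired[0]
        else:
            number = self.next_number
            self.next_number += 1
        token = self.next_token
        self.next_token += 1
        self.pending[number] = (now, token)
        return number, token

    def commit(self, number, token):
        # Only the current holder of the reservation may complete it.
        entry = self.pending.get(number)
        if entry is None or entry[1] != token:
            return False
        del self.pending[number]
        return True

alloc = InvoiceNumberAllocator(timeout_seconds=30)
n1, t1 = alloc.reserve(now=0)     # this worker fails and never commits
n2, t2 = alloc.reserve(now=1)
ok2 = alloc.commit(n2, t2)
n3, t3 = alloc.reserve(now=40)    # reuses the timed-out number
late = alloc.commit(n1, t1)       # the stale worker is rejected
ok3 = alloc.commit(n3, t3)
n4, _ = alloc.reserve(now=41)
```

The hard-coded timeout is exactly the assumption about invoice generation time that the answer goes on to warn about.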
There are several possible errors, but they can all be resolved. This is better explained in the link and on the net. Generally, getting the implementation right is hard though.
The timeout poses a problem, however, because you need to hard-code an assumption about the time it takes for invoices to be generated. That can be awkward close to day/month/year barriers, since you'll want to avoid creating invoice 12345 in 2015 and 12344 in 2014.
Even this won't guarantee gap free numbers for limited time intervals: if no more request is made that could use the gap number in the current year, you're facing a problem.
I wonder whether findAndModify combined with the new Transactions API could achieve something like that, while also accounting for gaps, if it is run within a transaction. I haven't personally tried it, and my project isn't far enough along yet to worry about the billing system, but I would love to be able to use the same database for everything to make things a bit easier to operate.
One problem, I would think, is a write bottleneck, but the write should only take a few milliseconds, I'd imagine, and you could probably use a different counter for every jurisdiction or store, as real-life stores do. The cash register number could then be part of it too; in the digital world, I guess the cash register number could be the transaction-processing server the request went to, if you used microservices for example, so you could probably load-balance round robin between them. That assumes it uses a per-document lock, which, from my understanding, it possibly does.
The main time I'd worry about this bottleneck is if you had a very popular store, around Black Friday when there's a huge spike, or when doing recurring invoices.

MongoDB: Switch database/collection referenced by a given name on the fly

My application needs only read access to all of its databases. One of those databases (db_1) hosts a collection coll_1 whose entire contents* need to be replaced periodically**.
My goal is to have no or very little effect on read performance for servers currently connected to the database.
Approaches I could think of with so far:
1. renameCollection
Build a temporary collection coll_tmp, then use renameCollection with dropTarget: true to move its contents over to coll_1. The downside of this approach is that as far as I can tell, renameCollection does not copy indexes, so once the collection is renamed, coll_1 would need reindexing. While I don't have a good estimate of how long this would take, I would think that query-performance will be significantly affected until reindexing is complete.
2. TTL Index
Instead of straight up replacing, use a time-to-live index to expire documents after the chosen replacement period. Insert new data every time period. This seems like a decent solution to me, except that for our specific application, old data is better than no data. In this scenario, if the cron job to repopulate the database fails for whatever reason, we could potentially be left with an empty coll_1 which is undesirable. I think this might have a negligible effect, but this solution also requires on-the-fly indexing as every document is inserted.
3. Communicate current database to read-clients
Simply use two different databases (or collections?) and inform connected clients which one is more recent. This solution would allow for finishing indexing the new coll_1_alt (and then coll_1 again) before making it available. I personally dislike the solution since it couples the read clients very closely to the database itself, and of course communication channels are always imperfect.
4. copyDatabase
Use copyDatabase to rename (designate) an alternate database db_tmp to db_1. db_tmp would also have a collection coll_1. Once reindexing is complete on db_tmp.coll_1, copyDatabase could be used to simply rename db_tmp to db_1. It seems that this would require dropping db_1 before renaming, leaving a window in which data won't be accessible.
Ideally (and naively), I'd just set db_1 to be something akin to a symlink, switching to the most current database as needed.
Does anyone have good suggestions on how to achieve the desired effect?
*There are about 10 million documents in coll_1.
** The current plan is to replace the collection once every 24 hours. The replacement interval might get as low as once every 30 minutes, but not lower.
The problem that you point out in option 4 you will also have with option 1. dropTarget will also mean that the collection is not available.
Another alternative could be to just have both the old and the new data in the same collection, and use a "version ID" that you then still have to communicate to your clients to do a query on. That at least stops you from having to do reindexing like you pointed out for option 1.
I think your best bet is actually option 3, and it's the most equivalent to changing a symlink, except it is on the client side.
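The "version ID" variant can be sketched in memory as below. It is only a model of the idea, not MongoDB code: the class and its methods are hypothetical, and in a real deployment the current version would have to live somewhere the read clients can look it up.

```python
class VersionedCollection:
    def __init__(self):
        self.docs = []               # every document carries a "version" field
        self.current_version = None  # plays the role of the symlink

    def load(self, version, rows):
        # Insert the new data set next to the old one; readers are unaffected.
        self.docs.extend({**row, "version": version} for row in rows)

    def publish(self, version):
        # Switch readers to the new data set in one step.
        previous, self.current_version = self.current_version, version
        return previous

    def purge(self, version):
        self.docs = [d for d in self.docs if d["version"] != version]

    def find(self, predicate):
        return [d for d in self.docs
                if d["version"] == self.current_version and predicate(d)]

coll = VersionedCollection()
coll.load(1, [{"sku": "x", "price": 10}])
coll.publish(1)
coll.load(2, [{"sku": "x", "price": 12}])
before = coll.find(lambda d: d["sku"] == "x")
old = coll.publish(2)
after = coll.find(lambda d: d["sku"] == "x")
coll.purge(old)
```

If the load of version 2 fails part-way, `publish` is never called and readers keep seeing version 1, which matches the "old data is better than no data" requirement in the question.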

Is a document/NoSQL database a good candidate for storing a balance sheet?

If I were to create a basic personal accounting system (because I'm like that - it's a hobby project about a domain I'm familiar enough with to avoid getting bogged-down in requirements), would a NoSQL/document database like RavenDB be a good candidate for storing the accounts and more importantly, transactions against those accounts? How do I choose which entity is the "document"?
I suspect this is one of those cases where a SQL database is actually the right fit and trying to go NoSQL is the mistake, but then when I think of what little I know of CQRS and event sourcing, I wonder if the entity/document is actually the Account, and the transactions are Events stored against it, and that when these "events" occur, maybe my application also then writes out to an easily queryable read store like a SQL database.
Many thanks in advance.
Personally, I think it is a good idea, but I am a little biased because my full-time job is building an accounting system which is based on CQRS, Event Sourcing, and a document database.
Here is why:
Event Sourcing and Accounting are based on the same principle. You don't delete anything, you only modify. If you add a transaction that is wrong, you don't delete it. You create an offset transaction. Same thing with events, you don't delete them, you just create an event that cancels out the first one. This means you are publishing a lot of TransactionAddedEvent.
Next, if you are doing double-entry accounting, the way you record a transaction is different from the way you view it on a screen (especially in a balance sheet). Hence my liking for CQRS again. We can store the data using correct accounting principles, but our read model can be optimized to show the data the way you want to view it.
In a balance sheet, you want to view all entries for a given account. You don't want to see the transaction because the transaction has two sides. You only want to see the entry that affects that account.
So in your document db you would have an entries collection.
This makes querying very easy. If you want to see all of the entries for an account you just say SELECT * FROM Entries WHERE AccountId = 1. I know that is SQL, but everyone understands the simplicity of this query. It is just as easy in a document db. Plus, it will be lightning fast.
You can then create a balance sheet with a query grouping by accountid, and setting a restriction on the date. Notice no joins are needed at all, which makes a document db a great choice.
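A small sketch of that grouping over an entries collection, with the documents modelled as plain dicts. The field names, the sample data, and the convention of signed integer amounts in cents (debits positive, credits negative) are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date

# Hypothetical entry documents: one per account affected by a transaction.
entries = [
    {"account_id": 1, "date": date(2024, 1, 5), "amount": 10000},
    {"account_id": 2, "date": date(2024, 1, 5), "amount": -10000},
    {"account_id": 1, "date": date(2024, 2, 1), "amount": -2500},
    {"account_id": 3, "date": date(2024, 2, 1), "amount": 2500},
]

def balances(entries, as_of):
    # Group by account and restrict on the date; no joins are involved.
    totals = defaultdict(int)
    for entry in entries:
        if entry["date"] <= as_of:
            totals[entry["account_id"]] += entry["amount"]
    return dict(totals)

january = balances(entries, as_of=date(2024, 1, 31))
february = balances(entries, as_of=date(2024, 2, 29))
```

With double-entry data the totals across all accounts sum to zero at any date, which makes a cheap sanity check on the entries collection.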
Theory and Architecture
If you dig around in accounting theory and history a while, you'll see that the "documents" ought to be the source documents -- purchase order, invoice, check, and so on. Accounting records are standardized summaries of those usually-human-readable source documents. An accounting transaction is two or more records that hit two or more accounts, tied together, with debits and credits balancing. Account balances, reports like a balance sheet or P&L, and so on are just summaries of those transactions.
Think of it as a layered architecture -- the bottom layer, the foundation, is the source documents. If the source is electronic, then it goes into the accounting system's document storage layer -- this is where a nosql db might be useful. If the source is a piece of paper, then image it and/or file it with an index number that is then stored in the accounting system's document layer. The next layer up is digital records summarizing those documents; each document is summarized by one or more unbalanced transaction legs. The next layer up is balanced transactions; each transaction is composed of two or more of those unbalanced legs. The top layer is the financial statements that summarize those balanced transactions.
Source Documents and External Applications
The source documents are the "single source of truth" -- not the records that describe them. You should always be able to rebuild the entire db from the source documents. In a way, the db is just an index into the source documents in the first place. Way too many people forget this, and write accounting software in which the transactions themselves are considered the source of truth. This causes a need for a whole 'nother storage and workflow system for the source documents themselves, and you wind up with a typical modern corporate mess.
This all implies that any applications that write to the accounting system should only create source documents, adding them to that bottom layer. In practice though, this gets bypassed all the time, with applications directly creating transactions. This means that the source document, rather than being in the accounting system, is now way over there in the application that created the transaction; that is fragile.
Events, Workflow, and Digitizing
If you're working with some sort of event model, then the right place to use an event is to attach a source document to it. The event then triggers that document getting parsed into the right accounting records. That parsing can be done programmatically if the source document is already digital, or manually if the source is a piece of paper or an unformatted message -- sounds like the beginnings of a workflow system, right? You still want to keep that original source document around somewhere though. A document db does seem like a good idea for that, particularly if it supports a schema where you can tie the source documents to their resulting parsed and balanced records and vice versa.
You can certainly create such a system.
In that scenario, you have the Account Aggregate, and you also have the TimePeriod Aggregate.
The time period is usually a Month, a Quarter or a Year.
Inside each TimePeriod, you have the Transactions for that period.
That means that loading the current state is very fast, and you have the full log in which you can go backward.
The reason for TimePeriod is that this is usually the boundary in which you actually think about such things.
In this case, a relational database is the most appropriate, since you have relational data (eg. rows and columns)
Since this is just a personal system, you are highly unlikely to have any scale or performance issues.
That being said, it would be an interesting exercise for personal growth and learning to use a document-based DB like RavenDB. Traditionally, finance has always been a very formal thing, and relational databases are typically considered more formal and rigorous than document databases. But, like you said, the domain for this application is under your control, and is fairly straight forward, so complexity and requirements would not get in the way of designing the system.
If it was my own personal pet project, and I wanted to learn more about a new-ish technology and see if it worked in a particular domain, I would go with whatever I found interesting and if it didn't work very well, then I learned something. But, your mileage may vary. :)

In a SQLite database is it better to use triggers to handle cascading table changes, or is it better to do it programmatically?

Background
I have a couple of projects that use a SQLite DB for data. The data stored in the databases are obviously stored across several tables, linked by key/foreign key values.
The thing is that in these databases, if something changes to one record I have to update several other tables. The best example off the top of my head is deleting a record. I have to make sure all other records related to the one being deleted are deleted as well. Now, this example can be solved using key/foreign key values, I believe, but what about more complicated updates?
Now I'm no pro DB admin, but I know that there needs to be data integrity in the DB or things get ugly.
The Question
So, my question. I know that I have greater control when updating related tables programmatically, but at the cost of human error and time. I may miss something or not implement the tables updates correctly and it takes a lot longer to code in the updates. On the other hand, I can put in triggers and let the DB handle the updates to other tables, but I then lose a lot of control.
So, which one is better? Is each better in different situations?
On the other hand, I can put in triggers and let the DB handle the updates to other tables, but I then lose a lot of control.
What control do you think you're losing? If data integrity requires that "such-and-such an update here requires additional updates there and there", you're not losing control by coding that in a trigger. You're centralizing control, and delegating it to the dbms, which is the only piece of software that can guarantee every application follows those requirements.
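To make "coding that in a trigger" concrete, here is a small sketch using Python's sqlite3 module. The schema (orders, order_lines) and the rule being enforced (an order's total must follow its lines) are invented for the example.

```python
import sqlite3

def build_db():
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, total INTEGER NOT NULL DEFAULT 0);
        CREATE TABLE order_lines (
            id INTEGER PRIMARY KEY,
            order_id INTEGER NOT NULL REFERENCES orders(id),
            amount INTEGER NOT NULL
        );
        -- The database itself keeps orders.total consistent with order_lines.
        CREATE TRIGGER order_lines_ai AFTER INSERT ON order_lines
        BEGIN
            UPDATE orders SET total = total + NEW.amount WHERE id = NEW.order_id;
        END;
        CREATE TRIGGER order_lines_ad AFTER DELETE ON order_lines
        BEGIN
            UPDATE orders SET total = total - OLD.amount WHERE id = OLD.order_id;
        END;
    """)
    return db

db = build_db()
# Application code only touches order_lines; it never updates orders.total.
db.execute("INSERT INTO orders (id) VALUES (1)")
db.execute("INSERT INTO order_lines (order_id, amount) VALUES (1, 250)")
db.execute("INSERT INTO order_lines (order_id, amount) VALUES (1, 100)")
db.execute("DELETE FROM order_lines WHERE amount = 100")
total = db.execute("SELECT total FROM orders WHERE id = 1").fetchone()[0]
```

Any other program that writes to order_lines, in whatever language, gets the same behaviour without having to reimplement the rule.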
I know that I have greater control when updating related tables programmatically, but at the cost of human error and time. I may miss something or not implement the tables updates correctly and it takes a lot longer to code in the updates.
You're thinking like a programmer, not a database designer. (That's an observation, not a criticism.) Don't think, "I might miss something". That way of thinking really misses the mark.
Instead, when you're tempted to delegate data integrity to application code, think "Every programmer and every new or changed application that hits this database from now until the end of time has to get it perfectly right."
Now, honestly, does that really sound like a good idea to you?
(The last Fortune 500 company I worked in had programs written in at least two dozen different languages hitting their OLTP database.)