Distributed database which allows custom CRDT merging - mongodb

I'm rather new to distributed databases, though I have already studied the related literature (e.g. the CAP theorem, CRDTs) and implemented a POC to allow scaling my application horizontally.
Now, however, I face a challenging problem. In order to scale the app horizontally, communication between services is done via a distributed queue. As background: I require a custom CRDT merge method to keep the data eventually consistent, and I need my application to work like a cache (loosely comparable to Redis).
The challenge now is that I also need to persist the data, which requires keeping the data in the application cache and the database eventually consistent. I've looked at Cassandra and found a ticket [1] where somebody tried to add support for custom CRDT merge functions (which, as I mentioned, I require for a reason). That never made it into Cassandra and seems to have a few issues left to resolve.
What are my options, either in the form of a concrete distributed database engine that allows custom merging, or an algorithm that could help solve the problem (e.g. in the form of a DB trigger or something like that)?
[1] https://issues.apache.org/jira/browse/CASSANDRA-6412

As far as I know, there are very few databases that allow you to specify your own custom conflict resolution algorithm. To be honest, the only one I really found (disclaimer: I'm not a Microsoft advocate) is Azure Cosmos DB. It has a MongoDB-compatible API and can be configured to use a multi-master (master-master) replication strategy in which you specify your own conflict resolution procedure, written in JavaScript. You can use it to define your own merge operation.
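For illustration, a custom resolution procedure could look roughly like the sketch below. The function signature follows the shape shown in the Cosmos DB documentation for custom conflict resolution stored procedures, but the counter-merge rule and the property names are assumptions for illustration, not a drop-in implementation:

    // Hypothetical custom conflict resolution stored procedure for Cosmos DB.
    // Cosmos DB calls it with the incoming item, the existing item, a tombstone
    // flag and the list of conflicting items; the merge rule below is a sketch.
    function resolveConflict(incomingItem, existingItem, isTombstone, conflictingItems) {
      var collection = getContext().getCollection();

      if (isTombstone) {
        // The incoming change is a delete; let it win in this sketch.
        return;
      }

      // Example merge rule: treat "counter" as a grow-only counter (G-Counter style)
      // and keep the element-wise maximum of the per-replica counts.
      var merged = incomingItem;
      if (existingItem && existingItem.counter) {
        merged.counter = merged.counter || {};
        Object.keys(existingItem.counter).forEach(function (replicaId) {
          var a = existingItem.counter[replicaId] || 0;
          var b = merged.counter[replicaId] || 0;
          merged.counter[replicaId] = Math.max(a, b);
        });
      }

      // Persist the merged document as the winner.
      var accepted = collection.upsertDocument(collection.getSelfLink(), merged,
        function (err) { if (err) { throw err; } });
      if (!accepted) {
        throw new Error('upsertDocument was not accepted; retry the resolution');
      }
    }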
If you look beyond database-native solutions to application-level ones, there are several tools, e.g. Akka (available in both JVM and .NET versions), which lets you write custom CRDTs on top of its distributed-data module. The JVM version additionally supports multi-datacenter persistence, which is conceptually closer to how commutative CRDTs work and can be integrated with a Cassandra backend.

I've implemented a MerkleClock CRDT at my merkle-crdt repository.
You could use an approach where, when you update a database record's column, you first fetch the column's current value, merge it with the CRDT representing your current state, and then, on save, serialise the merged CRDT as JSON and store it back in the database.
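A minimal sketch of that read-merge-write cycle (the table and column names, the db client and the merge function are assumptions for illustration):

    // Sketch of the read-merge-write cycle described above.
    // `db.query` is assumed to be any async client returning the first matching
    // row (or null); `merge` is your CRDT's merge function; `currentState` is
    // the in-memory CRDT of this node.
    async function saveWithCrdtMerge(db, id, currentState, merge) {
      // 1. Fetch the stored CRDT state (serialised as JSON).
      const row = await db.query('SELECT crdt_state FROM items WHERE id = $1', [id]);
      const stored = row ? JSON.parse(row.crdt_state) : null;

      // 2. Merge the stored state with the local state (merge must be
      //    commutative, associative and idempotent for this to converge).
      const merged = stored ? merge(stored, currentState) : currentState;

      // 3. Write the merged state back as JSON.
      await db.query(
        'INSERT INTO items (id, crdt_state) VALUES ($1, $2) ' +
        'ON CONFLICT (id) DO UPDATE SET crdt_state = $2',
        [id, JSON.stringify(merged)]
      );
      return merged;
    }

Note that without wrapping the read and write in a transaction (or a compare-and-set), two concurrent writers can still overwrite each other's rows; the CRDT merge only guarantees convergence if every writer merges before storing.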

Related

Using Sails.js with AWS DynamoDB....not ideal

I started working on a small POC and decided to give Sails.js a try :)
As part of the POC we wanted to use DynamoDB, since the project will eventually involve high scalability and we're not looking to hire a full-time MongoDB expert at this point.
We used the module: https://github.com/gadelkareem/sails-dynamodb
Problem is there is no documentation and the module does not even work...
It seems the sails ORM is not ideal for DynamoDB and requires writing custom DB services. Does anyone have experience with this?
I was very excited to come across Sails, but if it won't let us play nice with DynamoDB then it might very well be out as an option for us...
Anyone have experience with this or maybe something I'm missing?
One important plus of vogels is its excellent documentation.
The sails-dynamodb adapter is based on vogels, but not all of vogels' features are implemented in it. For example, vogels has expression filters.
Vogels is able to create tables; the adapter can't, so you need to duplicate the table schema in your Sails model files and in the DynamoDB shell.
Vogels has some types of its own, such as uuid, StringSet, NumberSet and TimeUUID (the adapter can use them too if you include the vogels and Joi libraries).
Vogels and the adapter have the same query capabilities (create, update, delete, find).
The adapter lets you switch to another database without changing your code, and it encapsulates establishing the database connection.
Conclusion: for most purposes the adapter is suitable for the job and you do not need to work with vogels directly.
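For illustration, a minimal vogels model using the features mentioned above might look roughly like this (the table name and attributes are made up; the types shown are the ones vogels documents):

    // Rough sketch of a vogels model using its own types (uuid, stringSet).
    var vogels = require('vogels');
    var Joi = require('joi');

    vogels.AWS.config.update({ region: 'us-east-1' });

    var Account = vogels.define('Account', {
      hashKey: 'id',
      timestamps: true,            // adds createdAt / updatedAt automatically
      schema: {
        id: vogels.types.uuid(),   // vogels-specific type
        email: Joi.string().email(),
        name: Joi.string(),
        roles: vogels.types.stringSet()
      }
    });

    // Unlike the sails-dynamodb adapter, vogels can create the table itself:
    vogels.createTables(function (err) {
      if (err) { return console.log('Error creating tables:', err); }
      Account.create({ email: 'foo@example.com', name: 'Foo' }, function () {});
    });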
Sails comes loaded with an ORM called "Waterline". There are some official Waterline adapters, such as the MongoDB, PostgreSQL and MySQL ones, and then there are some unofficial ones created by the community. I'd assume right now that DynamoDB is in the latter category, since I have not come across it before. With that being said, however, I would not take this experience as a reason to ditch Sails.js.
Sails.js is built with the intention that all of its components can be swapped out; this means you are not tied to a specific template engine, authentication library, etc., and that includes your choice of ORM.
Waterline is still being actively developed, but it sits at v0.12.1 as of writing this response. It isn't fully there yet, so there will still be the odd issue around!
My recommendation? Take a look at swapping out waterline for a different ORM. Keep the flexibility Sails gives you and change out the component that doesn't meet your criteria. There are still many benefits to Sails you can utilise.
Vogels might be worth checking out: https://github.com/ryanfitz/vogels
Turning off waterline: Is there a way to disable waterline and use a different ORM in sails.js?

How to evolve akka-persistence events in production?

Let's say we have designed our system using akka-persistence, and we now have events stored in the event store. While the system is in production, a new feature is requested. As a result, we find that the best way to go about it is to add or modify a field in an event, say by changing the field's name or type.
Now we have two versions of the event, the one already in production and the one in the new deployment, and they are not compatible. Recovery will fail if we try to replay events stored in the old format.
What is the best way to go about that, other than data migration?
This is definitely one of the bigger issues with using akka persistence in production. There has been a lot of discussion about this on the akka-user mailing list.
I would say that as long as the new feature only requires additional information, going with a serialization format that allows limited schema evolution, such as Google Protocol Buffers or JSON, would be a solution.
If the new feature requires changing the existing data, there is nothing you can do but to do data migration.
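As an illustration of the "tolerant reader" style that a JSON-based format buys you, an upcasting step applied during recovery could look roughly like the sketch below. The event shape, version field and defaults are assumptions; Akka Persistence also provides event adapters for exactly this kind of transformation, but the sketch is deliberately framework-agnostic:

    // Framework-agnostic sketch: upcast old JSON events to the current schema
    // before handing them to the recovery logic.
    function upcastOrderPlaced(rawJson) {
      var event = JSON.parse(rawJson);

      // Old events have no schemaVersion field; treat them as version 1.
      var version = event.schemaVersion || 1;

      if (version < 2) {
        // The new feature added a "currency" field: fill in a safe default
        // and rename the old "amount" field to "totalAmount".
        event.currency = 'EUR';
        event.totalAmount = event.amount;
        delete event.amount;
        event.schemaVersion = 2;
      }

      return event;
    }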

How to have complete offline functionality in a web app with PostgreSQL database?

I would like to give a web app with a PostgreSQL database 100% offline functionality. In an ideal case the database would be completely replicated in the browser per user and synchronized when online, so that the same code can be used to talk to both the offline and online databases. I know this is possible with PouchDB and CouchDB, but I have not found a solution that works with PostgreSQL. Is this at all possible?
Short answer: I don't know of anything like this that currently exists.
However, in theory, this could be made to work...(long answer:)
1. Write a PostgreSQL backend for levelup (one exists for MySQL: https://github.com/kesla/mysqldown).
2. Wire up pouchdb-server to read/write from your PostgreSQL DB using PouchDB's existing leveldb adapter (which in turn will have to be configured to use your Postgres backend). Congrats, you can now sync data using PouchDB!
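Very roughly, the wiring for step 2 could look like this. Note that "pgdown" is a hypothetical PostgreSQL-backed leveldown implementation (the thing you would write in step 1, analogous to mysqldown), and exactly how connection details are passed depends on how you design that backend:

    // Server side: point PouchDB's leveldb adapter at a hypothetical
    // PostgreSQL-backed leveldown backend ("pgdown" does not exist out of
    // the box - it is what you would write in step 1).
    var PouchDB = require('pouchdb');
    var pgdown = require('pgdown');   // hypothetical, analogous to mysqldown

    var ServerPouch = PouchDB.defaults({
      db: pgdown                      // custom leveldown factory
      // connection details for Postgres would be passed however your
      // backend expects them (e.g. via the database "location" string)
    });

    // Expose it over HTTP with express-pouchdb (what pouchdb-server uses internally).
    var express = require('express');
    var expressPouchDB = require('express-pouchdb');
    var app = express();
    app.use('/db', expressPouchDB(ServerPouch));
    app.listen(5984);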
Whether an approach like this is practical in reality for your application is a different question you'll have to answer.
You may be wondering, for example, "will I be able to sync an existing complex schema with multiple tables to the client with this approach?" The answer is probably not - the mysqldown implementation of leveldown uses a single MySQL table with three fields: id, key, and value (source), and I imagine any general-purpose PostgreSQL adapter would be similar (nothing says you can't do a special-purpose adapter just for your app though!).
On the other hand, if you were to implement a couchdb-compatible API (or a subset- you may not need attachments, for example) over your existing database schema, there's nothing stopping you from using PouchDB on the client to talk directly to that as if it were an actual CouchDB - just pop in the URL and call replicate()! Implementing the replication protocol might be a fair bit of work, since you'd need to track revisions and so on somewhere - but again, technically not impossible!
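On the client, that second option is pleasantly boring. Assuming your server exposes a CouchDB-compatible endpoint at a URL like https://example.com/api/couch/mydb (made up for this sketch), the sync code is just:

    // Client side: local PouchDB (IndexedDB/WebSQL under the hood) kept in sync
    // with a CouchDB-compatible endpoint implemented over Postgres.
    var local = new PouchDB('myapp');
    var remote = new PouchDB('https://example.com/api/couch/mydb');

    local.sync(remote, { live: true, retry: true })
      .on('change', function (info) { console.log('replicated', info); })
      .on('error', function (err) { console.error('sync error', err); });

    // Reads and writes always go to the local database, online or offline:
    local.put({ _id: 'todo:1', title: 'works offline' });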
There are also implementations of levelup's backend storage that are designed for browsers. See level.js, which could be another way to sync between a server-side Postgres levelup backend and the browser.
TL;DR: There's tons of work being done around JavaScript databases right now. Is syncing with Postgres impossible? Probably not. Would it be a lot of work? Definitely. Worth it? Who knows, but it would be cool.
Without installing PostgreSQL on the client? No. Obviously you can cache data for offline use, but an entire RDBMS plus procedural languages in JavaScript? No.

SOA: Joining data across multiple services

Imagine we have 2 services: Product and Order. Based on my understanding of SOA, I know that each service can have its own data store (a separate database, or a group of tables in the same database). But no Service is allowed to touch the data store of another Service directly.
Now, imagine we have stored the product and order data independently inside Product and Order Services. In the Order Service, we can identify products by their ID.
My question is: With this architecture, how can I display the list of orders and product details on the "same" page?
My understanding is that I should get the list of OrderItems from OrderService. Each OrderItem has a ProductID. Now, if I make a separate call to ProductService to retrieve the details about each Product, that would be very inefficient.
How would you approach this problem?
Cheers,
Mosh
I did some research and found 2 different solutions for this.
1- Services can cache data from other Services locally. But this requires a pub/sub mechanism, so any changes to the source data must be published so that subscribing Services can update their local caches. This is costly to implement, but it is the fastest solution because the Service has the required data locally. It also increases the availability of a Service by preventing it from being dependent on the data of other Services; in other words, if the other Service is not available, it can still do its job using its cached data.
2- Alternatively, a Service can query a "list" of objects from another Service by supplying a list of identifiers. This avoids making a separate call to the target Service for each individual object (see the sketch below). It is easier to implement, but performance-wise it is not as fast as solution 1. Also, if the target Service is not available, the source Service cannot do its job.
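A sketch of solution 2, with made-up endpoints and response shapes, showing one call per page of orders instead of one call per product:

    // Sketch of solution 2: batch-fetch product details for a page of orders.
    // The /orders and /products endpoints and their payloads are hypothetical.
    async function getOrdersWithProducts() {
      var orders = await fetch('/order-service/orders?page=1').then(r => r.json());

      // Collect the distinct product ids referenced by the order items.
      var productIds = [...new Set(orders.flatMap(o => o.items.map(i => i.productId)))];

      // One call to the Product service for the whole list of ids.
      var products = await fetch('/product-service/products?ids=' + productIds.join(','))
        .then(r => r.json());
      var byId = new Map(products.map(p => [p.id, p]));

      // Join in memory for display.
      return orders.map(o => ({
        ...o,
        items: o.items.map(i => ({ ...i, product: byId.get(i.productId) }))
      }));
    }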
Hope this helps others who have come across this issue.
Mosh
DB integration (which is really what you are talking about when two services share a table in a DB) is wrong at so many levels!
It completely breaks some of the major principles of software engineering
loose coupling,
encapsulation
separation of concerns
A service should be (to earn that name) completely independent, namely:
it must not rely on others to ensure the consistency and coherence of its data
it must not rely on others to guarantee the security of its data
it must not depend on external implementations (only interfaces)
Two services that share data at the DB level are unable to guarantee any of the above.
The fact that you "control" both services is completely irrelevant. Today you control... tomorrow you might want to outsource or replace one of the services. That should be as simple as ensuring the proper interfaces are in place.
Imagine both services that share a table with some field (varchar) in it. Now one service needs to change that field to numeric... bang the other service stops functioning - loose coupling goes down the drain.
Most of the time the trick lies in properly defining the service scope and clearly stipulating what a service does and what it doesn't do. You should also avoid turning everything into a service: set your service granularity too high and services will start popping up everywhere, and integration headaches will escalate.
That being said, there are some situations where data integration between services poses challenges. The main premise, though, should always be that data can belong to only one service. Data is intrinsically tied to the business logic that affects its consistency and coherence, and as such there should never be more than one service controlling any given piece of data.
Another approach would be to have some sort of data source that lives outside of the SOA services. This data source could be considered your cache of the data, your operational data source or even a data warehouse. Extraction packages can export the data from the services (and/or some sort of real time mechanism). You can query this data source how you want.
The advantage of this approach is that the SOA black box is maintained and you can swap out a service knowing how you have coupled it.
Disadvantage is the added complexity and maintenance overhead.
SOA is just a buzz-phrase for deploying components behind web services. How many data stores you have is entirely up to you. In some cases it makes sense to have partitioned data behind individual components, in other cases all the data lives behind one service, and in yet other cases many components that expose service interfaces connect to the same database via the database's connection protocol. Approach the problem by approaching the problem, not by imposing artificial constraints.
I don't think there is any principle in SOA that says services must have separate data stores; in general it is actually impractical. Yes, you can have a Product and an Order service, and the client can do the join using web service calls as you said, and this may be acceptable in some scenarios. But that doesn't mean you cannot have a specific service for a client if you already know the client's behaviour and performance requirements.
What I mean is that you can have a search service that returns orders and products with the join done in the database. This is practical and would solve your business problem.
It is unfortunate to see this whole discussion deteriorating into a "can I use a shared database or not in SOA" debate, which is totally irrelevant and does not help answer the original question at all.
More often than not, in a real-world situation the data is already stored in different systems to start with. Customer data, for example, comes from the CRM, product data from SAP, and contract data from yet another source.
It is not a quest to bring this data together technically, but rather an understanding that there is only one source of the data. To put it differently, there is only one owner of the data within your enterprise, who is solely responsible for maintaining it and ensuring the correct data quality.
Storing data locally for performance reasons means replicating data, which is more often than not a dangerous venture unless you have a solid caching strategy in place. I think Mosh has given some sensible answers for dealing with an existing application landscape.

How do we share data between two different services

I am currently working on a web service which is periodically polled. It does not store its state and is instantiated every time it is queried. Essentially, it retrieves the state of other external entities (e.g. databases) and delivers it back to the requester.
Recently, the need to store state has arisen, in that:
There is the need to continuously collect data from a particular source and store the bits that are important/relevant.
There is the need to collect the aggregate of a particular data source over a period of time.
I came up with the following idea:
My main concern here is the fact that I am using a static class (essentially a global) to share data between the two services. Is there a better way to doing this?
edit: Thanks for the responses thus far. Apologies for the vagueness of this question: I am just trying to work out the best way to share data across different services and am unsure of the specifics (i.e. what is required). The platform I am developing on is the .NET framework, and both services are simply WCF services hosted as a Windows service.
The database route sounds like the most conventional way to go. However, I am reluctant to go down that path for now (mainly for deployment/setup reasons: it introduces the need to create new tables, etc., in addition to simply installing the software) given that, at this point, only relatively small amounts of data are transferred. This may of course change in the future, and going the database route might be the way to go at that point.
Is there any other way besides adding a database persistence layer?
If you need to collect and aggregate data, you might want to consider using a database between the two layers. Or have I misunderstood something?
You should consider enhancing your question with more requirements: pretty much all options are open here.
Sure: how about data binding? I don't have a lot of information to go on here about your platform, but most sufficiently advanced systems offer it in some form.
You could replace your static shared data with some database representation, with a caching layer (like memcached) between the database and the webservice, so that most of the time the data is available very quickly from the cache, but can be retrieved from the database as needed.
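A minimal cache-aside sketch of that idea (shown here with the Node memcached client purely as an illustration; the original question is .NET/WCF, where the same pattern applies with, say, a MemoryCache or a memcached client, and the key names and query are made up):

    // Cache-aside sketch: try memcached first, fall back to the database,
    // then populate the cache.
    var Memcached = require('memcached');
    var cache = new Memcached('localhost:11211');

    function getAggregate(sourceId, db, callback) {
      var key = 'aggregate:' + sourceId;

      cache.get(key, function (err, cached) {
        if (!err && cached) { return callback(null, cached); }   // cache hit

        // Cache miss: read from the database...
        db.query('SELECT * FROM aggregates WHERE source_id = ?', [sourceId],
          function (dbErr, rows) {
            if (dbErr) { return callback(dbErr); }

            // ...and store the result in the cache for 60 seconds.
            cache.set(key, rows, 60, function () {
              callback(null, rows);
            });
          });
      });
    }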
I appreciate that you want to keep the architecture simple. Depending on the number of items you have to look up and their permanency, you might just consider leveraging your file system or a message queue. It sounds like you want the file system, because that has the least impact on your design.
If you start dealing with tens of thousands of small files, your directories can get hard to navigate and slow to do file lookups on. I typically shoot for about 1,000 to 10,000 files per directory and concoct a routine that can generate a path to the file depending on the file name pattern (see the sketch below). Keeping the number of subdirectories even is important; some file systems have a limit on the number of subdirectories in a parent directory.
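One simple way to concoct such a routine is to hash the file name and use the first characters of the hash as nested subdirectory names, which keeps the fan-out per directory roughly even. The two-level, 256-buckets-per-level choice below is arbitrary:

    // Sketch: derive a stable directory path from a file name so that files
    // spread evenly across subdirectories (here: 256 x 256 buckets).
    var crypto = require('crypto');
    var path = require('path');

    function pathForFile(rootDir, fileName) {
      var hash = crypto.createHash('md5').update(fileName).digest('hex');
      // First two hex pairs -> two nested directory levels, e.g. "3f/a9/report.json"
      return path.join(rootDir, hash.slice(0, 2), hash.slice(2, 4), fileName);
    }

    // Example: pathForFile('/var/data', 'report-2016-05.json')
    //   -> '/var/data/xx/yy/report-2016-05.json' (the hash prefix will vary)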