How to register interest in a Gemfire peer-peer configuration using a replicated region - spring-data-gemfire

I have a peer-to-peer GemFire topology with roughly 15 peers. I use Spring Data GemFire to initialize the GemFire context, and all regions are replicated for the fastest possible access.
Each peer only needs access to a small subset of all the GemFire regions. I would like each peer to register interest only in the regions it needs and avoid all unnecessary traffic. Is there a way to do this using Spring Data GemFire?
Versions used:
Spring: 3.2.1
GemFire: 6.6.3.2
Spring Data GemFire: 1.2.2

"Each peer only needs access to a small subset of all the Gemfire regions."
By "small subset" do you mean data or in fact, just certain "REPLICATE" Regions?
If the later, then only configure a peer member to have the REPLICATE Region (e.g. X) that needs that Region (e.g. X). For instance...
Member A - REPLICATE Region X, Y, Z
Member B - REPLICATE Region X, Y
Member C - REPLICATE Region Z
Then, only those members having the REPLICATE Region will actually receive (all) data and events for that Region.
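In Spring Data GemFire terms, that just means declaring only the <gfe:replicated-region> elements a given peer needs in that peer's configuration. As a rough sketch of the same idea for Member B, using the plain GemFire 6.x Java API (the locator address is a placeholder):

import com.gemstone.gemfire.cache.Cache;
import com.gemstone.gemfire.cache.CacheFactory;
import com.gemstone.gemfire.cache.Region;
import com.gemstone.gemfire.cache.RegionShortcut;

public class MemberB {
    public static void main(String[] args) {
        // Join the distributed system as a peer (locator address is a placeholder).
        Cache cache = new CacheFactory()
            .set("mcast-port", "0")
            .set("locators", "localhost[10334]")
            .create();

        // Host only the REPLICATE Regions this member actually needs (X and Y).
        // Member B receives no data or events for Region Z, because it does not host Z.
        Region<String, Object> x = cache.createRegionFactory(RegionShortcut.REPLICATE).create("X");
        Region<String, Object> y = cache.createRegionFactory(RegionShortcut.REPLICATE).create("Y");
    }
}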
If you want to control the actual contents of the Region (i.e. the data), then know that, by default, a REPLICATE Region is a full replicate (an all-or-nothing strategy), in that it distributes all data/events to all members hosting the Region. Or, in the GemFire User Guide's own words...
"Replicated regions always receive all events from peers and require no further configuration."
(http://gemfire.docs.pivotal.io/latest/userguide/developing/events/configure_p2p_event_messaging.html)
Also note... there is no "Register Interest" feature for peers. Register Interest is strictly a client/server feature. Another client/server "interest"-type option is Continuous Queries (CQs).
However, you can create a Region with a different Data Policy and use Subscription to get only the data you want. E.g. ...
<gfe:partitioned-region id="X" ...>
    <gfe:subscription type="CACHE_CONTENT"/>
</gfe:partitioned-region>
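For reference, the same subscription interest policy expressed with the plain GemFire Java API might look roughly like this (a sketch, not the exact Spring Data GemFire wiring):

import com.gemstone.gemfire.cache.Cache;
import com.gemstone.gemfire.cache.CacheFactory;
import com.gemstone.gemfire.cache.InterestPolicy;
import com.gemstone.gemfire.cache.Region;
import com.gemstone.gemfire.cache.RegionShortcut;
import com.gemstone.gemfire.cache.SubscriptionAttributes;

public class SubscriptionExample {
    public static void main(String[] args) {
        Cache cache = new CacheFactory().create();

        // CACHE_CONTENT: this member only receives events for entries it already hosts,
        // rather than every event distributed for the Region.
        Region<String, Object> x = cache.createRegionFactory(RegionShortcut.PARTITION)
            .setSubscriptionAttributes(new SubscriptionAttributes(InterestPolicy.CACHE_CONTENT))
            .create("X");
    }
}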
Technically, you can use Subscription with REPLICATE Regions as well, but I am uncertain what actually happens in that scenario.
Also, I am not certain that peer Subscription in GemFire gives you the finer-grained controls that client Register Interest and CQs do. My knowledge is limited in this area.
See...
http://gemfire.docs.pivotal.io/latest/userguide/developing/distributed_regions/chapter_overview.html
And...
http://gemfire.docs.pivotal.io/latest/userguide/developing/events/how_cache_events_work.html
As well as the link above for more details.
Hope this helps.

Related

Cosmos DB Change Feeds in a Kubernetes Cluster with an arbitrary number of pods

I have a collection in my Cosmos database that I would like to observe for changes. I have many documents (official and unofficial) explaining how to do this. There is one thing though that I cannot get to work in a reliable way: how do I receive the same changes to multiple instances when I don't have any common reference for instance names?
What do I mean by this? Well, I'm running my workloads in a Kubernetes cluster (AKS). I have a variable number of instances within the cluster that should observe my collection. In order for change feeds to work properly, I have to have a unique instance name for each instance. The only candidate I have is the pod name. It's usually of the form <deployment-name>-<random string>. E.g. pod-5f597c9c56-lxw5b.
If I use the pod name as the instance name, not all instances receive the same changes (which is my requirement); only one instance will receive each change (see https://learn.microsoft.com/en-us/azure/cosmos-db/change-feed-processor#dynamic-scaling). What I can do is use the pod name as the feed name instead; then all instances get the same changes. This is what I fear will bite me in the butt at some point: when I peek into the lease container, I can see a set of documents per feed name. As pod names come and go (the random-string part of the name), I fear the container will grow over time, accumulating a heap of garbage. I know Cosmos can handle huge workloads, but you know, I like to keep things tidy.
How can I keep this thing clean and tidy? I really don't want to invent (or reuse for that matter!) some protocol between my instances to vote for which instance gets which name out of a finite set of names.
One "simple" solution would be to build my own instance names, if AKS or Kubernetes held some "index" of some sort for my pods. I know stateful sets give me that, but I don't want to use stateful sets, as the pods themselves aren't really stateful (except for this particular aspect!).
There is a new Change Feed pull model (which is in preview at this time).
The main differences are: with the pull model you control the pace at which changes are read, there is no lease container (you keep track of the continuation tokens yourself), and load balancing across instances is not handled for you.
In your case, it looks like you don't need parallelization (you want all instances to receive everything). The important part would be to design a state storing model that can maintain the continuation tokens (or not, maybe you don't care to continue if a pod goes down and then restarts).
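As a rough sketch of that with the pull model, using the Azure Cosmos DB Java SDK v4 as I recall it (queryChangeFeed, CosmosChangeFeedRequestOptions, and FeedRange are that SDK's names; verify against the current docs, and the endpoint/container names are placeholders):

import com.azure.cosmos.CosmosClient;
import com.azure.cosmos.CosmosClientBuilder;
import com.azure.cosmos.CosmosContainer;
import com.azure.cosmos.models.CosmosChangeFeedRequestOptions;
import com.azure.cosmos.models.FeedRange;
import com.azure.cosmos.models.FeedResponse;
import com.fasterxml.jackson.databind.JsonNode;

public class PullModelSketch {
    public static void main(String[] args) {
        CosmosClient client = new CosmosClientBuilder()
            .endpoint(System.getenv("COSMOS_ENDPOINT"))
            .key(System.getenv("COSMOS_KEY"))
            .buildClient();
        CosmosContainer container = client.getDatabase("mydb").getContainer("mycollection");

        // Each pod reads the full change feed independently; no lease container is involved.
        CosmosChangeFeedRequestOptions options =
            CosmosChangeFeedRequestOptions.createForProcessingFromNow(FeedRange.forFullRange());

        for (FeedResponse<JsonNode> page :
                container.queryChangeFeed(options, JsonNode.class).iterableByPage()) {
            for (JsonNode item : page.getResults()) {
                System.out.println("change: " + item);
            }
            // Persist this token per pod (or don't, if you don't care about resuming after a restart).
            String continuation = page.getContinuationToken();
        }
    }
}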
I would suggest that you proceed to use the pod name as the unique ID. If you are concerned about the sprawl of data, you could monitor the lease container and devise a clean-up mechanism for the metadata.
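For the push (change feed processor) route, a minimal sketch with the Java SDK v4, using the pod name as the instance name (Kubernetes usually exposes the pod name via the HOSTNAME environment variable; the container references are placeholders):

import com.azure.cosmos.ChangeFeedProcessor;
import com.azure.cosmos.ChangeFeedProcessorBuilder;
import com.azure.cosmos.CosmosAsyncContainer;
import com.fasterxml.jackson.databind.JsonNode;
import java.util.List;

public class ProcessorSketch {
    public static ChangeFeedProcessor build(CosmosAsyncContainer feedContainer,
                                            CosmosAsyncContainer leaseContainer) {
        return new ChangeFeedProcessorBuilder()
            .hostName(System.getenv("HOSTNAME"))   // pod name, e.g. pod-5f597c9c56-lxw5b
            .feedContainer(feedContainer)
            .leaseContainer(leaseContainer)
            .handleChanges((List<JsonNode> docs) -> {
                for (JsonNode doc : docs) {
                    System.out.println("change: " + doc);
                }
            })
            .buildChangeFeedProcessor();
    }
}

Note that with one shared lease container and distinct host names, the processor load-balances partitions across instances, which is exactly the one-instance-per-change behavior described in the linked dynamic-scaling doc; giving each pod its own lease set is what makes every pod see every change, and it is also the lease metadata that eventually needs cleaning up as pods come and go.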
In order to have at-least-once delivery, some metadata has to be persisted somewhere to track which items have been acknowledged, the position within a partition, etc. I suspect there could be a bit of work to get the change feed processor to give you at-least-once delivery once you consider pod interruption/rescheduling mid-flow.
As another option, Azure offers checkpoint-based message distribution from partitioned Event Hubs via EventProcessorClient. With EventProcessorClient, a bit of checkpoint metadata is also written to a storage account.

Too many REST API calls in microservices

Say there are two services,
service A and service B.
Service A needs data from service B to process a request. To avoid tight coupling, we make a REST API call to service B instead of directly querying service B's database.
Doesn't making an HTTP call to service B for every request increase the response time?
I have seen another solution: caching the data at service A. I have the following questions.
What if the data is rapidly changing?
What if the data is critically important, such as user account balance details, and there has to be strong consistency?
What about data duplication and data consistency?
By introducing the REST call, aren't we introducing a point of failure? What if service B is down?
Also, as requests to service A for that particular API increase, the load on service B increases as well.
Please help me with this.
These are many questions at once, let me try to give a few comments in random order:
If service A needs data from service B, then B is already a single point of failure, so the reliability question is just moved from B's database to B's API endpoint. It's very unlikely that this makes a big difference.
A similar argument goes for the latency: A good API layer including caching might even decrease average latency.
Once more the same with load: The data dependency of A on B already includes the load on B's database. And again a good API layer with caching might even help with the load.
So while the decoupling (from tight to loose) brings a lot of advantages, load and reliability are not necessarily in the disadvantages list.
A few words about caching:
Read caching can help a lot with load: typically, a request from A to B should indicate the version of the requested entity that A already has in its cache (possibly none, of course). Endpoint B can then just check whether the entity has changed and, if not, stop all processing and return an "unchanged" response. B can keep the record of which entities have changed in the recent past in a much smaller data store than the entities themselves, most likely in RAM or even in-process, which speeds things up quite noticeably.
Such a mechanism is much easier to introduce in an API endpoint for B than in the database itself, so querying the API can scale much better than querying the DB.
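As an illustration of that version-check idea, here is a minimal sketch using HTTP ETags and the JDK HttpClient (Java 16+ for the record; the endpoint URL is a hypothetical one for service B, not a prescribed design):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CachingServiceBClient {

    private record CachedEntity(String etag, String body) {}

    private final HttpClient http = HttpClient.newHttpClient();
    private final Map<String, CachedEntity> cache = new ConcurrentHashMap<>();

    public String fetchEntity(String id) throws Exception {
        CachedEntity cached = cache.get(id);
        HttpRequest.Builder request = HttpRequest.newBuilder(
                URI.create("https://service-b.internal/entities/" + id)); // hypothetical endpoint
        if (cached != null && cached.etag() != null) {
            request.header("If-None-Match", cached.etag()); // "I already have this version"
        }
        HttpResponse<String> response =
                http.send(request.GET().build(), HttpResponse.BodyHandlers.ofString());

        if (response.statusCode() == 304 && cached != null) {
            return cached.body(); // B did almost no work: the entity is unchanged
        }
        String etag = response.headers().firstValue("ETag").orElse(null);
        cache.put(id, new CachedEntity(etag, response.body()));
        return response.body();
    }
}

The interesting part is on B's side: answering the If-None-Match check only requires knowing which entities changed recently, which is a far smaller and cheaper lookup than rebuilding the full response.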
I guess the first question you should ask yourself is whether A and B are really two different services - what's the reason for partitioning them in the first place? After all, they seem to be coupled both temporally and by data.
One of the reasons to separate a service into two executables might be that they can change independently or serve different access paths, in which case you may want to consider them different aspects of the same service. This may seem like a distinction without a difference, but it is important when looking at the whole picture: which parts of the system are allowed to know about the internal structures of others, and how to keep the system from deteriorating into a big ball of mud where every "service" can access any other "service's" data and they are all dependent on each other.
If these two components are indeed different services, you may also consider moving to a model where service B publishes data changes actively. This way service A can cache the relevant parts of B's data. B is still the source of truth, and A is decoupled from B's availability (depending on the expiration of the data).
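A sketch of what that could look like on A's side; the event shape, the cache, and the listener wiring are all hypothetical names, not a specific broker's API:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical event that service B publishes whenever its data changes.
record ProductChanged(String productId, String name, long version) {}

// Service A keeps its own read-only copy of the parts of B's data it needs.
class ProductCache {
    private final Map<String, ProductChanged> latest = new ConcurrentHashMap<>();

    // Wire this method to your broker's listener mechanism (Kafka, RabbitMQ, etc.).
    public void onProductChanged(ProductChanged event) {
        // Keep only the newest version; out-of-order deliveries are ignored.
        latest.merge(event.productId(), event,
                (current, incoming) -> incoming.version() > current.version() ? incoming : current);
    }

    public ProductChanged get(String productId) {
        return latest.get(productId); // may be missing or slightly stale; B stays the source of truth
    }
}

The version field is what lets A tolerate duplicate or reordered deliveries, and the staleness window is bounded by how quickly B publishes its changes.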

How to retrieve data from another bounded context in ddd?

Initially, the app runs on the desktop; however, it will run on a web platform in the future.
There are some bounded contexts in the app, and some of them need to retrieve data from another. In this case, I don't know which approach I should use.
I thought of using the mediator pattern: a bounded context "A" requests data "X", and the mediator calls another bounded context, like "B", and gets the correct data "X". Finally, the mediator brings data "X" back to BC "A".
This scenario will change when the app runs on the web; then I've thought of having one microservice request data from another microservice, again using the mediator pattern.
Are both approaches reasonable, or is there a better solution?
Could anyone help me, please?
Thanks a lot!
If you're retrieving data from other bounded contexts through either DB or API calls, your architecture might fall into the "death star" pattern, because it introduces unwanted coupling and knowledge into the client context.
A better approach might be to look at event-driven mechanisms like webhooks or message queues as a way of emitting the data you want to share to the subscribing context(s). This is good because it reduces the coupling between your bounded contexts through data replication across contexts, which results in greater bounded context independence.
This gives you the feeling of "Who cares if bounded context B is not available at the moment? Bounded contexts A and C have the data they need inside them, and I can resume syncing later since my data-update events are recorded on my queue."
The answer to this question breaks down into two distinct areas:
the logical challenge of communicating between different contexts, where the same data could be used in very different ways. How does one context interpret the meaning of the data?
and the technical challenge of synchronizing data between independent systems. How do we guarantee the correctness of each system's behavior when they both have independent copies of the "same" data?
Logically, a context map is used to define the relationship between any bounded contexts that need to communicate (share data) in any way. The domain models that control the data are only applicable within a single bounded context, so some method for interpreting data from another context is needed. That's where the patterns from Evans' book come into play: customer/supplier, conformist, published language, open host, anti-corruption layer, or (the cop-out pattern) separate ways.
Using a mediator between services can be thought of as an implementation of the anti-corruption layer pattern: the services don't need to speak the same language, because there's an independent buffer between them doing the translation. In a microservice architecture, this could be some kind of integration service between two very different contexts.
From a technical perspective, direct API calls between services in different bounded contexts introduce dependencies between those services, so an event-driven approach like what Allan mentioned is preferred, assuming your application is okay with the implications of that (eventual consistency of the data). Picking a messaging platform that gives you the guarantees necessary to keep the data in sync is important. Most asynchronous messaging protocols guarantee "at least once" delivery, but ordering of messages and de-duplication of repeats is up to the application.
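On the de-duplication point, a small sketch of an idempotent consumer that remembers processed message IDs (the message type and the in-memory set are simplifications; a real system would persist this alongside the data it updates):

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical integration message carrying a unique ID assigned by the producer.
record IntegrationMessage(String messageId, String payload) {}

class IdempotentConsumer {
    private final Set<String> processed = ConcurrentHashMap.newKeySet();
    private final Consumer<IntegrationMessage> businessLogic;

    IdempotentConsumer(Consumer<IntegrationMessage> businessLogic) {
        this.businessLogic = businessLogic;
    }

    public void onMessage(IntegrationMessage message) {
        // "At least once" delivery means the same message may arrive more than once;
        // only the first occurrence is handed to the business logic.
        if (processed.add(message.messageId())) {
            businessLogic.accept(message);
        }
    }
}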
Sometimes it's simpler to use a synchronous API call, especially if you find yourself doing a lot of request/response type messaging (which can happen if you have services sending command-type messages to each other).
A composite UI is another pattern that allows you to do data integration in the presentation layer, by having each component pull data from the relevant service, then share/combine the data in the UI itself. This can be easier to manage than a tangled web of cross-service API calls in the backend, especially if you use something like an IT/Ops service, NGINX, or MuleSoft's Experience API approach to implement a "backend-for-frontend".
What you need is a DDD pattern for integration. BC "B" is upstream, "A" is downstream. You could go for an OHS (Open Host Service) with a Published Language upstream, and an ACL (Anti-Corruption Layer) downstream. In practice, this is a REST API upstream and an adapter downstream. Every time A needs the data from B, the adapter calls the REST API and adapts the returned info to A's domain model. This would be sync. If you want to go for an async integration, B would publish events with the info to a message queue, and A would listen for those events and get the info.
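A minimal sketch of the downstream ACL adapter for the synchronous case; the upstream URL, the DTO fields, and A's CustomerInfo model are all hypothetical, and Jackson is assumed for JSON parsing. The point is only the translation step:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import com.fasterxml.jackson.databind.ObjectMapper;

// Upstream (BC "B") representation, as exposed by its open host REST API.
record CustomerDto(String id, String fullName, String status) {}

// Downstream (BC "A") domain model; deliberately not the same shape as the DTO.
record CustomerInfo(String customerId, String displayName, boolean active) {}

// Anti-corruption layer: the only place in A that knows B's published language.
class CustomerAdapter {
    private final HttpClient http = HttpClient.newHttpClient();
    private final ObjectMapper mapper = new ObjectMapper();

    public CustomerInfo fetch(String id) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("https://bc-b.internal/api/customers/" + id)) // hypothetical endpoint
                .GET().build();
        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        CustomerDto dto = mapper.readValue(response.body(), CustomerDto.class);

        // Translate B's published language into A's own model.
        return new CustomerInfo(dto.id(), dto.fullName(), "ACTIVE".equals(dto.status()));
    }
}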
I want to add a comment about analytics in DDD. There are several approaches for sending data to analytics.
1) If you have a big enterprise application and you need to collect a lot of statistics from all bounded contexts, it is better to move analytics into a separate service and use a message queue to send data there.
2) If you have a simple application, separate your analytics from your app in another context and use an event or command to communicate with it.

SOA Service Boundary and production support

Our development group is working towards building up a service catalog.
Right now, we have two groups: one to sell a product, another to service that product.
We have one particular service that calculates if the price of the product is profitable. When a sale occurs, the sale can be overridden by a manager. This sale must also be represented in another system to track various sales and the numbers must match. The parameters of profitability also change, and are different from month to month, but a sale may be based on the previous set of parameters.
Right now the sale profitability service only calculates the profit; it is exposed via a RESTful URI.
One group of developers has suggested that the profitability service also support these "manager overrides" and support a date parameter to calculate based on a previous date. Of course the sales group of developers disagrees. If the service won't support this, the servicing developers will have to do an ETL between the two systems for each product, instead of just using the profitability service. Right now, since we do not have a set of services to handle this, production support gets the request and then has to update the one or more systems associated with that given product.
So, if a service works for a narrow slice, but an exception based business process breaks it, does that mean the boundaries of the service are incorrect and need to account for the change in business process?
Two, does adding a date parameter extend the boundary of the service too much, or should it be expected that if the service already has the parameters, it would also have a history of those parameters? At this moment, we do not have a service that only stores the parameters, as no one has expressed a need for it.
If there is any clarification needed before an answer can be given, please let me know.
I think the key here is: how much pain would be incurred by both teams if an ETL was introduced between the two?
Not that I think you're over-analysing this, but if I may, you probably have an opinion that adding a date parameter into the service contract is bad, but also dislike the idea of the ETL.
Well, strategy aside, I find these days my overriding instinct is to focus less on the technical ins and outs and more on the purely practical.
At the end of the day, ETL is simple, reliable, and relatively pain-free to implement; however, it comes with side effects. The main one is that you'll be coupling changes to your service's db schema with an outside party, which will limit options to evolve your service in the future.
On the other hand allowing consumer demand to dictate service evolution is easy and low-friction, but also a rocky road as you may become beholden to that consumer at the expense of others.
Another possibility is to allow the extra service parameters to be delivered to the consumer via a message, rather than through the service interface itself. This would allow you to keep your service boundary intact and for the consumer to hold the necessary parameters locally.
Apologies if this does not address your question directly.

UML Deployment Diagram for IaaS and PaaS Cloud Systems

I would like to model the following situation using a UML deployment diagram.
A small command and control machine instance is spawned on an Infrastructure as a Service cloud platform such as Amazon EC2. This instance is in turn responsible for spawning additional instances and providing them with a control script NumberCruncher.py either via something like S3 or directly as a start up script parameter if the program is small enough to fit into that field. My attempt to model the situation using UML deployment diagrams under the working assumption that a Machine Instance is a Node is unsatisfying for the following reasons.
The diagram seems to suggest that there will be exactly three number cruncher nodes. Is it possible to illustrate a multiplicity of Nodes in a deployment diagram, like one would illustrate a multiplicity of object instances using a multi-object? If this is not possible for Nodes, then this seems to be a long-standing issue.
Is there any way to show the equivalent of deployment regions / data-centres in the deployment diagram?
Lastly:
What about Platform as a Service? The whole Machine Instance is a Node idea completely breaks down at that point. What on earth do you do in that case? Treat the entire PaaS provider as a single node and forget about the details?
Regarding your first question:
Is there any way to show the equivalent of deployment regions /
data-centres in the deployment diagram?
I generally use Notes for that.
And your second question:
What about Platform as a Service? The whole Machine Instance is a Node
idea completely breaks down at that point. What on earth do you do in
that case? Treat the entire PaaS provider as a single node and forget
about the details?
I would say yes to your last question. And I suppose you could take more details from the definition of the deployment model and its elements, especially at the end of this paragraph:
They [Nodes] can be nested and can be connected into systems of arbitrary
complexity using communication paths. Typically, Nodes represent
either hardware devices or software execution environments.
and
ExecutionEnvironments represent standard software systems that
application components may require at execution time.
source: http://www.omg.org/spec/UML/2.5/Beta1/