Does this make sense for Orleans or SF and if so guidance please - azure-service-fabric

We’re working to take our software to Azure cloud and looking at Orleans and Service Fabric (SF) as potential frameworks. We need to:
Populate our analysis engines with lots of data (e.g., 100MB to 2GB) per engine instance.
Maintain that state, and if an engine instance goes idle for say 20 minutes or more, we’d like to unload it (i.e., and not pay for the engine instance resource).
Each engine instance will support one to several end users with a specific data set.
Each engine instance can be highly interactive generating lots of plot data near realtime. We’re maintaining state as we don’t want to pay the price to populate engine instance for each engine interaction.
An engine instance action can take a few seconds, a few minutes, to even tens of minutes. We’ll want some feedback.
Users may access an engine instance every few seconds (e.g., to steer the engine towards a result based on feedback) and will want live plot data.
Each user will want to talk to a specific engine instance.
As a user expresses interest in running a simulation (i.e., standing up an engine instance), ideally we want him to choose small/medium/large computing resource to run his engine instance (i.e., based on the problem he’s trying to solve he may want more or less computing/memory power).
We’re considering Orleans and SF but we’re having difficulty specifying architecture based on above requirements. We’ve considered:
Trying to think about an SF partition, or an Orleans silo as an ‘engine instance’ described above.
Leveraging both Orleans and SF notion of fault tolerance through replication.
Leveraging local (i.e., to partition or silo) storage to store results and maintain state (i.e., for long periods or until idle for 20 minutes).
We’ve not understood how to:
Limit a silo or a partition to a single engine instance so that we can control resourcing of the engine instance.
Keep a user’s engine instance data separate from another users engine instance data.
Direct a request from a user (e.g., through a web API) to a particular engine instance.
Does this make sense for Orleans, does it make more sense for SF? Any pointers on how to implement the above would be helpful.

When you say SF I assume you mean SF Actors right?
You can use them the way you want, but in both cases does not look as the right solution for your problem, because:
Actors are single threaded, if you plan to share the same instance with multiple clients, each one would have to wait for the previous one to finish before it start processing anything. If you need to monitor the status of a running actor, you would have to make the actor publish the updates to external subscribers.
Actor state is isolated, so you can't access the state of other actors, the way to do it is provide a method to return it, but if the actor is running a command you have to wait the completion, unless you make a separate state service to hold the processed data.
You can't limit the resources required for a actor, in service fabric you specify the resources needed for a service, but you can't do it for actors, and you can't limit the resources they use, when they hit the limit, service fabric will try to balance the resources for your, but nothing prevent the process to consume more memory than requested.
Both actor services communicates using the ask approach, so they will "block" the caller waiting for an answer, it is asynchronous but you still have to keep the caller 'waiting'. (block and wait is because there is not an idea of fire and forget like Akka that uses the Tell approach, where it delivery the message and forget.)
Based on some of your requirements, I think a containers would be a better approach. Because:
You can limit the resource consumption for each container
The data is isolated inside the container and not visible to others
But on containers you have to manage the replication and partitioning by yourself, so in this case I would recommend the best of both worlds:
Create SF services to host the shared data sets between the the users
SF Service+Actor to only store the results of users simulations.
Containers to run the simulations and send updates to actors
This is just an example, it all will depend on your requirements, architecture and how data will be isolated from each other.

Related

Firebase centered architecture

This is an architectural question.
I am currently in the process of designing a web application and I am used to a basic: frontend, api, database, microservices setup.
For the sake of saving money and making my architecture a little bit more modern than what I am used to I decided to look into serverless.
The two main parts I am interested in are google cloud functions and firebase. My understanding is that google cloud functions can be fired when a database entry in firebase has been manipulated.
The way I used to communicate between services was through message queues such as RabbitMQ but it seems to me that by using firebase and cloud functions you can build communication through the database without the need for message queues. What I mean by communication in this case, would be that one service would be able to react to the execution of another service by seeing that an entry in the database was changed.
My question therefore is, what are the upsides and downsides of letting all your "communication" between microservices run through firebase instead of message queues, and is this an architecture that is generally used?
AFAIK, cloud function triggers is a beta feature in Firebase, and according to the doc, there are some limitations for firestore trigger events:
It can take up to 10 seconds for a function to respond to changes in Cloud Firestore.
Ordering is not guaranteed. Rapid changes can trigger function invocations in an unexpected order.
Events are delivered at least once, but a single event may result in multiple function invocations. Avoid depending on exactly-once mechanics, and write idempotent functions.
Cloud Firestore triggers for Cloud Functions is available only for Cloud Firestore in Native mode. It is not available for Cloud Firestore in Datastore mode.
The most concerning limitation here is the first one. 10 seconds for an update is a long time if you need that update to be visible to the user.
Another disadvantage I see is that it may run out of control (in terms of system design) as the complexity increases. You may be tempted to add events for everything, and it may be hard to partition them by category, for example (in message queues, you can use topics for that).
Also, according to the doc, cloud functions are rate-limited to 16 invocations per 100 seconds, which may quickly be reached if you got some traffic on your app.
I would use trigger-events for isolated scenarios and use a message queue for the backbone communication between microservices.

Getting Beyond 50 Replica Set Members in Mongodb

I’m looking to build a distributed Access Control system for a microservice platform. I’m considering using Mongodb as my database technology. My system design objectives are as follows:
Policy Enforcement should be distributed - If any given Policy
Enforcement Point (PEP) experiences downtime, only the application
that the PEP serves should be affected.
Policy Decisions should be
distributed - We don’t want the whole platform to experience downtime
because a central Policy Decision Point (PDP) is experiencing
downtime. We only want it to affect the application that it serves.
Policy Administration should be centralized - Creating a centralized
policy administration interface provides the ability for any system
(including a UI) to understand what rights an individual has, and by
establishing a common interface it allows us to more easily audit
changes to access across a whole platform.
Policy Information (context) is distributed - We don’t get to choose this if we are
building a distributed microservice platform. We can centralize the
retrieval of additional context by aggregating data that is needed to
make access control decisions into a single place, but the data
sources are still distributed.
I’m considering building a system like the one shown below. The idea is that Access Policies are administered by a central Policy Admin API. This API manages Policies that are persisted to a mongodb cluster with a 3 member replica set backing it. I would like other APIs in the platform to have a dedicated policy-query-api (Policy Decision Point) that is deployed along side it to make Access Control decisions pertinent to the API. The idea is that if any one of the policy-query-apis goes down, only the API that it serves will be affected.
I want changes to Policies to be governed by the Policy Admin API and I would like the changes to be replicated across each mongo instance that is used by each of the policy-query-apis.I don’t want the mongo replicas for each policy-query-api to affect a write to the primaries.
I also don’t need immediate data consistency (less than 5 sec latency), but I would like the data replication to be handled at the database layer if possible. The technology is already built to handle this and I don’t want to reinvent the wheel at the application layer if possible.
I’ve looked at the documentation on Replica Set Members and I’ve pretty thoroughly reviewed the documentation on Replica Sets in Mongo. It seems like having a Hidden Member or Delayed Member would be a good fit for my use case. Do you agree? Also, I’m concerned about the 50 member replica set limit 1. Since each one of these replicas would serve an API in my platform, if there exceeded more than 50 microservices (which is quite likely) how would I manage replication like this?
Just so that I understand, you are asking about:
one standalone (?? your picture suggests standalone but you are asking about 50 node RS limit) node per application, data mirrored to standalone from the master RS
the application only queries its local standalone
MongoDB provides read preference nearest for the use case of reading data from local nodes. Importantly the nearest read preference still provides availability if your local node is unavailable - the next closest (roughly) node will be used in this case. Your proposed architecture would take the application down every time its local database node needs to be restarted for version upgrades.
You may also look into tag sets.
Additionally, MongoDB allows specifying priorities on nodes for election purposes. If you put all of your MongoDB nodes into the same RS, you can use priorities to have one of the 3 designated "main" servers be primaries if any of them are available.

Behaviour when reducing instances of a Bluemix application

I have an orchestrator service which keeps track of the instances that are running and what request they are currently dealing with. If a new instance is required, I make a REST call to increase the instances and wait for the new instance to connect to the orchestrator. It's one request per instance.
The orchestrator tracks whether an instance is doing anything and knows which instances can be stopped, however there is nothing in the API that allows me to reduce the number of instances stopping a particular instance, which is what I am trying to achieve.
Is there anything I can do to manipulate the platform into deterministically stopping the instances that I want to stop? Perhaps by having long running HTTP requests to the instances I require and killing the request when it's no longer required, then making the API call to reduce the number of instances?
Part of the issue here is that I don't know the specifics of the current behavior...
Assuming you're talking about CloudFoundry/Instant Runtime applications, all of the instances of an applications are running behind a load balancer which uses round-robin to distribute requests across the instances (unless you have session affinity cookie set up). Differentiating between each instances for incoming requests or manual scaling is not recommended and it's an anti-pattern. You can not control which instance the scale down task will choose.
If you really want that level of control with each instance, maybe you should deploy them as separate applications. MyApp1, MyApp2, MyApp3, etc. All of your applications can have the same route (myapp.mybluemix.net). Each of the applications can now distinguish themselves by their name (VCAP_APPLICATION) allowing you terminate them.

How to monitor (micro)services?

I have a set of services. Every service contains some components.
Some of them are stateless, some of them are stateful, some are synchronous, some are asynchronous.
I used different approaches to monitoring and alerting.
Log-based alerting and metrics gathering. New Relic based. Own bicycle.
Basically, atm I am looking for a way, how to generalize and aggregate important metrics for all services in single place. One of things, I want is that we monitor more products, than separate services.
As an end result I see it as a single dashboard with small amount of widgets, but looking at those widgets I would be able to say for sure, if services are usable to end-customer.
Probably someone can recommend me some approach/methodology. Or give a reference to some best practices.
I like what you're trying to achieve! A service is not production-ready unless it's thoroughly monitored.
I believe what your're describing goes into the topics of health-checking and metrics.
... I would be able to say for sure, if services are usable to end-customer.
That however will require a little of both ;-) To ensure you're currently fulfilling your SLA, you have to make sure, that your services are all a) running and b) perform as requested. With both problems I suggest to look at the StatsD toolchain. Initially developed by Etsy, it has become the de-facto standard for gathering metrics.
To ensure all your services are running, we're relaying Kubernetes. It takes our description for what should run, be reachable from outside etc. and hosts that on our infrastructure. It also makes sure, that should things die - that they will be restarted. It helps with things like auto-scaling etc. as well! Awesome tooling and kudos to Google!
The way it ensures that is with health-checks. There are multiple ways how you can ensure your service node booted by Kubernetes is alive and kicking (namely HTTP calls and CLI scripts but this should be a modular thing should you need anything else!) If Kubernetes detects unhealthy nodes it will immediately phase them out and start another node instead.
Now, making sure, all your services perform as expected you'll need to gather some metrics. For all of our services (and all individual endpoints), we gather a few metrics via StatsD like:
Requests/sec
number of errors returned (404, etc...)
Response times (Average, Median, Percentiles depending on the services SLA)
Payload size (Average)
sometimes the number of concurrent requests per endpoint, the number of instances currently running
general metrics like the hosts current CPU and memory usage and uptime.
We gather a lot more metrics but that's about the bottom line. Since StatsD has become more of a "protocol specification" than a concrete product there are a myriad of collector, front- and backends to choose from. They help you visualize your systems state and many of them feature alerts of something or some combination of metrics go beyond their thresholds.
Let me know, if this was helpfull!
There's at least 3 types of things you will need to monitor: the host where the service is deployed, the component itself and the SLAs and some of them depend on the software stack you're using as well as the architecture.
With that said, you could for example use Nagios to monitor the hardware where the services are deployed, Splunk for the services metrics/SLAs as well as for any errors that might occur. You can also use SNMP packages in case something goes wrong and you have a more sophisticated support structure, this would be yours triggers. Without knowing how your infrastructure/services are set up it is complicated to go into deeper details.

Interprocess messaging - MSMQ, Service Broker,?

I'm in the planning stages of a .NET service which continually processes incoming messages, which involves various transformations, database inserts and updates, etc. As a whole, the service is huge and complicated, but the individual tasks it performs are small, simple, and well-defined.
For this reason, and in order to allow for easy expansion in future, I want to split the service into several smaller services which basically perform part of the processing before passing it onto the next service in the chain.
In order to achieve this, I need some kind of intermediary messaging system that will pass messages from one service to another. I want this to happen in such a way that if a link in the chain crashing or is taken offline briefly, the messages will begin to queue up and get processed once the destination comes back online.
I've always used message queuing for this type of thing, but have recently been made aware of SQL Service Broker which appears to do something similar. Is SQLSB a viable alternative for this scenario and, if so, would I see any performance benefits by using that instead of standard Message Queuing?
Thanks
It sounds to me like you may be after a service bus architecture. This would provide you with the coordination and fault tolerance you are looking for. I'm most familiar and partial to NServiceBus, but there are others including Mass Transit and Rhino Service Bus.
If most of these steps initiate from a database state and end up in a database update, then merging your message storage with your data storage makes a lot of sense:
a single product to backup/restore
consistent state backups
a single high-availability/disaster recoverability solution (DB mirroring, clustering, log shipping etc)
database scale storage (IO capabilities, size and capacity limitations etc as per the database product characteristics, not the limits of message store products).
a single product to tune, troubleshoot, administer
In addition there are also serious performance considerations, as having your message store be the same as the data store means you are not required to do two-phase commit on every message interaction. Using a separate message store requires you to enroll the message store and the data store in a distributed transaction (even if is on the same machine) which requires two-phase commit and is much slower than the single-phase commit of database alone transactions.
In addition using a message store in the database as opposed to an external one has advantages like queryability (run SELECT over the message queues).
Now if we translate the abstract terms 'message store in the database as being Service Broker and 'non-database message store' as being MSMQ, you can see my point why SSB will run circles any time around MSMQ.
My recent experiences with both approaches (starting with Sql Server Service Broker) led me to the situation in which I cry for getting my messages out of SQL server. The problem is quasi-political but you might want to consider it: SQL server in my organisation is managed by a specialized DBA while application servers (i.e. messaging like NServiceBus) by developers and network team. Any change to database servers requires painful performance analysis from DBA and is immersed in fear that we might get standard SQL responsibilities down by our queuing engine living in the same space.
SSSB is pretty difficult to manage (not unlike messaging middleware) but the difference is that I am more allowed to screw something up in the messaging world (the worst that may happen is some pile of messages building up somewhere and logs filling up) and I can't afford for any mistakes in SQL world, where customer transactional data live and is vital for business (including data from legacy systems). I really don't want to get those 'unexpected database growth' or 'wait time alert' or 'why is my temp db growing without end' emails anymore.
I've learned that application servers are cheap. Just add message handlers, add machines... easy. Virtually no license costs. With SQL server it is exactly opposite. It now appears to me that using Service Broker for messaging is like using an expensive car to plow potato field. It is much better for other things.