Application Performance monitoring on Swisscom Application Cloud


I am investigating options for monitoring our installation on Swisscom's Cloud Foundry. My objectives are the following:
monitor performance indicators for deployed applications (such as CPU, disk, memory)
monitor performance indicators for services (slow queries, number of queries, ideally also some metrics on hitting quotas)
So far, I understand the options are the following (including some BUTs):
I used a very nice TOP cf-plugin (github)
This works very well. It seems to register itself with the required firehose nozzles and consume their data.
That is very useful for tracing / ad-hoc monitoring, but not well suited to serious infrastructure monitoring.
Another way I found is to use a firehose-syslog solution.
This can be deployed as an app and (as far as I understand) does the job in a similar way to the TOP cf plugin.
The problem is that it requires a registered client so that it can authenticate with the doppler endpoint. For some reason, the top-cf-plugin does that automatically / in another way.
The last option I am considering is to build the monitoring into the app itself (using a special buildpack).
That can be done with Datadog, for example. But it also seems to require a dedicated UAA client to register the nozzle.
I would like to check whether somebody is (or was) on a similar road and has some findings.
Eventually I would like to raise the following questions with the Swisscom community support:
is it possible to register a uaac client in order to ingest events through the firehose nozzle from an external service? (this requires admin credentials, if I read correctly)
is there an alternative way to authenticate with the nozzle (for example using a special user and their authentication token)?
is there any alternative way to monitor CF deployments at Swisscom? Possibly, is there a paper, blog post or other form of documentation that would be helpful in this respect (also for other users of AppCloud)?

Since it requires admin permissions, we cannot give out UAA clients for the firehose.
However, there are different ways to get metrics in the context of a user.
CF API
You can obtain basic metrics of a specific app by polling the CF API:
https://apidocs.cloudfoundry.org/5.0.0/apps/get_detailed_stats_for_a_started_app.html
However, since you have to poll (and do so for each app separately), this is not the recommended way.
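As a rough illustration of the polling approach, here is a minimal Python sketch. It assumes the v2 endpoint from the linked docs (/v2/apps/<guid>/stats) and a response keyed by instance index with a nested stats.usage object. The function names, the api_url value and the way the authorization value is obtained (for example from cf oauth-token) are assumptions for illustration, not something Swisscom prescribes.

```python
import json
import urllib.request


def fetch_app_stats(api_url, app_guid, authorization):
    """Poll the stats endpoint once.

    `authorization` is assumed to be a complete header value such as
    "bearer <token>"; how you obtain it is up to you.
    """
    req = urllib.request.Request(
        f"{api_url}/v2/apps/{app_guid}/stats",
        headers={"Authorization": authorization},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def summarize_stats(payload):
    """Flatten the per-instance stats into a list of small dicts."""
    rows = []
    # The payload is assumed to be keyed by instance index ("0", "1", ...).
    for index, instance in sorted(payload.items(), key=lambda kv: int(kv[0])):
        usage = instance.get("stats", {}).get("usage", {})
        rows.append({
            "instance": int(index),
            "state": instance.get("state"),
            "cpu": usage.get("cpu"),
            "mem_bytes": usage.get("mem"),
            "disk_bytes": usage.get("disk"),
        })
    return rows
```

You would call fetch_app_stats on a timer for every app GUID and feed the result through summarize_stats, which is exactly the per-app polling overhead mentioned above.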
Metrics in syslog drain
CF allows devs to forward their logs to syslog drains; in more recent versions, CF also sends metrics to this syslog drain (see https://docs.cloudfoundry.org/devguide/deploy-apps/streaming-logs.html#container-metrics).
For example, you could use Swisscom's Elasticsearch service to store these metrics and then analyze them using Kibana.
Metrics using loggregator (firehose)
The firehose allows streaming logs to clients for two types of roles:
Streaming all logs to admins (which requires a UAA client with admin permissions) and streaming app logs and metrics to devs with permissions in the app's space. This is also what the cf logs command uses. cf top also works this way (it enumerates all apps and streams the logs of each app).
However, you will find out that most open source tools that leverage the firehose only work in admin mode, since they're written for the platform operator.
Of course you can also monitor your app by instrumenting it (white box approach), for example by configuring Spring Boot Actuator in a Spring Boot app or by including an agent of your favourite APM vendor (Dynatrace, AppDynamics, ...).
I guess this is the most common approach; we've seen a lot of teams succeed by instrumenting their applications, especially since advanced monitoring requires you to create your own metrics anyway: the CPU/memory metrics provided by the firehose are not that powerful in a microservice world.
However, the second option (metrics in a syslog drain) would be worth a try as well, especially since the ELK stack's metrics support is getting better and better.

Related

GKE and Task Queues

I am working on a cloud service platform that consists of getting tasks from users, executing them, and giving back the results.
TL;DR
Is there a way to have a "task queue" where tasks can be inserted via a REST API and pulled automatically by a Google Kubernetes Engine cluster that scales automatically?
Long description
Users can send tasks in parallel, and each task is time consuming and needs to be performed on a GPU. So, setting up an auto-scaling GPU cluster is what I thought of.
More specifically, in my idea, users send tasks/data through a REST API, the REST API fills a task queue, and the task queue itself feeds tasks to workers on the auto-scaling GPU cluster. Of course, there are other details (authentication, database, storage, etc.) that have to be addressed but are not the point of my question.
For reasons I won't go into here, the project has already been started on the Google Cloud Platform, so switching to AWS or other providers is not an option.
From what I understand, things are a bit different from standard Docker-only clusters in AWS; that is, we have to use the Google Kubernetes Engine (GKE) to set up the auto-scaling cluster, even for "simple" GPU-enabled Docker containers.
By looking at the not-so-exhaustive documentation, I know that queues are used, but what I don't know is whether feeding of tasks to the cluster is automatically handled. Also, the so-called "Task Queue" service has been deprecated.
Thank you!

At first I thought Cloud Tasks queues might be the answer to your troubles, but this post seems to promote Cloud Pub/Sub as a better alternative.

After a quick chat with the batch developers, the current solution (before the batch service becomes public) is to adopt a third-party queue system like Slurm.

How do micro services in Cloud Foundry communicate?

I'm a newbie in Cloud Foundry. I have been following the reference application provided by Predix (https://www.predix.io/resources/tutorials/tutorial-details.html?tutorial_id=1473&tag=1610&journey=Connect%20devices%20using%20the%20Reference%20App&resources=1592,1473,1600); the application consists of several modules, and each module is implemented as a microservice.
My question is, how do these microservices talk to each other? I understand they must be using some sort of REST calls, but the problems are:
service registry: Say I have services A, B, C. How do these components 'discover' the REST URLs of the other components? The component URL is only known after the service is pushed to Cloud Foundry.
How does Cloud Foundry control the component dependencies during service startup and shutdown? Say A cannot start until B is started, and B needs to be shut down if A is shut down.

The ref-app 'application' consists of several 'apps' and Predix 'services'. An app is bound to the service via an entry in the manifest.yml. Thus, it gets the service endpoint and other important configuration information via this binding. When an app is bound to a service, the 'cf env ' command returns the needed info.
There might still be some Service endpoint info in a property file, but that's something that will be refactored out over time.
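To make the binding mechanism a bit more concrete: Cloud Foundry exposes bound services to the app as JSON in the VCAP_SERVICES environment variable, which is the same data 'cf env' prints. The following is a minimal Python sketch of looking up one bound instance by name; the function name and the instance name used in the usage note are hypothetical, and the credential keys depend entirely on the service.

```python
import json
import os


def find_service_credentials(instance_name, environ=os.environ):
    """Return the credentials dict of a bound service instance, or None.

    VCAP_SERVICES maps a service label to a list of bound instances;
    each instance carries a "name" and a "credentials" object.
    """
    services = json.loads(environ.get("VCAP_SERVICES", "{}"))
    for instances in services.values():
        for instance in instances:
            if instance.get("name") == instance_name:
                return instance.get("credentials", {})
    return None
```

An app would call something like find_service_credentials("my-backend") at startup and read the endpoint URL from whichever key that service puts into its credentials.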
The individual apps of the ref-app application are split into separate microservices because they get used as components of other applications; hence the microservices approach. If there were startup dependencies across apps, the CI/CD pipeline that pushes the apps to the cloud would need to manage these dependencies. The dependencies in ref-app are simply the obvious ones; read on.
It's true that coupling of microservices is not part of the design, but there are some obvious reasons it might happen: language and function. If you have a "back-end" microservice written in Java used by a "front-end" UI microservice written in JavaScript on Node.js, then these are pushed as two separate apps. Theoretically the UI won't work too well without the back-end, but there is a plan to actually make that happen with some canned JSON. Still, there is some logical coupling there.
One nice thing you get from microservices is that they can scale differently, and Cloud Foundry makes that quite easy with the 'cf scale' command. They might be used by multiple other microservices, which creates new scale requirements. So, thinking about what needs to scale, and about the release cycle of the functionality, helps in deciding what comprises a microservice.
As for ordering: the Google Maps API, for example, might be required by your application, so it could be said that it should be launched first and your application second. But in reality, your application should take into account that the Maps API might be down. Your goal should be that your app behaves well when a dependent microservice is not available.
The 'apps' of the 'application' know about each other through their names and the URLs that the cloud gives them. There are actually many copies of the reference app running in various clouds and spaces. They are prefixed with things like Dev or QA or Integration, etc. Could we get the Dev front end talking to the QA back-end microservice? Sure, it's just a URL.
In addition to the aforementioned etcd (which I haven't tried yet), you can also create a CUPS (cf create-user-provided-service) service 'definition'. This is also a set of key/value pairs, which you can tie to the space (dev/qa/stage/prod) and bind via the manifest. This way you get the properties from the environment.

If microservices do need to talk to each other, it is generally via REST, as you have noticed. However, microservice purists may be against such dependencies. That aside, service discovery is enabled by publishing available endpoints to a service registry (etcd in the case of Cloud Foundry). Once an endpoint is registered, the various instances of a given service can register themselves with the registry using a POST operation. The client needs to know only the published endpoint, not the endpoints of the individual service instances. This is self-registration. The client either communicates with a load balancer such as ELB, which looks up the service registry, or must itself be aware of the service registry.
For (2), there should not be such a hard dependency between microservices, as per the microservice definition; designing such a coupled set of services points to imminent issues such as orchestration and synchronization. If such dependencies do emerge, you will have to rely on service registries, health checks and circuit breakers for fallback.
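For readers who have not met the circuit-breaker idea, here is a minimal Python sketch of it. This is not a particular library's API: the class name, the thresholds and the injectable clock are all assumptions made for illustration. After a number of consecutive failures the breaker "opens" and serves the fallback without calling the dependency, then allows one trial call once a cool-down has passed.

```python
import time


class CircuitBreaker:
    """Tiny illustrative circuit breaker (hypothetical, not a real library)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: skip the dependency entirely.
                return fallback()
            # Cool-down elapsed: allow one trial call (half-open).
            # A single failure will reopen the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result
```

Service A would wrap its REST call to B in breaker.call(call_b, cached_or_default), so that A keeps responding (in a degraded way) while B is down instead of depending on startup order.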

Is it possible to isolate applications from one another in Service Fabric?

When running a Service Fabric cluster, it would make sense to have multiple applications running in it, but those applications might not be dependent on each other in any way. For example, I can have a CustomerApp in there, and a WikiApp.
Now from a security standpoint, it would be great if the WikiApp could be isolated from the CustomerApp, as a wiki clearly should not be able to connect to services of an app that holds customer data. I could put authentication into the services of the CustomerApp itself to allow only calls from authenticated services, but it would be even better if the WikiApp could not even connect to or see the other app, and could not resolve an endpoint address from the naming service.
So is there a way to really isolate applications from each other in Service Fabric with a platform feature? I could not find anything about it in the documentation, and I also doubt it's possible the way Service Fabric works, but it would be very useful.
And to be clear, I'm really talking about isolating applications (ApplicationTypes) from each other, not services within a single application.

There are some levels of isolation built in:
Application instances have process-level isolation, in that each application instance runs in its own process.
Node isolation is possible, using placement constraints, to "isolate" services from each other by constraining them to run on different nodes.
Container support will be available in the future, where applications and services can run inside containers for further environment and resource isolation.
Services can run under unique user accounts, which you can use to perform authentication yourself at the application level.
But unfortunately there is no fine-grained role-based access mechanism built into the platform today. So, for example, system-wide operations like running queries to get a list of applications or services, or resolving endpoints using the naming service, don't have any role-based access control built in.

How to use kafka and storm on cloudfoundry?

I want to know if it is possible to run kafka as a cloud-native application, and can I create a kafka cluster as a service on Pivotal Web Services. I don't want only client integration, I want to run the kafka cluster/service itself?
Thanks,
Anil

I can point you at a few starting points, but there would be some work involved to go from those starting points to something fully functional.
One option is to deploy the kafka cluster on Cloud Foundry (e.g. Pivotal Web Services) using docker images. Spotify has Dockerized kafka and kafka-proxy (including Zookeeper). One thing to keep in mind is that PWS currently doesn't support apps with persistence (although this work is starting) so if you were to go this route right now, you would lose the data in kafka when the application is rolled. Looking at that Spotify repo, it looks like the docker images are generally run without any mounted volumes, so this persistence-less kafka seems like it may be a valid use case (I don't know enough about kafka to say).
The other option is to deploy kafka directly on some IaaS (e.g. AWS) using BOSH. BOSH can be hard if you're seeing it for the first time, but it is the ideal way to deploy any distributed software that you want running on VMs. You will also be able to have persistent volumes attached to your kafka VMs if necessary. Here is a kafka BOSH release which may work.
Once you have your cluster running, you have two ways to integrate your Cloud Foundry applications with it. The simplest is just to provide it to your applications as a "user-provided service", which lets you flow kafka cluster access info to your apps. The alternative would be to put a service broker in front of your cluster, which would be especially useful if you have many different people who will be pushing apps that need to talk to the kafka cluster. Rather than you having to manually tell people the access info each time, they can do something simple like cf bind-service SOME_APP YOUR_KAFKA_SERVICE. Here is a kafka service broker along with more info about service brokers in general.

According to the 12-factor app description (https://12factor.net/processes), Kafka should not run as an application on top of Cloud Foundry:
Twelve-factor processes are stateless and share-nothing. Any data that needs to persist must be stored in a stateful backing service, typically a database.
Kafka is often considered a "distributed commit log" and as such carries a large amount of state. Many companies use it to keep all events flowing through their distributed system of micro services for a long (sometimes unlimited) amount of time.
Therefore I would strongly recommend going for the second option in the accepted answer: Kafka topics should be bound to your applications in the form of stateful services.

How do config tools like Consul "push" config updates to clients?

There is an emerging trend of ripping global state out of traditional "static" config management tools like Chef/Puppet/Ansible, and instead storing configurations in some centralized/distributed tool, of which the main players appear to be:
ZooKeeper (Apache)
Consul (Hashicorp)
Eureka (Netflix)
Each of these tools works differently, but the principle is the same:
Store your env vars and other dynamic configurations (that is, stuff that is subject to change) in these tools as key/value pairs
Connect to these tools/services via clients at startup and pull down your config KV pairs. This typically requires the client to supply a service name ("MY_APP"), and an environment ("DEV", "PROD", etc.).
There is an excellent Consul Java client which explains all of this beautifully and provides ample code examples.
My understanding of these tools is that they are built on top of consensus algorithms such as Zab, Paxos and Gossip that allow config updates to spread almost virally, with eventual consistency, throughout your nodes. So the idea there is that if you have a myapp app that has 20 nodes, say myapp01 through myapp20, if you make a config change to one of them, that change will naturally "spread" throughout the 20 nodes over a period of seconds/minutes.
My problem is: how do these updates actually deploy to each node? In none of the client APIs (the one I linked to above, the ZooKeeper API, or the Eureka API) do I see some kind of callback functionality that can be set up and used to notify the client when the centralized service (e.g. the Consul cluster) wants to push and reload config updates.
So I ask: how is this supposed to work (dynamic config deployment and reload on clients)? I'm interested in any viable answer for any of those 3 tools, though Consul's API seems to be the most advanced IMHO.

You could use cfg4j for that. It's a Java configuration library for distributed services. It supports Consul as one of the configuration sources.

That's a nice question. I can tell you how the Consul HTTP client works.
I also initially thought that it used a push mechanism, but while recently exploring Consul I found that all Consul clients poll the server for the changes they want to watch. It is a slightly different kind of polling, though: Consul supports blocking queries. These are HTTP requests with a maximum timeout of 10 minutes. The query waits until there is some change on the watched key/folder and then returns with the latest index. If the index has changed, the client reloads the configuration. For more info: Consul Blocking Query
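A minimal Python sketch of such a watch loop, to show the mechanics. It assumes Consul's KV HTTP endpoint (/v1/kv/<key>) with the index and wait query parameters and the X-Consul-Index response header; the function names, the 5-minute wait and the injectable fetch function are my own choices for illustration, and error handling (for example a missing key, which returns 404) is left out.

```python
import base64
import json
import urllib.request


def http_get(url, timeout):
    """Return (X-Consul-Index header, parsed JSON body) for one request."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.headers.get("X-Consul-Index"), json.load(resp)


def watch_key(base_url, key, on_change, fetch=http_get, max_rounds=None):
    """Long-poll a KV key and call on_change whenever its index moves."""
    index = 0  # index=0 makes the first request return immediately
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        rounds += 1
        # The server holds the request open until the index changes
        # or the wait time elapses.
        url = f"{base_url}/v1/kv/{key}?index={index}&wait=5m"
        new_index, entries = fetch(url, 330)
        new_index = int(new_index)
        if new_index < index:
            index = 0  # index went backwards: start over
            continue
        if new_index != index:
            index = new_index
            # KV values arrive base64-encoded.
            on_change({e["Key"]: base64.b64decode(e["Value"] or "").decode()
                       for e in entries})
    return index
```

In an app you would run watch_key in a background thread with max_rounds left at None and let on_change swap in the new configuration; so it is still polling, just with the server holding each request open until something changes.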