GKE and Task Queues

I am working on a cloud service platform that receives tasks from users, executes them, and returns the results.
TL;DR
Is there a way to have a "task queue", where tasks can be inserted via a REST API and extracted automatically by the Google Kubernetes Engine cluster, with automatic scaling guaranteed?
Long description
Users can send tasks in parallel, and each task is time-consuming and needs to be performed on a GPU. So setting up an auto-scaling GPU cluster is what I thought of.
More specifically, in my idea, users would send tasks/data through a REST API, the REST API would fill a task queue, and the task queue itself would feed tasks to workers on the auto-scaling GPU cluster. Of course, there are other details (authentication, database, storage, etc.) that have to be addressed, but they are not the point of my question.
For reasons I don't specify here, the project is already started on the Google Cloud Platform, so switching to AWS or other providers is not an option.
From what I understand, things are a bit different from standard Docker-only clusters on AWS; that is, we have to use Google Kubernetes Engine (GKE) to set up the auto-scaling cluster, even for "simple" GPU-enabled Docker containers.
From the not-so-exhaustive documentation, I know that queues are used, but what I don't know is whether the feeding of tasks to the cluster is handled automatically. Also, the so-called "Task Queue" service has been deprecated.
Thank you!

At first I thought Cloud Tasks queues might be the answer to your troubles, but this post seems to promote Cloud Pub/Sub as a better alternative.
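To make the Pub/Sub variant concrete, here is a minimal sketch using the google-cloud-pubsub Python client. The project, topic, and subscription names and the run_on_gpu function are hypothetical placeholders, and the sketch assumes the topic and subscription already exist:

```python
# Sketch: the REST handler publishes tasks to Pub/Sub; GPU workers on GKE
# pull them. Project "my-project", topic "tasks", subscription "tasks-sub",
# and run_on_gpu() are all hypothetical placeholders.
import json
from google.cloud import pubsub_v1

PROJECT = "my-project"

def submit_task(task: dict) -> None:
    # Called from the REST API: enqueue one task.
    publisher = pubsub_v1.PublisherClient()
    topic = publisher.topic_path(PROJECT, "tasks")
    publisher.publish(topic, data=json.dumps(task).encode("utf-8")).result()

def worker() -> None:
    # Runs inside each GPU pod: pull tasks and process them.
    subscriber = pubsub_v1.SubscriberClient()
    subscription = subscriber.subscription_path(PROJECT, "tasks-sub")

    def callback(message):
        task = json.loads(message.data)
        run_on_gpu(task)   # hypothetical GPU workload
        message.ack()      # ack only after successful processing

    subscriber.subscribe(subscription, callback=callback).result()
```

The auto-scaling part would then come from scaling the worker Deployment on the subscription's backlog (as an external metric) and letting the GKE cluster autoscaler add or remove GPU nodes accordingly.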

After a quick chat with the Batch developers, the current solution (until the Batch service becomes public) is to adopt a third-party queue system like Slurm.

Related

How can I have a GKE cluster "expire" and delete itself?

We stand up a lot of clusters for testing/PoC/development, and it's up to us to remember to delete them.
What I would like is a way of setting a TTL on an entire GKE cluster and having it get deleted/purged automatically.
I could tag the clusters with a timestamp at creation and have an external process running on a schedule that reaps old clusters, but it would be great if I didn't have to do that. It might be the only way, but maybe there is a GKE/k8s feature for this?
Is there a way to have the cluster delete itself without relying on an external service? I suppose it could spawn a cloud function itself, but I'm wondering if there is a native GKE/k8s feature to do this more elegantly.
You can create a GKE cluster with alpha features enabled. Such clusters exist for one month at most and are then deleted automatically.
Read more: https://cloud.google.com/kubernetes-engine/docs/concepts/alpha-clusters
Try Cloud Scheduler and hook it up with your build server. Cloud Scheduler supports HTTP, App Engine, and Pub/Sub targets.
I don't believe there is a native way to do this, but it doesn't seem unreasonable to have Cloud Scheduler periodically trigger a Cloud Function that looks for appropriately labeled clusters and triggers their deletion via the API.
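A rough sketch of such a function, assuming the google-cloud-container client library; the project ID and the expire-at label (a Unix timestamp set at cluster creation) are hypothetical:

```python
# Sketch: scheduled Cloud Function that deletes GKE clusters whose
# "expire-at" label (hypothetical; a Unix timestamp set at creation)
# lies in the past.
import time
from google.cloud import container_v1

PROJECT = "my-project"  # placeholder

def reap_expired_clusters(event, context):
    client = container_v1.ClusterManagerClient()
    parent = f"projects/{PROJECT}/locations/-"  # "-" = all locations
    for cluster in client.list_clusters(parent=parent).clusters:
        expire_at = cluster.resource_labels.get("expire-at")
        if expire_at and int(expire_at) < time.time():
            name = (f"projects/{PROJECT}/locations/"
                    f"{cluster.location}/clusters/{cluster.name}")
            client.delete_cluster(name=name)
```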

Could I replace RabbitMQ with native kubernetes messaging queue

I couldn't find an answer: can we replace RabbitMQ/ActiveMQ/SQS with a native Kubernetes message queue?
Or are they totally different in terms of features?
It is a totally different mechanism.
Kubernetes' internal queues are not real "queues" you can use from external applications; they are part of the internal messaging system and manage only objects that are part of Kubernetes.
Moreover, Kubernetes doesn't provide any message queue as a service for external apps (except when your app actually services one of the Kubernetes objects).
If you are not sure which service is better for your app, try checking queues.io.
That is a list of almost all available MQ engines with some highlights.
If you are referring to the Parallel Processing Using a Work Queue approach, you can technically use any queuing system, because the main logic is in the code used to get the items from the queue; Kubernetes is used only to control the parallelism.
If the idea is to use the queue algorithm Kubernetes uses internally, it is not exposed as a service to external applications; you would have to copy the code and implement it in your application.
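For illustration, a minimal worker for that pattern might look like the sketch below; the Redis host and the "jobs" list name are hypothetical, and any external queue would serve equally well:

```python
# Sketch: worker pod for the "Parallel Processing Using a Work Queue"
# pattern. Each replica pops items from a shared Redis list until it is
# empty; a Kubernetes Job with parallelism > 1 only controls how many
# workers run at once.
import redis

r = redis.Redis(host="redis", port=6379)  # hypothetical in-cluster service

def process(item: bytes) -> None:
    print("working on", item.decode())  # stand-in for the real task

while True:
    popped = r.blpop("jobs", timeout=30)  # block up to 30s for an item
    if popped is None:
        break  # queue drained; exit so the Job can complete
    _key, item = popped
    process(item)
```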

Application Performance monitoring on Swisscom Application Cloud

I am investigating options for monitoring our installation in Swisscom's Cloud Foundry. My objectives are the following:
monitor performance indicators for deployed applications (such as CPU, disk, memory)
monitor performance indicators for services (slow queries, number of queries, ideally also some metrics on hitting quotas)
So far, I understand the options are the following (including some BUTs):
I used a very nice TOP cf-plugin (github)
This works very well. It seems that it registers itself to get the required firehose nozzles and consumes the data.
That is very useful for tracing / ad-hoc monitoring, but not very good for serious infrastructure monitoring.
Another way I found is the firehose-syslog solution.
This can be deployed as an app that (as far as I understand) does the job in a similar way to the TOP cf plugin.
The problem is that it requires a registered client so it can authenticate with the Doppler endpoint. For some reason, the top-cf-plugin does that automatically / in another way.
The last option I am considering is building the monitoring into the app itself (using a special buildpack).
That can be done, for example, with Datadog, but it also seems to require a dedicated UAA client to register the nozzle.
I would like to check whether somebody is (or was) on a similar road and has some findings.
I would also like to raise the following questions to the Swisscom community support:
is it possible to register a UAA client to be able to ingest events through the firehose nozzle from an external service? (This requires admin credentials, if I read correctly.)
is there an alternative way to authenticate with the nozzle (for example, using a special user and their authentication token)?
is there any alternative way to monitor CF deployments on Swisscom? Is there perhaps a paper, blog post, or other form of documentation that would be helpful in this respect (also for other users of AppCloud)?
Since it requires admin permissions, we cannot give out UAA clients for the firehose.
However, there are different ways to get metrics in context of a user.
CF API
You can obtain basic metrics of a specific app by polling the CF API:
https://apidocs.cloudfoundry.org/5.0.0/apps/get_detailed_stats_for_a_started_app.html
However, since you have to poll (once per app), it's not the recommended way.
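Still, a minimal polling sketch against that endpoint could look as follows; the API URL, app GUID, and token are placeholders (the token comes from cf oauth-token, the GUID from cf app <name> --guid):

```python
# Sketch: poll the CF v2 API for per-instance app stats (the endpoint
# linked above). API_URL, APP_GUID, and TOKEN are placeholders.
import requests

API_URL = "https://api.example.com"   # your CF API endpoint
APP_GUID = "00000000-0000-0000-0000-000000000000"
TOKEN = "bearer ..."                  # output of `cf oauth-token`

resp = requests.get(
    f"{API_URL}/v2/apps/{APP_GUID}/stats",
    headers={"Authorization": TOKEN},
)
resp.raise_for_status()

# The response maps instance index -> state plus usage stats.
for index, instance in resp.json().items():
    usage = instance["stats"]["usage"]
    print(f"instance {index}: cpu={usage['cpu']:.2%} "
          f"mem={usage['mem']} disk={usage['disk']}")
```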
Metrics in syslog drain
CF allows devs to forward their logs to syslog drains; in more recent versions, CF also sends metrics to this syslog drain (see https://docs.cloudfoundry.org/devguide/deploy-apps/streaming-logs.html#container-metrics).
For example, you could use Swisscom's Elasticsearch service to store these metrics and then analyze them using Kibana.
Metrics using loggregator (firehose)
The firehose allows streaming logs to clients in two modes:
streaming all logs to admins (which requires a UAA client with admin permissions), and streaming app logs and metrics to devs with permissions in the app's space. The latter is also what the cf logs command uses; cf top works the same way (it enumerates all apps and streams the logs of each app).
However, you will find out that most open source tools that leverage the firehose only work in admin mode, since they're written for the platform operator.
Of course, you also have the possibility to monitor your app by instrumenting it (a white-box approach), for example by configuring Spring Actuator in a Spring Boot app or by including an agent from your favourite APM vendor (Dynatrace, AppDynamics, ...).
I guess this is the most common approach; we've seen a lot of teams have success by instrumenting their applications, especially since advanced monitoring requires you to create your own metrics anyway, as the firehose-provided CPU/memory metrics are not that powerful in a microservice world.
However, option 2 would be worth a try as well, especially since the ELK stack's metrics support is getting better and better.

How to run multiple Kubernetes jobs in sequence?

I would like to run a sequence of Kubernetes jobs one after another. It's okay if they are run on different nodes, but it's important that each one run to completion before the next one starts. Is there anything built into Kubernetes to facilitate this? Other architecture recommendations also welcome!
This requirement to add control flow, even if it's a simple sequential flow, is outside the scope of Kubernetes native entities as far as I know.
There are many workflow engine implementations for Kubernetes, most of them are focusing on solving CI/CD but are generic enough for you to use however you want.
Argo: https://applatix.com/open-source/argo/
Adds a Workflow custom resource definition (CRD) to Kubernetes
Brigade: https://brigade.sh/
Takes a more serverless-like approach and is built on JavaScript, which is very flexible
Codefresh: https://codefresh.io
Has a unique approach where you can use the SaaS to get started easily without complicated installation and maintenance, and you can point Codefresh at your Kubernetes nodes to run the workflows on them.
Feel free to Google for "Kubernetes Workflow", and discover the right platform for yourself.
Disclaimer: I work at Codefresh
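If a full workflow engine feels heavy for a simple chain, a plain controller script is another option. Below is a minimal sketch using the official kubernetes Python client that waits for each Job to succeed before creating the next; the job_a/job_b/job_c manifests are hypothetical:

```python
# Sketch: run Kubernetes Jobs strictly in sequence by polling each Job's
# status before submitting the next one.
import time
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config()
batch = client.BatchV1Api()

def run_job_and_wait(manifest: dict, namespace: str = "default") -> None:
    name = manifest["metadata"]["name"]
    batch.create_namespaced_job(namespace=namespace, body=manifest)
    while True:
        status = batch.read_namespaced_job_status(name, namespace).status
        if status.succeeded:
            return
        if status.failed:
            # For simplicity, treat any failed pod as fatal here.
            raise RuntimeError(f"job {name} failed")
        time.sleep(5)

# job_a, job_b, job_c: hypothetical Job manifests (dicts or V1Job objects)
for manifest in [job_a, job_b, job_c]:
    run_job_and_wait(manifest)
```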
I would try using CronJobs and setting the concurrency policy to Forbid so it doesn't run concurrent jobs.
I have worked with IBM TWS (Workload Automation), which is a scheduler similar to cron jobs where you can specify dependencies between jobs.
You can specify that a job runs only after its dependencies have run, using the follows keyword.

difference between dcos-kafka-service and mesos-kafka

I'm doing a PoC to deploy Kafka as an application on a Mesos cluster. I came across these two codebases on GitHub: one developed under apache-mesos (github page) and the other developed by Mesosphere, which can run only on DC/OS (github page).
Question: I would like to know whether there are any differences between DC/OS Kafka and mesos-kafka in terms of features and extended functionality.
Regarding mesos-kafka:
I don't see active participation on GitHub (and there are some open issues) for mesos-kafka in the past months. Can I assume the service is robust enough to use in a production environment? Any input on this would be helpful.
kafka-mesos is a package that includes a release of Kafka and a custom Mesos scheduler that was meant to work around issues with running Kafka as a stateful service on Marathon. I think this post by Confluent is useful. It also includes a RESTful API for doing ops tasks and aims to include these features in the future (this is pulled from the article I linked):
Integrating Kafka commands (e.g. kafka-topics, etc) into the Scheduler so it can be used through the CLI and REST API.
Auto-scaling clusters (including auto reassignment of partitions) so that the resources (CPU, RAM, etc.) that brokers are using can be used elsewhere in known valleys of traffic.
Rack-aware partition assignment for fault tolerance.
Hooks so that producers and consumers can also be launched from the Scheduler and managed with the cluster.
Automated partition reassignment based on load and traffic
I haven't used it in a production environment myself but it has the support of Confluent which is a good sign.
DC/OS Kafka, on the other hand, is a DC/OS service which will probably only be useful if you are already running, or plan on running, services through Mesosphere's DC/OS. It also includes an API and a CLI management tool but is less ambitious with additional features. Its current feature set includes:
Single-command installation for rapid provisioning
Multiple clusters for multiple tenancy with DC/OS
High availability runtime configuration and software updates
Storage volumes for enhanced data durability, known as Mesos Dynamic Reservations and Persistent Volumes
Integration with syslog-compatible logging services for diagnostics and troubleshooting
Integration with statsd-compatible metrics services for capacity and performance monitoring