How to download multiple files from one server and upload them to another concurrently in a microservice architecture (concurrent piping) - Scala

I'm writing a (scala/jvm) microservice that is part of a CI solution.
Its job is to download built artifacts from some external build tool on the cloud and upload them to repositories from which they can be consumed, such as a docker registry or maven style repositories like Nexus.
There are many such files for each build and many such builds going on at any given time, so the problem to solve is one of scale.
My microservice is integrated with an event queue (kafka), so it's easy to asynchronously assign tasks to workers.
I'm looking for the best way to manage my resources: cluster nodes, threads, I/O, memory, storage - to handle the download and upload of all the files, preferably without storing the entire content of each file locally on disk or in memory, but instead piping it from the source server to the target server.
I'm not sure what the best approach is for writing the pipe code itself, or how best to use the workers.
I was thinking of dispatching an event per file-to-pipe, and having each worker pipe that one file by performing a GET against the source server and a POST against the target server, creating an in-memory pipe between the two streams with some buffer.
In this scenario the source and target servers could have different transfer speeds, and I'm not sure whether that's a problem or not. I think this should be handled by TCP/IP at the OS level and is nothing for me to deal with at the application level. I think that if I use different thread pools for the download client and the upload client, I can expect decent use of non-blocking I/O to perform the pipe.
Alternatively, I could do something else entirely: some sort of producer/consumer setup where some workers download files while others upload them. That means more storage, shared storage at that, plus a custom configuration for this microservice, none of which I'm excited about.
Any other suggestions/insights are also welcome.
The eventual solution should (hopefully) be robust, scalable and as simple as possible.

Are you positive the source cloud service is not going to offer an "export file to Nexus" feature in the near/medium future? If it might, maybe your solution does not have to be fully efficient.
I would look at FS2 for this job https://github.com/functional-streams-for-scala/fs2/blob/series/1.0/README.md#example
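To make the FS2 suggestion a bit more concrete, here is a minimal sketch of a single-file pipe using http4s (which is built on FS2); the URLs, artifact names, and the choice of the Ember client are illustrative assumptions, not part of the question. Because the upload pulls the download's body stream chunk by chunk, backpressure between the two servers is handled for you without buffering the whole file.

import cats.effect.{IO, IOApp}
import org.http4s.{Method, Request, Uri}
import org.http4s.ember.client.EmberClientBuilder

// Sketch: stream one artifact from the build tool straight into the target
// repository without writing it to disk. All URLs below are placeholders.
object PipeOneFile extends IOApp.Simple {

  def run: IO[Unit] =
    EmberClientBuilder.default[IO].build.use { client =>
      val download = Request[IO](
        Method.GET,
        Uri.unsafeFromString("https://build-tool.example.com/artifacts/foo.jar"))

      // client.run keeps the download response open while its body is consumed
      client.run(download).use { resp =>
        val upload = Request[IO](
          Method.PUT,
          Uri.unsafeFromString("https://nexus.example.com/repository/libs/foo.jar"))
          .withBodyStream(resp.body) // resp.body is a Stream[IO, Byte], pulled as the upload writes

        client.status(upload).flatMap(status => IO.println(s"upload finished: $status"))
      }
    }
}

Each Kafka-triggered worker would run one such pipe; the separate download/upload thread pool concern largely goes away because everything runs on the same non-blocking runtime.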

Related

Spring Cloud Data Flow: versioned streams

I'm implementing a stream pipe with Spring Cloud Data Flow.
My problem is that I configured the pipe manually (e.g. http | log_sink) on the server, and that configuration will be lost if I reset the server (think of an Amazon EC2 instance that can be hard-reset).
What is the suggested way to keep pipes versioned when using SCDF?
Thanks.
I am summarizing the discussion from the comments.
To automate the promotion of Stream/Task workloads from lower to higher environments, the recommended approach is SCDF's Java DSL. With it, users can programmatically register, create, deploy, or launch streams/tasks in a repeatable manner, and across many different platforms simultaneously if there is a need for that. The Boot app built with the Java DSL can be versioned in Git, which makes it CD/GitOps friendly. With sufficient generalization, the same app can also be reused by many different teams by overriding the defaults.
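Not part of the original discussion, but for illustration, Java DSL usage looks roughly like the following (written here in Scala since it is a plain JVM library); the server address and stream name are made up, and the exact builder/constructor signatures can vary between SCDF versions, so follow the Java DSL section of the reference guide for your release.

import java.net.URI
import org.springframework.cloud.dataflow.rest.client.DataFlowTemplate
import org.springframework.cloud.dataflow.rest.client.dsl.Stream

object RecreatePipes extends App {
  // Placeholder SCDF server address
  val dataFlow = new DataFlowTemplate(URI.create("http://scdf.example.com:9393"))

  // Declare the same "http | log" pipe that was previously created by hand,
  // so it can be re-applied from source control after a server reset.
  Stream.builder(dataFlow)
    .name("http-to-log")
    .definition("http | log")
    .create()
    .deploy()
}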
We use this approach in the product itself for our IT and acceptance tests, which run daily on every upstream commit across multiple Kubernetes and Cloud Foundry installations.
Alternatively, all of the register, create, deploy, or launch stream/task commands can also be dumped into a text or properties file. Once you have the file, the dataflow:>script --file command can slurp in all the commands in each new environment; see the docs.
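For illustration only (the stream name and definition here are made up), such a file is just the shell commands you would otherwise type interactively:

stream create --name http-to-log --definition "http | log"
stream deploy --name http-to-log

It can then be replayed in a fresh environment with the script --file command mentioned above.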

Using Zookeeper or equivalent for .NET configuration management?

I have a proprietary CMS that keeps a lot of configuration (20k lines) in files on disk. I have quite a few nodes, all with the same configuration except for one or two elements that designate the node name and the IP.
Since this is proprietary I do not have a lot of leverage for going in and completely overhauling the configuration loading to look at an endpoint, though I might be able to be creative.
My questions are simple, but I do not know a better place to ask them:
Is this a use case for distributed configuration management like Zookeeper? Ideally I'd like to spin up a box and have it look for a service endpoint to load config files rather than have the config files deployed through source. This way I can update the configuration in one place, and have it replicate to all nodes without doing a full deployment.
Can Zookeeper (or an equivalent) mimic a file system? Could I mount an NFS point and have it expose the configuration as if it were files on the filesystem, even if these are symbolic constructs? Does this make sense?
Your configuration use case sounds more like a job for Chef, Puppet, or a similar system. They will allow you to update the configuration in one place, keep it version controlled, and distribute it properly to all target nodes.
Zookeeper makes sense when your application/service needs to dynamically fetch fresh configuration data during live operation, and when multiple nodes in your system need the same consistent view of that data. If you don't have these requirements, Zookeeper is probably too much overhead for just laying down mostly static config files on disk.
As for mimicking a filesystem, there is zkfuse, which you could use to mount it. But again, it doesn't look like that is what you want. Zookeeper should not be used as an actual file system replacement or file distribution system. It is best for storing small bits of metadata that need to be consistent across your distributed system.

Persistent storage for Apache Mesos

Recently I discovered Apache Mesos.
It all looks amazing in the demos and examples, and I can easily imagine how one would run stateless jobs - that fits the whole idea naturally.
But how does one deal with long-running jobs that are stateful?
Say I have a cluster that consists of N machines (scheduled via Marathon), and I want to run a PostgreSQL server there.
That's it - at first I don't even want it to be highly available, just a single job (Dockerized, in fact) that hosts a PostgreSQL server.
1- How would one organize it? Constrain the server to a particular cluster node? Use some distributed FS?
2- DRBD, MooseFS, GlusterFS, NFS, CephFS - which of those play well with Mesos and services like Postgres? (I'm thinking here of the possibility that Mesos/Marathon could relocate the service if it goes down.)
3- Please tell me if my approach is wrong in terms of philosophy (a DFS for data servers and some kind of switchover for servers like Postgres on top of Mesos).
Question largely copied from Persistent storage for Apache Mesos, asked by zerkms on Programmers Stack Exchange.
Excellent question. Here are a few upcoming features in Mesos to improve support for stateful services, and corresponding current workarounds.
Persistent volumes (0.23): When launching a task, you can create a volume that exists outside of the task's sandbox and will persist on the node even after the task dies/completes. When the task exits, its resources -- including the persistent volume -- can be offered back to the framework, so that the framework can launch the same task again, launch a recovery task, or launch a new task that consumes the previous task's output as its input.
Current workaround: Persist your state in some known location outside the sandbox, and have your tasks try to recover it manually. Maybe persist it in a distributed filesystem/database, so that it can be accessed from any node.
Disk Isolation (0.22): Enforce disk quota limits on sandboxes as well as persistent volumes. This ensures that your storage-heavy framework won't be able to clog up the disk and prevent other tasks from running.
Current workaround: Monitor disk usage out of band, and run periodic cleanup jobs.
Dynamic Reservations (0.23): Upon launching a task, you can reserve the resources your task uses (including persistent volumes) to guarantee that they are offered back to you upon task exit, instead of going to whichever framework is furthest below its fair share.
Current workaround: Use the slave's --resources flag to statically reserve resources for your framework upon slave startup.
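To illustrate that last workaround (the role name and resource values are made up), a static reservation for a hypothetical 'stateful' role is passed on agent startup roughly like this:

mesos-slave --master=zk://zk1.example.com:2181/mesos \
            --resources='cpus(stateful):4;mem(stateful):8192;disk(stateful):102400'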
As for your specific use case and questions:
1a) How would one organize it? You could do this with Marathon, perhaps creating a separate Marathon instance for your stateful services, so that you can create static reservations for the 'stateful' role, such that only the stateful Marathon will be guaranteed those resources.
1b) Constrain a server to a particular cluster node? You can do this easily in Marathon, constraining an application to a specific hostname, or to any node with a specific attribute value (e.g. NFS_Access=true). See Marathon Constraints; a sketch of such an app definition follows this list. If you only wanted to run your tasks on a specific set of nodes, you would only need to create the static reservations on those nodes. And if you need discoverability of those nodes, you should check out Mesos-DNS and/or Marathon's HAProxy integration.
1c) Use some distributed FS? The data replication provided by many distributed filesystems would guarantee that your data can survive the failure of any single node. Persisting to a DFS would also provide more flexibility in where you can schedule your tasks, although at the cost of the difference in latency between network and local disk. Mesos has built-in support for fetching binaries from HDFS uris, and many customers use HDFS for passing executor binaries, config files, and input data to the slaves where their tasks will run.
2) DRBD, MooseFS, GlusterFS, NFS, CephFS? I've heard of customers using CephFS, HDFS, and MapRFS with Mesos. NFS would seem an easy fit too. It really doesn't matter to Mesos what you use, as long as your task knows how to access it from whichever node it's placed on.
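Here is the sketch referenced in 1b: a rough Marathon app definition (the image, hostname, and paths are placeholders, not a tested configuration) that pins a single Dockerized Postgres instance to one node and mounts a host directory for its data.

{
  "id": "/postgres",
  "instances": 1,
  "cpus": 1,
  "mem": 2048,
  "container": {
    "type": "DOCKER",
    "docker": { "image": "postgres:9.4" },
    "volumes": [
      { "hostPath": "/data/postgres", "containerPath": "/var/lib/postgresql/data", "mode": "RW" }
    ]
  },
  "constraints": [["hostname", "CLUSTER", "node-7.example.com"]]
}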
Hope that helps!

Good practices for WebSphere MQ production deployment

I'm about to prepare a deployment specification for the WebSphere MQ production environment. As always, I hate reinventing the wheel, hence the question:
Is there an article or specification of best practices for deploying and maintaining a WebSphere MQ production environment?
Here are more specific doubts of mine:
Configuration versioning (MQSC, dmpmqcfg, etc).
Deploying new objects (MQSC or manual instructions?)
Deployment automation (maybe based on a diff of dmpmqcfg output?).
Deploying and versioning configuration alterations.
Currently I am simply creating MQ objects manually and versioning the output of dmpmqcfg. However, in a while there are going to be too many deployments to handle it like this.
That's an extremely broad question so I'll try to respond before a moderator deletes it. :-)
The answer depends on many things, such as whether MQ clusters are in use, the approach to high availability and disaster recovery, the security requirements, whether the QMgrs are configured as dedicated or shared infrastructure, etc. However, there are a few patterns that I follow in almost all cases, including non-Production. This is because things like monitoring and security tend to get dropped at deployment time if they are not tested in Dev, and then don't work as expected in Prod.
I use a script to create my QMgrs in Production to ensure that basics like generating the X.509 certificate (or CSR) are always done according to standards, that any exits or exit parm files are present, that certain SupportPac executables (like q) are present in /opt/mqm/bin, circular queues, etc. It also checks for negative factors such as GSKit not being installed.
I have a baseline script that is run against all QMgrs. This script sets up the DLQ, any queues for monitoring agents, enables events as required, sets up system services, trigger monitors, listeners, etc. The exception is B2B gateway QMgrs which are handled in a class all their own and have very specific configurations not used on the internal network.
I have several classes of QMgr with specific configuration requirements. These include cluster repositories (where primary and secondary are distinct sub-types), service-provider QMgrs, and service consumer QMgrs. These all have secondary scripts run against them.
I have scripts per-cluster to join or suspend a QMgr in cases where clustering is used (which for me is almost 100% since v7.1).
These set up a QMgr's infrastructure. Then I maintain scripts for each application. So for example, if there's a Payroll app, I'll have queues and possibly topics with names containing a PAY node such as PAY.EMPLOYEE.UPDT.REQ.V032.PRD. Corresponding to that will be a single script for all PAY.** queues. Used to be one for setmqaut commands too, but these are now in the same script as the objects. I only ever have one version of the script and keep a history of changes in the script. This way when I need to recreate a QMgr, I just run all the scripts for it. Similarly, if I need to deploy the PAY objects on another QMgr, I just copy the script to that server.
When defining objects for clusters, I always do a DEFINE NOREPLACE that contains all the run-time attributes, such as whether the queue is enabled in the cluster. The queue is always defined as disabled in the cluster and for triggering, but because I use NOREPLACE, re-running the script doesn't change whatever state it is in, say, a month later. Things that are configuration and not run-time state, such as the description, are handled in an ALTER immediately after the DEFINE, and these are updated each time the script is run. There's an article on this here.
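A hedged illustration of that pattern (the queue name, cluster name, and attribute values are examples, not the actual configuration): run-time state is only set on the first run, while configuration is re-applied on every run.

* Created disabled; NOREPLACE means later runs leave its run-time
* state (enabled/disabled in the cluster, triggering) untouched.
DEFINE QLOCAL('PAY.EMPLOYEE.UPDT.REQ.V032.PRD') +
       CLUSTER('APPCLUS') +
       PUT(DISABLED) +
       GET(DISABLED) +
       NOTRIGGER +
       NOREPLACE

* Pure configuration is always re-applied.
ALTER QLOCAL('PAY.EMPLOYEE.UPDT.REQ.V032.PRD') +
      DESCR('Payroll employee update requests') +
      MAXDEPTH(50000)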
Finally, the scripts I use are of the self-executing, self-documenting variety. For example, many people put all the MQSC commands into a script then do something like:
runmqsc < payroll.mqsc > payroll.out
TONS of problems here. The main one is that it relies on the operator to know a lot and execute the script right all the time. For example, suppose (s)he forgets to capture the output? Or overwrites a previous output? Or doesn't get STDERR because (s)he needs to do the 2>&1 at the end and doesn't know redirection that well?
So my scripts are all written in ksh, handle all the capturing of output (complete with time and date stamping and STDERR), can freely mix MQSC with OS commands, etc. All you do is go to the scripts directory for that QMgr and run . ./*ksh to build/rebuild a QMgr.
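A stripped-down sketch of that idea (the QMgr name, paths, and file names are illustrative, not the author's actual scripts):

#!/usr/bin/ksh
# Apply the payroll MQSC definitions to one QMgr and keep a
# timestamped log that also captures STDERR.
QMGR=PAYQM01
STAMP=$(date +%Y%m%d.%H%M%S)
LOG=/var/mqm/scripts/${QMGR}/logs/payroll.${STAMP}.out

echo "=== ${QMGR} payroll objects ${STAMP} ===" > "${LOG}"
runmqsc "${QMGR}" < payroll.mqsc >> "${LOG}" 2>&1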
I do of course also take regular configuration dumps, but these are more for running queries and reports like "how many QMgrs have this channel defined and where are they?" kind of thing.
Also, when taking backups there is almost NEVER a good reason to back up a QMgr at a point in time. However, if it is required, be sure to stop the QMgr first. Also, think long and hard about capturing certificates in a backup. Many people are good about locking down the certificate directory so only mqm can read it, but often the backups are unprotected. As long as you aren't trying to restore on top of Production, many shops let you restore the Production /var/mqm/* files to your own sandbox. If the QMgr's KDB files are included, you have just lost control of them. An alternative is to put the certificates in /etc or some other directory that is protected but not backed up with the QMgr's directories.

Is there a way to make jobs in Jenkins mutually exclusive?

I have a few jobs in Jenkins that use Selenium to modify a database through a website's front end. If some of these jobs run at the same time, errors due to dirty reads can result. Is there a way to force certain jobs in Jenkins to be unable to run at the same time? I would prefer not to have to place or pick up a lock on the database, which could be read or modified by any number of users who are also testing.
You want the Throttle Concurrent Builds plugin which lets you define global and per-node semaphores.
The Locks and Latches plugin is being deprecated in favor of Throttle Concurrent Builds.
I've tried both the Locks & Latches plugin and the Port Allocator plugin as ways to achieve what you're trying to do. Neither worked reliably for me. Locks & Latches worked some of the time, but I'd occasionally get hung jobs. Using Port Allocator as a hack will work unless you have multiple Jenkins nodes, but the config overhead is kind of high. What I've ultimately settled on is another hack, but it works reliably and uses core Jenkins stuff (no plugins):
create a slave node running on the same box as the master (or not, if you have lots of boxes)
give this slave a single executor (this is the key)
tie the 2 (or n) jobs that must not run together to this new slave node
optionally set the slave's usage to 'tied jobs only' if it would be a problem for your other jobs to happen to run on the new slave
Since the slave has only one executor, the jobs tied to it can never run together.
If you regard the database as a shared resource that can only be used exclusively, then this fits the use case of the Lockable Resources plugin.
It is being actively developed and improved and is very versatile.