Distributed systems with a large number of different types of jobs - distributed-computing

I want to create a distributed system that can support around 10,000 different types of jobs. A single machine can host only 500 such jobs, because each job needs some data pre-loaded into memory, and that data can't be kept in a cache. Each job must have redundancy for availability.
I have explored open-source tools such as ZooKeeper and Hadoop, but neither solves my problem.
The easiest solution I can think of is to maintain a map from each job type to the machines that host it. But how can I support dynamic allocation of job types across my fleet? And how do I handle machine failures so that each job type is available on at least one machine at any point in time?
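To make the map idea concrete, here is a minimal in-memory sketch in Python. It is an illustration only, not a production design: the class name Placement, the least-loaded placement rule, and the replica count of 2 are assumptions made for the example, and a real system would keep this map in a replicated store and detect failures with heartbeats.

```python
from collections import defaultdict


class Placement:
    """Hypothetical job-type -> machines map with a per-machine capacity."""

    def __init__(self, machines, capacity=500, replicas=2):
        self.capacity = capacity                    # max job types per machine
        self.replicas = replicas                    # copies wanted per job type
        self.hosts = defaultdict(set)               # job type -> machines hosting it
        self.load = {m: set() for m in machines}    # machine -> job types it hosts

    def _least_loaded(self, exclude):
        # Pick the emptiest machine that has free capacity and does not
        # already host this job type (ties broken by machine name).
        candidates = [
            m for m, jobs in self.load.items()
            if m not in exclude and len(jobs) < self.capacity
        ]
        if not candidates:
            raise RuntimeError("no machine with free capacity")
        return min(candidates, key=lambda m: (len(self.load[m]), m))

    def assign(self, job_type):
        # Add replicas until the job type has the desired number of hosts.
        while len(self.hosts[job_type]) < self.replicas:
            machine = self._least_loaded(exclude=self.hosts[job_type])
            self.hosts[job_type].add(machine)
            self.load[machine].add(job_type)

    def fail(self, machine):
        # Remove a dead machine and re-place every job type it was hosting.
        orphaned = self.load.pop(machine)
        for job_type in orphaned:
            self.hosts[job_type].discard(machine)
            self.assign(job_type)
```

With two replicas per job type, losing one machine still leaves every job type available on another machine while fail() restores the second copy elsewhere. Note that the capacity arithmetic changes with redundancy: 10,000 job types with 2 replicas each at 500 per machine needs at least 40 machines, plus headroom to absorb failures.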

Based on the answers you gave in the comments, I suggest an MQ-based (message queue) architecture. What I propose is the following:
Take the input from users and push it into a distributed message queue. This means setting up a message queue (such as ActiveMQ or RabbitMQ) on several servers. The MQ can replicate incoming requests for fault tolerance, and it gives you a fully asynchronous end-to-end system.
Once the MQ layer is in place, set up your computing server layer. Some number of computing servers (~20 in your case) read requests from the message queue and start a job for each request. Because the MQ is distributed, load can be balanced well across the computing servers. In addition, each server can run as many jobs as you want (~500 in your case), based on the requests it reads from the MQ.
Regarding failures, a computing server should remove (acknowledge) a message from the MQ only once the job has completed. If a server crashes, the job is still in the MQ and another server can pick it up. If the job saves state somewhere or updates something, you then have to handle the possibility that it runs more than once.
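The acknowledge-after-completion rule can be sketched with a toy in-memory queue in Python. This is not the ActiveMQ or RabbitMQ API: AckQueue, work, and done_ids are hypothetical names used only to show the ordering of "process, then acknowledge" and one simple way to skip duplicate runs.

```python
import collections


class AckQueue:
    """Toy stand-in for a broker queue with explicit acknowledgements."""

    def __init__(self):
        self._ready = collections.deque()   # messages waiting for a consumer
        self._unacked = {}                  # delivered but not yet acknowledged

    def publish(self, msg_id, body):
        self._ready.append((msg_id, body))

    def receive(self):
        # Hand out the next message, but remember it until it is acked.
        if not self._ready:
            return None
        msg_id, body = self._ready.popleft()
        self._unacked[msg_id] = body
        return msg_id, body

    def ack(self, msg_id):
        self._unacked.pop(msg_id, None)

    def requeue_unacked(self):
        # Roughly what a broker does when a consumer disappears:
        # unacknowledged messages become available again.
        for msg_id, body in self._unacked.items():
            self._ready.append((msg_id, body))
        self._unacked.clear()


def work(queue, handler, done_ids):
    """Drain the queue, acknowledging each message only after it is handled."""
    while (item := queue.receive()) is not None:
        msg_id, body = item
        if msg_id not in done_ids:      # skip jobs that already completed
            handler(body)               # if this raises, the message stays unacked
            done_ids.add(msg_id)
        queue.ack(msg_id)
```

If handler raises (the "server crash"), the message is never acknowledged, so after requeue_unacked() another worker can process it. The done_ids set is the simplest possible duplicate guard; in a real deployment it would have to live in shared, durable storage.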
The good point about this approach is that it is very scalable. If you have more jobs to handle in the future, you can add a computing server, connect it to the MQ, and process more requests without any other change to the system. In addition, MQ features such as priority-based queuing help you prioritize requests and process them based on job type.
P.S. Your question does not give any details about the type and parameters of the system, so this is only a draft solution. If you provide more details, the community may be able to help you more.


Real Time Streaming With Multiple Data Sources Using Kafka

We are planning to build a real-time monitoring system with Apache Kafka. The overall idea is to push data from multiple data sources to Kafka and perform data quality checks. I have a few questions about this architecture:
What are the best approaches for streaming data to Apache Kafka from multiple sources, which mainly include Java applications, Oracle databases, REST APIs, and log files? Note that each client deployment includes each of these data sources, so the number of data sources pushing data to Kafka would be equal to the number of customers * x, where x is the number of data source types listed above. Ideally, a push approach would suit us better than a pull approach. With a pull approach, the target system would have to be configured with the credentials of the various source systems, which would not be practical.
How do we handle failures?
How do we perform data quality checks on the incoming messages? For example, if a message does not have all the required attributes, it could be discarded and an alert raised for the maintenance team to check.
Kindly let me know your expert inputs. Thanks !
I think the best approach here is to use Kafka Connect: link
However, it is a pull-based approach:
Kafka Connect sources are pull-based for a few reasons. First, although connectors should generally run continuously, making them pull-based means that the connector/Kafka Connect decides when data is actually pulled, which allows for things like pausing connectors without losing data, brief periods of unavailability as connectors are moved, etc. Second, in distributed mode the tasks that pull data may need to be rebalanced across workers, which means they won't have a consistent location or address. While in standalone mode you could guarantee a fixed network endpoint to work with (and point other services at), this doesn't work in distributed mode where tasks can be moving around between workers. Ewen
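For the data quality question, the check itself is independent of how the messages reach Kafka. The following Python sketch shows one possible shape for it, assuming messages are JSON objects. The field names in REQUIRED_FIELDS and the alert callback are hypothetical, and the Kafka consumer that would feed raw_messages is deliberately left out.

```python
import json

# Hypothetical set of attributes every message must carry.
REQUIRED_FIELDS = {"customer_id", "source", "timestamp", "payload"}


def check_message(raw):
    """Return (message, None) if the message is valid, else (None, reason)."""
    try:
        msg = json.loads(raw)
    except (TypeError, ValueError):
        return None, "not valid JSON"
    if not isinstance(msg, dict):
        return None, "not a JSON object"
    missing = REQUIRED_FIELDS - msg.keys()
    if missing:
        return None, "missing fields: " + ", ".join(sorted(missing))
    return msg, None


def filter_stream(raw_messages, alert):
    """Yield valid messages; discard invalid ones and report them via alert."""
    for raw in raw_messages:
        msg, problem = check_message(raw)
        if problem:
            alert(problem, raw)     # e.g. notify the maintenance team
        else:
            yield msg
```

Instead of dropping bad messages outright, a common variation is to write them to a separate topic so they can be inspected later; that choice is outside the scope of this sketch.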

Strategy for distributed-computing inside microservices architecture?

I am looking for advice for the following problem:
I am working with other people on a microservices architecture where the microservices are distributed on different machines. Resources on the machines are very limited.
Currently, communication runs through a message broker.
In my use case, one microservice occasionally needs to run some heavy computation. I would like to perform the computation on a machine with low CPU usage and enough available memory space.
My first idea is that every machine installs a microservice which publishes its CPU usage and available memory to the message broker. Each microservice that needs to distribute its workload looks for the fittest machines and installs "worker" microservices on them on the fly. Results are published to the message broker. Since resources are limited, worker microservices are uninstalled when they are no longer needed.
I haven't found a similar use case yet. Do you guys know a better existing solution?
I am quite new to the topic of microservices and distributed computing, so I would appreciate some advice and help.

How to submit "tasks" in parallel on a server

First, happy new year to everybody and happy coding for 2017.
I have 1M "tasks" to run using Python. Each task takes around 2 minutes and processes some local images. I would like to run as many as possible in parallel, automatically. My server has 40 cores, so I started looking into multiprocessing, but I see the following issues:
Keeping a log of each task is not easy (I am working on it, but so far I haven't succeeded, even though I found many examples on Stack Overflow).
How do I know how many CPUs I should use, and how many should be left to the server for basic server tasks?
When there are multiple users on the server, how can we see how many CPUs are already in use?
In my previous life as a physicist at CERN, we used a job submission system to submit tasks to many clusters. Tasks were put in a queue and processed when a slot became available. Is there such a tool for a Linux server as well? I don't know the correct English name for such a tool (job dispatcher?).
The best would be a tool that we can configure to use our N CPUs as "vehicles" to process tasks in parallel (while keeping enough CPUs free for the server to run its basic tasks), that puts the jobs of all users in a queue with priorities, and that processes them as "vehicles" become available. A bonus would be a way to monitor task processing.
I hope I am using the correct word to describe what I want.
What you are talking about is generally referred to as a "pool of workers". It can be implemented using threads or processes. The implementation choice depends on your workflow.
A pool of workers allows you to choose the number of workers to use. Furthermore, the pool usually has a queue in front of the workers to decouple them from your main logic.
If you want to run tasks within a single server, you can use either multiprocessing.Pool or one of the concurrent.futures executors (ThreadPoolExecutor or ProcessPoolExecutor).
If you want to distribute tasks over a cluster, there are several solutions. Celery and Luigi are good examples.
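As a rough single-server sketch of the pool-of-workers idea, here is a version using concurrent.futures in Python. The names process_image and run_all are placeholders invented for this example, and the task body is a stand-in for the real image processing. Each task's outcome is logged from the parent process as it completes, which sidesteps the difficulty of logging from inside worker processes.

```python
import logging
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed


def process_image(path):
    """Placeholder for the real per-image work (hypothetical)."""
    return path.upper()


def run_all(paths, max_workers, executor_cls=ProcessPoolExecutor):
    """Run process_image on every path using a pool of max_workers workers.

    Returns (results, failures): dicts keyed by path. A failing task is
    recorded and logged instead of aborting the whole run.
    """
    results, failures = {}, {}
    with executor_cls(max_workers=max_workers) as pool:
        # The pool queues submitted tasks and feeds them to free workers.
        futures = {pool.submit(process_image, p): p for p in paths}
        for fut in as_completed(futures):
            path = futures[fut]
            try:
                results[path] = fut.result()
                logging.info("done %s", path)
            except Exception as exc:
                failures[path] = exc
                logging.error("failed %s: %s", path, exc)
    return results, failures


# Example call (from a script guarded by `if __name__ == "__main__":`):
#     ok, bad = run_all(["a.png", "b.png"], max_workers=36)
```

Here max_workers is the knob for "how many CPUs to use"; the value 36 in the comment is only an example of leaving a few of 40 cores free. Passing executor_cls=ThreadPoolExecutor gives the thread-based variant, which can suit I/O-bound tasks. For a million tasks, submitting them in batches rather than all at once would keep memory use down.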
How many CPUs other users are consuming is not your concern as a user. Modern operating systems do a pretty good job of sharing resources between multiple users. If overcommitting resources becomes a concern, the sysadmin should make sure it does not happen by assigning quotas per user. This can be done in plenty of ways. One example tool that sysadmins should be familiar with is ulimit.
To put it another way: your software should not do what operating systems are for, namely abstracting the underlying machine to offer your software a "limitless" set of resources. Whoever manages the server should be the person telling you: "use at most X CPUs".
What you were using at CERN was probably a system like Mesos. These solutions aggregate large clusters into a single set of resources against which you can schedule tasks. This only works, though, if all users access the cluster through it.
If you are sharing a server with other people, either you agree on quotas together or you all adopt a common scheduling framework such as Celery.

Parallelism behaviour of stream processing engines

I have been learning Storm and Samza in order to understand how stream processing engines work, and I realized that both of them are standalone applications: in order to process an event, I need to add it to a queue that is also connected to the stream processing engine. That means I add the event to a queue (which is also a standalone application, let's say Kafka), and Storm picks the event from the queue and processes it in a worker process. And if I have multiple bolts, each bolt will be processed by a different worker process. (This is one of the things I don't really understand: I have seen a company that uses more than 20 bolts in production, and each event is transferred between bolts along a certain path.)
However, I don't really understand why I would need such complex systems. The process involves too many I/O operations (my program -> queue -> Storm -> bolts), and that makes it much harder to control and debug.
Instead, if I'm collecting the data from web servers, why not just use the same node for event processing? The operations will already be distributed over the nodes by the load balancers I use for the web servers. I can create executors in the same JVM instances and send the events from the web server to the executor asynchronously, without involving any extra I/O requests. I can also watch the executors from the web servers and make sure that the executor processed the events (an at-least-once or exactly-once processing guarantee). This way, it will be a lot easier to manage my application, and since not much I/O is required, it will be faster than the other way, which involves sending the data to another node over the network (which is also not reliable) and processing it on that node.
Most probably I'm missing something here, because I know that many companies actively use Storm, and many people I know recommend Storm or other stream processing engines for real-time event processing, but I just don't understand it.
My understanding is that the goal of using a framework like Storm is to offload the heavy processing (whether cpu-bound, I/O-bound or both) from the application/web servers and keep them responsive.
Consider that each application server may have to serve a large number of concurrent requests, not all of them having to do with stream processing. If the app server is already processing a significant load of events, it could become a bottleneck for lighter requests, as the server's resources (think CPU usage, memory, disk contention, etc.) will already be tied up by the heavier processing requests.
If the actual load you need to face isn't that heavy, or if it can simply be handled by adding app server instances, then of course it doesn't make sense to complicate your architecture/topology, which could in fact slow the entire thing down. It really depends on your performance and load requirements, as well as on how much (virtual) hardware you can throw at the problem. As usual, benchmarking based on your load requirements will help you decide which way to go.
You are right that sending data across the network adds to the total processing time.
However, these frameworks (Storm, Spark, Samza, Flink) were created to process large amounts of data that potentially do not fit in the memory of one computer. By using more than one computer to process the data, we can achieve parallelism.
As for your question about network latency: yes, this is a trade-off to consider. Developers have to be aware that they are writing programs to be deployed on a parallel framework. The way they build the application also influences how much data is transferred over the network.

Interprocess messaging - MSMQ, Service Broker,?

I'm in the planning stages of a .NET service which continually processes incoming messages, which involves various transformations, database inserts and updates, etc. As a whole, the service is huge and complicated, but the individual tasks it performs are small, simple, and well-defined.
For this reason, and in order to allow for easy expansion in future, I want to split the service into several smaller services which basically perform part of the processing before passing it onto the next service in the chain.
In order to achieve this, I need some kind of intermediary messaging system that will pass messages from one service to another. I want this to happen in such a way that if a link in the chain crashes or is taken offline briefly, the messages will begin to queue up and get processed once the destination comes back online.
I've always used message queuing for this type of thing, but have recently been made aware of SQL Service Broker which appears to do something similar. Is SQLSB a viable alternative for this scenario and, if so, would I see any performance benefits by using that instead of standard Message Queuing?
It sounds to me like you may be after a service bus architecture. This would provide you with the coordination and fault tolerance you are looking for. I'm most familiar with, and partial to, NServiceBus, but there are others, including MassTransit and Rhino Service Bus.
If most of these steps initiate from a database state and end up in a database update, then merging your message storage with your data storage makes a lot of sense:
a single product to backup/restore
consistent state backups
a single high-availability/disaster recoverability solution (DB mirroring, clustering, log shipping etc)
database scale storage (IO capabilities, size and capacity limitations etc as per the database product characteristics, not the limits of message store products).
a single product to tune, troubleshoot, administer
In addition, there are serious performance considerations: having your message store be the same as the data store means you are not required to do a two-phase commit on every message interaction. Using a separate message store requires you to enroll the message store and the data store in a distributed transaction (even if it is on the same machine), which requires a two-phase commit and is much slower than the single-phase commit of database-only transactions.
In addition using a message store in the database as opposed to an external one has advantages like queryability (run SELECT over the message queues).
Now if we translate the abstract terms 'message store in the database' as being Service Broker and 'non-database message store' as being MSMQ, you can see my point about why SSB will run circles around MSMQ any time.
My recent experience with both approaches (starting with SQL Server Service Broker) has left me longing to get my messages out of SQL Server. The problem is quasi-political, but you might want to consider it: in my organisation, SQL Server is managed by a specialized DBA, while application servers (i.e. messaging such as NServiceBus) are managed by developers and the network team. Any change to the database servers requires painful performance analysis from the DBA and is accompanied by the fear that our queuing engine, living in the same space, might bring down the standard SQL workload.
SSSB is pretty difficult to manage (not unlike messaging middleware), but the difference is that I have more room to screw something up in the messaging world (the worst that can happen is a pile of messages building up somewhere and logs filling up), whereas I can't afford any mistakes in the SQL world, where customer transactional data lives and is vital to the business (including data from legacy systems). I really don't want to get those 'unexpected database growth', 'wait time alert', or 'why is my tempdb growing without end' emails anymore.
I've learned that application servers are cheap. Just add message handlers, add machines... easy. Virtually no license costs. With SQL Server it is exactly the opposite. It now appears to me that using Service Broker for messaging is like using an expensive car to plow a potato field. It is much better suited to other things.