Using Kafka instead of Redis for queue purposes

I have a small project that uses Redis as a task queue. Here is how it basically works.
I have two components in the system: a desktop client (there can be more than one) and a server-side app. The server-side app has a pool of tasks for the desktop client(s). When a client connects, the first available task from the pool is given to it. Because the task has an id, when the desktop client comes back with the results, the server-side app can recognize the task by its id. Basically, I do the following in Redis:
Keep all the tasks as objects.
Keep queue (pool) of tasks in several lists: queue, provided, processing.
When a task is being provided to the desktop client, I use RPOPLPUSH in Redis to move the id from the queue list to the provided list.
When I get a response from the desktop client, I use LREM for the given task ID from the provided list (if it fails, I got a task that was not provided or was already processed, or just never existed - so, I break the execution). Then I use LPUSH to add the task id to the processing list. Given that I have unique task ids (controlled on the level of my app), I avoid duplicates in the Redis lists.
When the task is finished (the result received from the desktop client has been processed and saved), I remove the task from the processing list and delete the task object from Redis.
If anything goes wrong on any step (i.e. the task gets stuck on the processing or provided list), I can move the task back to the queue list and re-process it.
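The list workflow above can be sketched as a small in-memory model. This is only an illustration under stated assumptions: the class name TaskPool and its method names are hypothetical, plain Python containers stand in for the three Redis lists, and nothing here is atomic the way a single Redis command is.

```python
from collections import deque


class TaskPool:
    """In-memory stand-in for the three Redis lists (hypothetical names)."""

    def __init__(self, task_ids):
        self.queue = deque(task_ids)   # tasks waiting to be handed out
        self.provided = []             # handed to a client, no result yet
        self.processing = []           # result received, being saved

    def provide(self):
        """Mimics RPOPLPUSH queue provided: hand out one task id, or None."""
        if not self.queue:
            return None
        task_id = self.queue.pop()          # take from the right end
        self.provided.insert(0, task_id)    # push onto the left end
        return task_id

    def accept_result(self, task_id):
        """Mimics LREM on provided, then LPUSH on processing."""
        if task_id not in self.provided:
            return False                    # never provided, or already handled
        self.provided.remove(task_id)
        self.processing.insert(0, task_id)
        return True

    def finish(self, task_id):
        """Drop a fully processed task from the processing list."""
        self.processing.remove(task_id)

    def requeue_stuck(self):
        """Move everything left in provided/processing back to the queue."""
        stuck = self.provided + self.processing
        self.queue.extendleft(stuck)
        self.provided.clear()
        self.processing.clear()
        return stuck
```

A result for an id that is not on the provided list is rejected, which matches the "break the execution" branch described above.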
Now, the question: is it possible to do something similar in Apache Kafka? I do not need the exact behavior I have in Redis. All I need is to be able to provide a task to the desktop client (it shouldn't be possible to provide the same task twice) and to mark/change its state according to the actual processing status (new, provided, processing), so that I can control the process and restore tasks that were not processed due to some problem. If it's possible, could anyone please describe the applicable workflow?

It is possible for Kafka to act as a standard queue. Check the consumer group feature.
If the question is about appropriateness, please also refer to "Is Apache Kafka appropriate for use as a task queue?"
We are using Kafka as a task queue. One of the considerations in favor of Kafka was that it is already in our application ecosystem; we found that easier than adding one more component.

Related

Way to persist process spawned by a task on an agent?

I'm developing an Azure DevOps extension with tasks in it. In one of the tasks, I start a process and configure it. In another task, I access the same process's API to consume it. This works perfectly fine, but I notice that after the job is done, my process is killed. I was planning to allow the user to do the configuration on an agent and be able to access it in another job or pipeline.
Is there a way to persist a process on an agent? I feel like the agent kills every child process it created when it cleans up. Where can I find documentation on this?
Edit: I managed to find this thread that talks about a certain Process.clean variable but there's not any more information about it and I didn't find documentation on it.
Your feeling is correct. Agents clean up spawned processes when the job finishes, and that's by design. A single machine can have multiple agents on it, and multiple agents can be running tasks in parallel. What if you have one machine with 10 agents on it, and they all start up this process at once?
IMO, the approach you're taking is suspect. If you need to persist information across jobs, there are numerous ways to do so (for example, an output variable containing JSON) that don't involve spawning a service that continues running outside the scope of the job that started it.

Zero downtime deployment of Slack bot

We are developing a bot with BotKit and are trying to minimize deployment downtime.
There is a server with a Docker container running on it. Inside the container runs a bot-app instance connected to the Slack RTM server.
When I deploy a new version (v2) of bot-app, I want zero downtime: users should not see "bot is offline".
The deploy script starts a second Docker container with the new version of bot-app, which also connects to the RTM server. As a result, for a few seconds both apps are running, connected to the RTM server, and responding to user commands (so a user will see two answers to one command).
What is the best solution if, on the one hand, we want zero downtime and, on the other hand, we want to prevent a user from interacting with two instances at the same time?
Decision 1:
Accept the small chance of a collision, where both instances respond to a user command.
Decision 2:
Abandon zero-downtime deployment. In this case, the deploy script first stops the first Docker container, then starts another one. The app will not respond to user commands sent between stopping the current version of the app and the new version being fully started.
Decision 3:
Coordinate the current and new versions of the app while they run in parallel, for example with mutexes. General scheme:
1) Current version of app is running
2) Deploy script starts new version of app
3) When the new version of the app is almost up and ready to connect to the RTM server, it sends the current version a command to close its RTM connection.
4) Current version of app closes its RTM connection
5) New version of app opens its RTM connection
I think there are other good solutions.
How would you have solved this problem in your application?
(Sorry for the second reply; had another idea.)
The approach I described earlier would be pretty disruptive to your existing code, since you'd probably need to stop using botkit (or at least not use it to do the RTM API communication). An approach that may be less disruptive would be to use some sort of external way to signal that a given message has already been processed.
For example, using Redis, have the bot do the following command when a message comes in:
SET message:<message timestamp> 1 NX PX 30000
The NX option means this command will only succeed if the key doesn't already exist. So the first instance of the bot that manages to execute this will succeed, and the other instance will fail. The bot should only process the message and respond if this command succeeded.
(The PX 30000 sets a 30-second expiration so Redis doesn't get full of these keys.)
This should let you do your zero-downtime upgrades via overlapping the running bot instances without having to worry about a message being processed twice.
Note that it's still possible in this scheme for a message to be dropped altogether if a bot is shut down in a non-graceful way. (It could die just after calling the SET command but before it's actually dealt with the message.) A real queue with a two-phase "get/delete" would be better, but then you're back to my other answer. :-)
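To make the claim-once idea concrete, here is a minimal sketch that imitates the semantics of SET ... NX PX with an in-memory dictionary instead of a real Redis server. ClaimStore, set_nx_px, and handle_message are hypothetical names chosen for illustration; a real bot would issue the Redis command shown above through its Redis client.

```python
import time


class ClaimStore:
    """In-memory imitation of Redis `SET key 1 NX PX ttl` (illustration only)."""

    def __init__(self, clock=time.monotonic):
        self._expires = {}     # key -> expiry time in seconds
        self._clock = clock    # injectable so expiry can be simulated

    def set_nx_px(self, key, ttl_ms):
        """Return True if the key was set, False if it already exists."""
        now = self._clock()
        expiry = self._expires.get(key)
        if expiry is not None and expiry > now:
            return False       # another instance already claimed this key
        self._expires[key] = now + ttl_ms / 1000.0
        return True


def handle_message(store, instance, message_ts, replies):
    # Only the instance that wins the claim processes and answers the message.
    if store.set_nx_px(f"message:{message_ts}", 30000):
        replies.append((instance, message_ts))
```

If two instances call handle_message with the same timestamp, only the first call appends a reply; the second sees the existing key and stays silent.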
One idea I would consider is separating into two components:
A component that keeps a WebSocket connected to the Slack RTM API. This component simply reads messages from the API and puts them on to a queue. (Let's call this the "queuer.")
The actual "bot," which reads messages from the queue and responds as needed.
Depending on how your bot behaves, it can use the Web API directly or perhaps put its own messages on an outbound queue which the "queuer" can send via the RTM API.
This architecture probably solves your problem... you can now either take the bot down briefly while upgrading—responses will just be delayed until the new version is running—or you can run two versions of the bot at the same time and rely on the semantics of the queue to prevent both versions from responding to the same message.

Transactions across REST microservices?

Let's say we have a User, Wallet REST microservices and an API gateway that glues things together. When Bob registers on our website, our API gateway needs to create a user through the User microservice and a wallet through the Wallet microservice.
Now here are a few scenarios where things could go wrong:
User Bob creation fails: that's OK, we just return an error message to Bob. We're using SQL transactions so no one ever saw Bob in the system. Everything's good :)
User Bob is created but before our Wallet can be created, our API gateway hard crashes. We now have a User with no wallet (inconsistent data).
User Bob is created and as we are creating the Wallet, the HTTP connection drops. The wallet creation might have succeeded or it might have not.
What solutions are available to prevent this kind of data inconsistency from happening? Are there patterns that allow transactions to span multiple REST requests? I've read the Wikipedia page on Two-phase commit which seems to touch on this issue but I'm not sure how to apply it in practice. This Atomic Distributed Transactions: a RESTful design paper also seems interesting although I haven't read it yet.
Alternatively, I know REST might just not be suited for this use case. Would perhaps the correct way to handle this situation to drop REST entirely and use a different communication protocol like a message queue system? Or should I enforce consistency in my application code (for example, by having a background job that detects inconsistencies and fixes them or by having a "state" attribute on my User model with "creating", "created" values, etc.)?
What doesn't make sense:
distributed transactions with REST services. REST services by definition are stateless, so they should not be participants in a transactional boundary that spans more than one service. Your user registration use case scenario makes sense, but the design with REST microservices to create User and Wallet data is not good.
What will give you headaches:
EJBs with distributed transactions. It's one of those things that work in theory but not in practice. Right now I'm trying to make a distributed transaction work for remote EJBs across JBoss EAP 6.3 instances. We've been talking to RedHat support for weeks, and it didn't work yet.
Two-phase commit solutions in general. I think the 2PC protocol is a great algorithm (many years ago I implemented it in C with RPC). It requires comprehensive fail recovery mechanisms, with retries, state repository, etc. All the complexity is hidden within the transaction framework (ex.: JBoss Arjuna). However, 2PC is not fail proof. There are situations the transaction simply can't complete. Then you need to identify and fix database inconsistencies manually. It may happen once in a million transactions if you're lucky, but it may happen once in every 100 transactions depending on your platform and scenario.
Sagas (Compensating transactions). There's the implementation overhead of creating the compensating operations, and the coordination mechanism to activate compensation at the end. But compensation is not fail proof either. You may still end up with inconsistencies (= some headache).
What's probably the best alternative:
Eventual consistency. Neither ACID-like distributed transactions nor compensating transactions are fail proof, and both may lead to inconsistencies. Eventual consistency is often better than "occasional inconsistency". There are different design solutions, such as:
You may create a more robust solution using asynchronous communication. In your scenario, when Bob registers, the API gateway could send a message to a NewUser queue, and right-away reply to the user saying "You'll receive an email to confirm the account creation." A queue consumer service could process the message, perform the database changes in a single transaction, and send the email to Bob to notify the account creation.
The User microservice creates the user record and a wallet record in the same database. In this case, the wallet store in the User microservice is a replica of the master wallet store only visible to the Wallet microservice. There's a data synchronization mechanism that is trigger-based or kicks in periodically to send data changes (e.g., new wallets) from the replica to the master, and vice-versa.
But what if you need synchronous responses?
Remodel the microservices. If the solution with the queue doesn't work because the service consumer needs a response right away, then I'd rather remodel the User and Wallet functionality to be collocated in the same service (or at least in the same VM to avoid distributed transactions). Yes, it's a step farther from microservices and closer to a monolith, but will save you from some headache.
This is a classic question I was asked during an interview recently: how to call multiple web services and still preserve some kind of error handling in the middle of the task. Today, in high performance computing, we avoid two-phase commits. I read a paper many years ago about what was called the "Starbucks model" for transactions. Think about the process of ordering, paying for, preparing, and receiving the coffee you order at Starbucks. I am oversimplifying, but a two-phase commit model would suggest that the whole process is a single transaction wrapping all the steps involved until you receive your coffee. With this model, however, all employees would wait and stop working until you get your coffee. You see the picture?
Instead, the "Starbucks model" is more productive: it follows a "best effort" model and compensates for errors in the process. First, they make sure that you pay! Then, there are message queues with your order attached to the cup. If something goes wrong in the process (you did not get your coffee, it is not what you ordered, etc.), we enter the compensation process and make sure you get what you want or refund you. This is the most efficient model for increased productivity.
Sometimes Starbucks wastes a coffee, but the overall process is efficient. There are other tricks to consider when you build your web services, such as designing them so that they can be called any number of times and still provide the same end result. So, my recommendation is:
Don't be too fine when defining your web services (I am not convinced about the micro-service hype happening these days: too many risks of going too far);
Async increases performance so prefer being async, send notifications by email whenever possible.
Build more intelligent services to make them "recallable" any number of times, processing with an uid or taskid that will follow the order bottom-top until the end, validating business rules in each step;
Use message queues (JMS or others) and divert to error handling processors that will apply operations to "rollback" by applying opposite operations, by the way, working with async order will require some sort of queue to validate the current state of the process, so consider that;
In last resort, (since it may not happen often), put it in a queue for manual processing of errors.
Let's go back with the initial problem that was posted. Create an account and create a wallet and make sure everything was done.
Let's say a web service is called to orchestrate the whole operation.
Pseudo code of the web service would look like this:
Call the Account creation microservice, passing it some information and a unique task id. The Account creation microservice will first check whether that account was already created. A task id is associated with the account's record. The microservice detects that the account does not exist, so it creates it and stores the task id. NOTE: this service can be called 2000 times and it will always produce the same result. The service answers with a "receipt that contains minimal information to perform an undo operation if required".
Call Wallet creation, giving it the account ID and task id. Let's say a condition is not valid and the wallet creation cannot be performed. The call returns with an error but nothing was created.
The orchestrator is informed of the error. It knows it needs to abort the Account creation but it will not do it itself. It will ask the Account service to do it by passing the "minimal undo receipt" received at the end of step 1.
The Account service reads the undo receipt and knows how to undo the operation; the undo receipt may even include information about another microservice it could have called itself to do part of the job. In this situation, the undo receipt could contain the Account ID and possibly some extra information required to perform the opposite operation. In our case, to simplify things, let's say the undo is simply to delete the account using its account id.
Now, let's say the web service never received the success or failure (in this case) response indicating that the Account creation's undo was performed. It will simply call the Account's undo service again. This service should normally never fail, because its goal is for the account to no longer exist. So it checks whether the account exists and sees that nothing is left to undo. So it returns that the operation is a success.
The web service returns to the user that the account could not be created.
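The steps above can be sketched in a few lines. This is a simplified, in-process illustration, not a real microservice setup: AccountService, WalletService, register, and the shape of the receipt dictionary are all hypothetical, and ordinary method calls stand in for the remote calls.

```python
class AccountService:
    """Hypothetical account service with idempotent create and undo."""

    def __init__(self):
        self.accounts = {}     # account_id -> task_id that created it

    def create(self, account_id, task_id):
        # Idempotent: repeating the call with the same task id changes nothing.
        existing = self.accounts.get(account_id)
        if existing is not None and existing != task_id:
            raise ValueError("account already exists")
        self.accounts[account_id] = task_id
        # The "undo receipt": minimal information needed to reverse the step.
        return {"account_id": account_id, "task_id": task_id}

    def undo(self, receipt):
        # Idempotent: undoing an account that is already gone still succeeds.
        if self.accounts.get(receipt["account_id"]) == receipt["task_id"]:
            del self.accounts[receipt["account_id"]]
        return True


class WalletService:
    """Hypothetical wallet service; `fail` simulates an invalid condition."""

    def __init__(self, fail=False):
        self.wallets = {}
        self.fail = fail

    def create(self, account_id, task_id):
        if self.fail:
            raise RuntimeError("wallet cannot be created")
        self.wallets[account_id] = task_id


def register(accounts, wallets, account_id, task_id):
    """Orchestrator: create the account, then the wallet, compensating on error."""
    receipt = accounts.create(account_id, task_id)
    try:
        wallets.create(account_id, task_id)
    except RuntimeError:
        accounts.undo(receipt)     # compensation; safe to call again
        return False
    return True
```

Because both create and undo are keyed by the task id, the orchestrator can repeat either call after a lost response without creating or deleting anything twice.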
This is a synchronous example. We could have managed it in a different way and put the case into a message queue targeted at the help desk if we don't want the system to completely recover from the error by itself. I've seen this being done in a company where not enough hooks could be provided to the back end system to correct situations. The help desk received messages describing what was performed successfully and had enough information to fix things, just as our undo receipt could be used in a fully automated way.
I did a search, and the Microsoft website has a pattern description for this approach. It is called the compensating transaction pattern:
Compensating transaction pattern
All distributed systems have trouble with transactional consistency. The best way to do this is like you said, have a two-phase commit. Have the wallet and the user be created in a pending state. After it is created, make a separate call to activate the user.
This last call should be safely repeatable (in case your connection drops).
This will necessitate that the last call know about both tables (so that it can be done in a single JDBC transaction).
Alternatively, you might want to think about why you are so worried about a user without a wallet. Do you believe this will cause a problem? If so, maybe having those as separate rest calls are a bad idea. If a user shouldn't exist without a wallet, then you should probably add the wallet to the user (in the original POST call to create the user).
IMHO one of the key aspects of microservices architecture is that the transaction is confined to the individual microservice (Single responsibility principle).
In the current example, the User creation would be an own transaction. User creation would push a USER_CREATED event into an event queue. Wallet service would subscribe to the USER_CREATED event and do the Wallet creation.
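A minimal sketch of this event flow, assuming a toy in-process bus: EventBus, create_user, and on_user_created are hypothetical names, and the synchronous publish call stands in for a durable, asynchronous broker.

```python
from collections import defaultdict


class EventBus:
    """Toy synchronous publish/subscribe bus (stand-in for a real broker)."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self._handlers[event_type]:
            handler(payload)


users = {}      # owned by the User service
wallets = {}    # owned by the Wallet service
bus = EventBus()


def create_user(user_id):
    users[user_id] = {"id": user_id}                    # User service's own transaction
    bus.publish("USER_CREATED", {"user_id": user_id})   # then announce it


def on_user_created(event):
    # setdefault keeps the handler harmless if the event is delivered twice.
    wallets.setdefault(event["user_id"], {"balance": 0})


bus.subscribe("USER_CREATED", on_user_created)
create_user("bob")
```

Each service only writes its own store; the wallet appears as a consequence of the event rather than inside the user's transaction.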
If my wallet was just another bunch of records in the same sql database as the user then I would probably place the user and wallet creation code in the same service and handle that using the normal database transaction facilities.
It sounds to me like you are asking about what happens when the wallet creation code requires you to touch another system or systems. I'd say it all depends on how complex and/or risky the creation process is.
If it's just a matter of touching another reliable datastore (say one that can't participate in your sql transactions), then depending on the overall system parameters, I might be willing to risk the vanishingly small chance that second write won't happen. I might do nothing, but raise an exception and deal with the inconsistent data via a compensating transaction or even some ad-hoc method. As I always tell my developers: "if this sort of thing is happening in the app, it won't go unnoticed".
As the complexity and risk of wallet creation increases you must take steps to ameliorate the risks involved. Let's say some of the steps require calling multiple partner apis.
At this point you might introduce a message queue along with the notion of partially constructed users and/or wallets.
A simple and effective strategy for making sure your entities eventually get constructed properly is to have the jobs retry until they succeed, but a lot depends on the use cases for your application.
I would also think long and hard about why I had a failure prone step in my provisioning process.
One simple solution is to create the user with the User Service and use a messaging bus on which the User Service emits its events. The Wallet Service registers on the messaging bus, listens for the User Created event, and creates a wallet for the user. In the meantime, if the user goes to the Wallet UI to see his wallet, check whether the user was just created and show a message such as "your wallet creation is in progress, please check back in a while".
What solutions are available to prevent this kind of data inconsistency from happening?
Traditionally, distributed transaction managers are used. A few years ago in the Java EE world you might have created these services as EJBs which were deployed to different nodes and your API gateway would have made remote calls to those EJBs. The application server (if configured correctly) automatically ensures, using two phase commit, that the transaction is either committed or rolled back on each node, so that consistency is guaranteed. But that requires that all the services be deployed on the same type of application server (so that they are compatible) and in reality only ever worked with services deployed by a single company.
Are there patterns that allow transactions to span multiple REST requests?
For SOAP (ok, not REST), there is the WS-AT specification, but no service that I have ever had to integrate has supported it. For REST, JBoss has something in the pipeline. Otherwise, the "pattern" is to either find a product which you can plug into your architecture, or build your own solution (not recommended).
I have published such a product for Java EE: https://github.com/maxant/genericconnector
According to the paper you reference, there is also the Try-Cancel/Confirm pattern and associated Product from Atomikos.
BPEL Engines handle consistency between remotely deployed services using compensation.
Alternatively, I know REST might just not be suited for this use case. Would perhaps the correct way to handle this situation to drop REST entirely and use a different communication protocol like a message queue system?
There are many ways of "binding" non-transactional resources into a transaction:
As you suggest, you could use a transactional message queue, but it will be asynchronous, so if you depend on the response it becomes messy.
You could write the fact that you need to call the back end services into your database, and then call the back end services using a batch. Again, async, so can get messy.
You could use a business process engine as your API gateway to orchestrate the back end microservices.
You could use remote EJB, as mentioned at the start, since that supports distributed transactions out of the box.
Or should I enforce consistency in my application code (for example, by having a background job that detects inconsistencies and fixes them or by having a "state" attribute on my User model with "creating", "created" values, etc.)?
Playing devils advocate: why build something like that, when there are products which do that for you (see above), and probably do it better than you can, because they are tried and tested?
In the microservices world, communication between services should go either through a REST client or through a message queue. There are two ways to handle transactions across services, depending on how the services communicate. I personally prefer a message-driven architecture, so that a long transaction is a non-blocking operation for the user.
Let's take your example to explain it:
Create user BOB with event CREATE USER and push the message to a message bus.
Wallet service subscribed to this event can create a wallet corresponding to the user.
The one thing you have to take care of is selecting a robust, reliable messaging backbone that can persist state in case of failure. You can use Kafka or RabbitMQ as the messaging backbone. There will be a delay in execution because of eventual consistency, but the user can easily be kept up to date through socket notifications. A notification service or task manager framework can update the state of the transactions through an asynchronous mechanism such as sockets and help the UI show the proper progress.
Personally, I like the idea of microservices (modules defined by the use cases), but as your question mentions, they have adoption problems in classical businesses like banks, insurance, telecom, etc.
Distributed transactions, as many have mentioned, are not a good choice. People are now moving toward eventually consistent systems, but I am not sure this will work for banks, insurance, etc.
I wrote a blog post about my proposed solution; maybe it can help you:
https://mehmetsalgar.wordpress.com/2016/11/05/micro-services-fan-out-transaction-problems-and-solutions-with-spring-bootjboss-and-netflix-eureka/
Eventual consistency is the key here.
One of the services is chosen to become primary handler of the event.
This service will handle the original event with single commit.
Primary handler will take responsibility for asynchronously communicating the secondary effects to other services.
The primary handler will do the orchestration of other services calls.
The commander is in charge of the distributed transaction and takes control. It knows the instruction to be executed and will coordinate executing them. In most scenarios there will just be two instructions, but it can handle multiple instructions.
The commander takes responsibility for guaranteeing the execution of all instructions, and that means retries.
When the commander tries to effect the remote update and doesn't get a response, it has to retry.
This way the system can be configured to be less prone to failure and it heals itself.
As we have retries we have idempotence.
Idempotence is the property of being able to do something twice in such a way that the end result is the same as if it had been done only once.
We need idempotence at the remote service or data source so that, in the case where it receives the instruction more than once, it only processes it once.
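The interplay of retries and idempotence can be illustrated with a small simulation. Everything here is hypothetical: RemoteService stands in for the remote side, TimeoutError simulates a lost response, and the instruction id is the key the remote side uses to recognize a repeat.

```python
class RemoteService:
    """Hypothetical remote side that applies each instruction id only once."""

    def __init__(self, drop_first_n_responses=0):
        self.applied = {}                  # instruction_id -> stored result
        self.stock = 1
        self._drops = drop_first_n_responses

    def decrement_stock(self, instruction_id):
        if instruction_id not in self.applied:
            self.stock -= 1                # the real work happens once
            self.applied[instruction_id] = self.stock
        if self._drops > 0:
            self._drops -= 1
            raise TimeoutError("response lost")   # work done, reply never arrives
        return self.applied[instruction_id]


def send_with_retries(service, instruction_id, max_attempts=5):
    """Commander side: retry until a response arrives or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return service.decrement_stock(instruction_id), attempt
        except TimeoutError:
            continue
    raise RuntimeError("gave up after retries")
```

Even when the first responses are lost and the commander sends the same instruction several times, the stock is decremented only once.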
Eventual consistency
This solves most distributed transaction challenges; however, we need to consider a couple of points here.
Every failed transaction will be followed by a retry, the amount of attempted retries depends on the context.
Consistency is eventual, i.e., the system is in an inconsistent state while a retry is pending. For example, suppose a customer orders a book and pays for it, and the system then updates the stock quantity. If the stock update fails and that was the last copy in stock, the book will still appear to be available until the retry of the stock update succeeds. After the retry succeeds, your system will be consistent again.
Why not use API Management (APIM) platform that supports scripting/programming? So, you will be able to build composite service in the APIM without disturbing micro services. I have designed using APIGEE for this purpose.

How to handle large amounts of scheduled tasks on a web server?

I'm developing a website (using a LAMP stack) which must handle many user-made scheduled tasks. It works as follows: a user creates an event and sets a date, and other users (as many as 63) may join. A few hours before the set date, the system must email each user subscribed to that event. And that's it.
However, I have never handled scheduling, and the only tools I know (poorly) are cron and at. My plan is to create an at job for each event, which will call a script that gets all subscribers emails and mails them.
My question is: is my plan/design good? Is it scalable? Are there better options that I should be aware of?
Why a separate cron job for each event? I've done something similar for a newsletter, with a cron job that runs just once per hour and, if there are any newsletters to be sent, handles them. In your case you'd have a script that runs once every hour and gets a list of users for events that happen in the desired time interval.
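A minimal sketch of such a periodic job, written in Python for illustration rather than in the PHP a LAMP site would use. The event dictionaries, the three-hour lead time, and the names events_to_notify and run_once are all assumptions; in practice the events would come from a database query and send_email would be a real mailer.

```python
from datetime import datetime, timedelta


def events_to_notify(events, now, lead=timedelta(hours=3)):
    """Return events starting within `lead` of `now` that were not notified yet."""
    return [
        e for e in events
        if not e["notified"] and now <= e["starts_at"] <= now + lead
    ]


def run_once(events, now, send_email):
    """One run of the hourly job: mail subscribers of every event that is due."""
    for event in events_to_notify(events, now):
        for address in event["subscribers"]:
            send_email(address, event["name"])
        event["notified"] = True     # so the next hourly run skips this event
```

The notified flag is what lets a single recurring job replace one scheduled job per event: running the script again in the same window sends nothing twice.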
It will work. As far as scalability goes, at a minimum make sure that the script runs in its own process so it doesn't bog down the server unnecessarily.
Create a php-cli script perhaps?
I'm doing most of my work in Rails nowadays, and there's a wealth of background processing libraries. One of them is Resque, which uses the Redis server to keep track of the jobs.
I found a PHP clone https://github.com/chrisboulton/php-resque
Might be overkill for your use case, but give it a shot perhaps
If you would consider a proper framework that uses an application server (and not a simple webserver), Spring has a task scheduling layer that's simple to use. Scheduling jobs on the server really requires more than what a simple LAMP install can do, but I haven't used PHP in a while so maybe there's an equivalent.
Here's an article that compares some of your options.

How to Implement Queue Based Workflow System?

I'm working on a document management system. An example workflow would be something like this:
A document is emailed to the system
The system does a number of preparatory actions to the document
Document is presented to a user for further processing
Afterwards, document is sent to Quality Assurance
Afterwards, the system does a number of post-processing actions to the document
Document is considered completely processed and disseminated (e.g. emailed back to whoever emailed the document to the system, etc.)
Since the volume of my input will vary (but will usually be high), I am very concerned about scalability.
For example, say the system has already downloaded the email attachments. If the attachments are PDF documents, the system needs to split each PDF into individual pages, then convert each page into thumbnails of multiple sizes, etc. I plan to have a cron job check (say, every minute) whether there are any PDF documents that need to be processed. Using a flagging system (e.g. "PDF Document Ready to be Processed"), I can check the database for all PDF documents that are flagged to be processed. Once the PDF processing is done, the flag can be updated to say "PDF Processing Done."
However, since the processing of each PDF document is very time consuming, I am concerned that when the next cron job is executed, that cron job will also try to process the PDFs that the previous cron job is still processing.
A possible solution is to immediately flag the PDF documents with "PDF Document Currently Being Processed." That way, when the next cron job is executed, it will exclude the ones already being processed.
Thus, each step in the workflow will probably have 3 flags:
PDF Document Ready to be Processed
PDF Document Currently Being Processed
PDF Processing Done
Same for QA:
Document Ready for QA
Document Currently Being QAd
Document QA Done
Is this a good approach? Is there a better approach? Would I have these flags as a single column of the "PDF Document" table in the database? Or should the flags be its own table (e.g. especially if a document can have multiple flags set).
I'd like to solicit suggestions on how to implement such a system.
To solve your concern about concurrent processing on the same document, you can use many scheduler packages to help you manage this aspect. http://www.quartz-scheduler.org/ is one I've used with great success.
To address your problem, I'd have the 3 states, received, queued, processed (similar to what you suggest).
I'd have a scheduled recurring job which polls the database, looking for received pdfs, and for each, queue a job to process and mark the pdf as queued. If you ensure this happens in the same transaction, and utilize optimistic locking, there is no risk another job could come along and re-read this as received.
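The claim step can be sketched with Python's built-in sqlite3 module. This is an illustration under assumptions: the documents table, its status and version columns, and the function name claim_next_pdf are hypothetical, and an in-memory SQLite database stands in for the real one.

```python
import sqlite3


def claim_next_pdf(conn):
    """Move one 'received' document to 'queued'; return its id, or None."""
    with conn:  # one transaction: commit on success, roll back on error
        row = conn.execute(
            "SELECT id, version FROM documents WHERE status = 'received' "
            "ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        doc_id, version = row
        cur = conn.execute(
            "UPDATE documents SET status = 'queued', version = version + 1 "
            "WHERE id = ? AND status = 'received' AND version = ?",
            (doc_id, version),
        )
        # rowcount 0 means another job changed the row first (lock lost).
        return doc_id if cur.rowcount == 1 else None


# Demo setup: an in-memory database with two documents waiting.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, status TEXT, version INTEGER)"
)
conn.executemany(
    "INSERT INTO documents (status, version) VALUES (?, 0)",
    [("received",), ("received",)],
)
conn.commit()
```

The version check in the WHERE clause is the optimistic lock: if two pollers read the same row, only the first UPDATE matches, and the second poller gets no row to work on.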
Quartz uses a thread pool with many configuration options, and is great for deferred, resource-intensive processing (I use it for image thumbnailing in a server setting).
To take a step back, there are some great workflow packages in the java world which can handle most of what you want to do, including the deferred pdf processing. Take a look at jbpm or drools flow, these are two great, if complex, packages.
UPDATE: Drools Flow has been merged into JBPM. For this particular problem it may be a bit of "killing a mosquito with a bazooka" situation, but it's a great workflow package.
The solution depends on which technologies you are using to implement this system: is the pre/post-processing done by the same software/language as the emailing software? Additionally, are they running in separate processes?
If you have distributed components you could do much worse than investigating an AMQP solution like RabbitMQ, as this takes care of putting each job into a queue, and making sure that only one of your consumers takes each job. (we'd model each thumbnailing job as individual tasks).
If, however, the entire system is implemented in one language and inside a single process, there are some simpler systems you can use:
Resque is a good solution for Ruby
In Java, a LinkedBlockingQueue would work well
I'm sure C# has some way of creating a queue of jobs (disclaimer: I know nothing of C#)
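For a single-process system, the same idea can be sketched in Python with the standard library's thread-safe queue.Queue, which plays a role similar to Java's LinkedBlockingQueue. The function name run_jobs and the fake "thumbnail:" result strings are placeholders for real work.

```python
import queue
import threading


def run_jobs(jobs, worker_count=2):
    """Process each job exactly once using a shared thread-safe queue."""
    pending = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            job = pending.get()      # blocks until a job is available
            if job is None:          # sentinel: no more work for this worker
                break
            with lock:
                results.append(f"thumbnail:{job}")   # placeholder for real work

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for job in jobs:
        pending.put(job)
    for _ in threads:
        pending.put(None)            # one sentinel per worker
    for t in threads:
        t.join()
    return results
```

Each job is taken off the queue by exactly one worker, so no extra flagging is needed to keep two workers from processing the same item.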