Is it inefficient for Cadence to use the database to implement scheduled tasks? - uber-api

Cadence stores its scheduled tasks in the database, so database polling is necessarily involved. As the amount of data grows, how does Cadence keep query efficiency high and delay timing accurate?

Cadence uses sharding to scale out. Each shard has its own timer queue. If a single DB instance cannot keep up, multiple DB hosts can be used and Cadence history service shards are distributed across all the hosts.
My talk, Designing a Workflow Engine from First Principles, goes into this in more detail. It talks about Temporal, which is a fork of Cadence, but that part of the design is exactly the same.
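To make the sharding idea concrete, here is a minimal, purely illustrative sketch of a per-shard timer poller. This is not Cadence's actual code; the timer_tasks table and its columns are assumptions. Because each poller scans only its own shard's rows, the working set per query stays small even as total data grows, and different shards can live on different database hosts.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.Instant;

// Illustrative only -- NOT Cadence's actual implementation.
// Assumes a hypothetical timer_tasks(shard_id, visibility_ts, task_id, payload) table.
public class ShardTimerQueue {
    private final Connection conn;
    private final int shardId;

    public ShardTimerQueue(Connection conn, int shardId) {
        this.conn = conn;
        this.shardId = shardId;
    }

    /** Fetch timers that are due for this shard only, so each scan stays small. */
    public void pollOnce() throws SQLException {
        String sql = "SELECT task_id, payload FROM timer_tasks "
                   + "WHERE shard_id = ? AND visibility_ts <= ? "
                   + "ORDER BY visibility_ts LIMIT 100";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, shardId);
            ps.setTimestamp(2, Timestamp.from(Instant.now()));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    fire(rs.getLong("task_id"), rs.getString("payload"));
                }
            }
        }
    }

    private void fire(long taskId, String payload) {
        // Deliver the due timer to the workflow logic, then delete/ack the row.
    }
}
```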

Related

Are all distributed databases designed to process data in parallel?

I am learning about the characteristics of distributed databases, and I came across this website that describes some of their advantages:
https://www.atlantic.net/cloud-hosting/about-distributed-databases-and-distributed-data-systems/
According to that site, the advantages of distributed databases are listed below:
Reliability – Building an infrastructure is similar to investing: diversify to reduce your chances of loss. Specifically, if a failure occurs in one area of the distribution, the entire database does not experience a setback.
Security – You can give permissions to single sections of the overall database, for better internal and external protection.
Cost-effective – Bandwidth prices go down because users are accessing remote data less frequently.
Local access – Similarly to #1 above, if there is a failure in the umbrella network, you can still get access to your portion of the database.
Growth – If you add a new location to your business, it’s simple to create an additional node within the database, making distribution highly scalable.
Speed & resource efficiency – Most requests and other interactivity with the database are performed at a local level, also decreasing remote traffic.
Responsibility & containment – Because any glitches or failures occur locally, the issue is contained and can potentially be handled by the IT staff designated to handle that piece of the company.
However, parallelism (I mean not concurrent writes, but processing data in parallel on each node) is not on the list. This makes me wonder: are all distributed databases (e.g. MongoDB, Cassandra, HBase) designed to process data in parallel? If this is false, which distributed databases support parallel processing and which ones don't?
To find out what I mean by Parallel Processing (not concurrent write), please see: https://softwareengineering.stackexchange.com/questions/190719/the-difference-between-concurrent-and-parallel-execution

IExecutorService's main motive?

I know about the high-availability and scalability advantages of Hazelcast, but I just want to ask about the main purpose of the distributed executor service, and I have some questions in mind. Kindly answer the following questions:
If the client load on the server is only in the form of blocking I/O requests (database queries, etc.), is there a need to use IExecutorService, or is a ThreadPoolExecutor enough for this scenario?
If the client load on the server is only in the form of CPU-intensive requests but the request rate is high, then IExecutorService can serve this scenario better on a cluster. Is this statement true?
The main motive of IExecutorService is to handle the load of CPU-intensive requests on the cluster by horizontal scaling. Is this statement true?
If the client load on the server is only in the form of blocking I/O requests (database queries, etc.), is there a need to use IExecutorService, or is a ThreadPoolExecutor enough for this scenario?
It depends. It doesn't need to be only CPU-intensive tasks. For example, if each task requires doing a lot of I/O, but that resource is scalable, e.g.:
- the local file system of a member machine,
- another cluster (maybe there is a big Cassandra cluster) that stores data,
it could still be a good use case for HZ.
If you are using HZ to scale up by doing remote calls to a db, it could very well be that you bring the database to its knees :)
If the client load on the server is only in the form of CPU-intensive requests but the request rate is high, then IExecutorService can serve this scenario better on a cluster. Is this statement true?
It depends. You pay a price for the RPC, so if you have very small tasks, it could very well be that the IExecutorService is not your friend. For similar reasons it could be that a local Executor is not your friend either, because there could be huge contention on the executor's work queue.
So whether it makes sense to use the IExecutorService, or even an Executor at all, depends on the type of task being processed.
The main motive of IExecutorService is to handle the load of CPU-intensive requests on the cluster by horizontal scaling. Is this statement true?
See answer #1
There are no absolute answers to your questions; it depends very much on a lot of factors.
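As a rough illustration of the trade-off, the sketch below (the cluster setup and task body are assumptions) submits the same CPU-bound task to a local ThreadPoolExecutor and to a Hazelcast IExecutorService. The distributed submit can draw on the CPUs of every cluster member (horizontal scaling), but pays serialization and RPC costs per task, which is why it only wins for tasks that are large enough.

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IExecutorService;

import java.io.Serializable;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ExecutorComparison {
    // A CPU-bound task; must be Serializable so Hazelcast can ship it to another member.
    static class CpuTask implements Callable<Long>, Serializable {
        @Override
        public Long call() {
            long sum = 0;
            for (long i = 0; i < 50_000_000L; i++) sum += i;
            return sum;
        }
    }

    public static void main(String[] args) throws Exception {
        // Local: bounded by the cores of this one JVM/host.
        ExecutorService local = Executors.newFixedThreadPool(4);
        Future<Long> localResult = local.submit(new CpuTask());

        // Distributed: the task is serialized and may run on any cluster member,
        // so adding members adds CPU capacity, at an RPC cost per task.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IExecutorService distributed = hz.getExecutorService("workers");
        Future<Long> remoteResult = distributed.submit(new CpuTask());

        System.out.println(localResult.get() + " / " + remoteResult.get());
        local.shutdown();
        hz.shutdown();
    }
}
```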

Reuse Mongo's internal distributed locks

I need a distributed lock implementation for my application. I have a number of independent worker processes, and I need to enforce the restriction that only one of them can work on a given account at a time.
The application is written in C# with a MongoDB layer. I noticed that Mongo's cluster balancer uses a distributed lock mechanism to control which mongos is doing the balancing, and I was wondering if I could reuse the same mechanism in my app.
I'd rather not have the overhead of implementing my own distributed lock mechanism, and since all the worker processes already interface with Mongo, it would be great if I could reuse its implementation.
There is no inherent document-level locking or distributed lock driver API in MongoDB.
MongoDB's internal locks for sharding splits & migrations use a two phase commit pattern against the shard cluster's config servers. You could implement a similar pattern yourself, and there is an example in the MongoDB documentation: Perform Two Phase Commits.
This is likely overkill if you just need a semaphore to prevent workers simultaneously updating the same account document. A more straightforward approach would be to add an advisory lock field (or embedded doc) to your account document to indicate the worker process that is currently using the document. The lock could be set when the worker starts and removed when it finishes. You probably want the lock information to include both a worker process ID and timestamp, so stale locks can be found & removed.
Note that any approach requires coordination amongst your worker processes to check and respect your locking implementation.
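As a sketch of that advisory-lock pattern, here is roughly what it could look like with the MongoDB Java driver (the question's app is C#, and the collection, field, and timeout names here are assumptions). The atomic findOneAndUpdate acquires the lock only if the account is currently unlocked or its existing lock is stale.

```java
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.FindOneAndUpdateOptions;
import com.mongodb.client.model.ReturnDocument;
import com.mongodb.client.model.Updates;
import org.bson.Document;
import org.bson.conversions.Bson;

import java.util.Date;

public class AccountLock {
    private static final long STALE_MS = 5 * 60 * 1000; // treat locks older than 5 min as stale

    private final MongoCollection<Document> accounts =
            MongoClients.create("mongodb://localhost:27017")
                        .getDatabase("app")
                        .getCollection("accounts");

    /** Try to acquire an advisory lock on one account; returns true on success. */
    public boolean tryLock(Object accountId, String workerId) {
        Date staleCutoff = new Date(System.currentTimeMillis() - STALE_MS);
        Bson unlockedOrStale = Filters.or(
                Filters.exists("lock", false),
                Filters.lt("lock.ts", staleCutoff));
        // Atomic: matches only if no fresh lock exists, then stamps our worker id + timestamp.
        Document updated = accounts.findOneAndUpdate(
                Filters.and(Filters.eq("_id", accountId), unlockedOrStale),
                Updates.set("lock", new Document("worker", workerId).append("ts", new Date())),
                new FindOneAndUpdateOptions().returnDocument(ReturnDocument.AFTER));
        return updated != null;
    }

    /** Release the lock, but only if this worker still holds it. */
    public void unlock(Object accountId, String workerId) {
        accounts.updateOne(
                Filters.and(Filters.eq("_id", accountId), Filters.eq("lock.worker", workerId)),
                Updates.unset("lock"));
    }
}
```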

Why do we need message brokers like RabbitMQ over a database like PostgreSQL?

I am new to message brokers like RabbitMQ, which we can use to create task/message queues for a scheduling system like Celery.
Now, here is the question:
I can create a table in PostgreSQL that new tasks are appended to and that a consumer program like Celery reads from.
Why on earth would I want to set up a whole new piece of tech for this, like RabbitMQ?
Now, I believe scaling cannot be the answer, since a database like PostgreSQL can work in a distributed environment.
I googled what problems a database poses for this particular use case, and I found:
polling keeps the database busy and low performing
locking of the table -> again low performing
millions of rows of tasks -> again, polling is low performing
Now, how does RabbitMQ or any other message broker like that solves these problems?
Also, I found out that it follows the AMQP protocol. What's so great about that?
Can Redis also be used as a message broker? I find it more analogous to Memcached than RabbitMQ.
Please shed some light on this!
Rabbit's queues reside in memory and will therefore be much faster than implementing this in a database. A (good) dedicated message queue should also provide essential queuing-related features such as throttling/flow control and the ability to choose different routing algorithms, to name a couple (Rabbit provides these and more). Depending on the size of your project, you may also want the message-passing component separate from your database, so that if one component experiences heavy load, it need not hinder the other's operation.
As for the problems you mentioned:
Polling keeping the database busy and low performing: Using RabbitMQ, producers can push updates to consumers, which is far more performant than polling. Data is simply sent to the consumer when it needs to be, eliminating the need for wasteful checks (see the consumer sketch after this list).
Locking of the table -> again low performing: There is no table to lock :P
Millions of rows of tasks -> again, polling is low performing: As mentioned above, RabbitMQ will operate faster because it resides in RAM and provides flow control. If needed, it can also use the disk to temporarily store messages if it runs out of RAM. After 2.0, Rabbit has significantly improved its RAM usage. Clustering options are also available.
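To illustrate the push model, here is a minimal sketch using the RabbitMQ Java client (the queue name and connection details are assumptions): the broker delivers each message to the consumer's callback as it arrives, so there is no polling loop and no table to scan.

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;

import java.nio.charset.StandardCharsets;

public class TaskConsumer {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();

        // Durable queue so tasks survive a broker restart.
        channel.queueDeclare("tasks", true, false, false, null);
        // Don't hand this consumer more than one unacked task at a time (flow control).
        channel.basicQos(1);

        DeliverCallback onDelivery = (consumerTag, delivery) -> {
            String task = new String(delivery.getBody(), StandardCharsets.UTF_8);
            process(task);
            // Ack only after the work is done, so a crash re-queues the task.
            channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
        };
        // The broker pushes messages to this callback; nothing ever polls the queue.
        channel.basicConsume("tasks", false, onDelivery, consumerTag -> { });
    }

    private static void process(String task) {
        System.out.println("working on: " + task);
    }
}
```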
In regards to AMQP, I would say a really cool feature is the "exchange", and the ability for it to route to other exchanges. This gives you more flexibility and enables you to create a wide array of elaborate routing topologies, which can come in very handy when scaling. For a good example, see: http://blog.springsource.org/2011/04/01/routing-topologies-for-performance-and-scalability-with-rabbitmq/
Finally, in regards to Redis, yes, it can be used as a message broker, and can do well. However, RabbitMQ has more message-queuing features than Redis, as RabbitMQ was built from the ground up to be a full-featured, enterprise-level dedicated message queue. Redis, on the other hand, was primarily created to be an in-memory key-value store (though it does much more than that now; it's even referred to as a Swiss Army knife). Still, I've read/heard of many people achieving good results with Redis for smaller-sized projects, but haven't heard much about it in larger applications.
Here is an example of Redis being used in a long-polling chat implementation: http://eflorenzano.com/blog/2011/02/16/technology-behind-convore/
PostgreSQL 9.5
PostgreSQL 9.5 incorporates SELECT ... FOR UPDATE ... SKIP LOCKED. This makes implementing working queuing systems a lot simpler and easier. You may no longer require an external queueing system since it's now simple to fetch 'n' rows that no other session has locked, and keep them locked until you commit confirmation that the work is done. It even works with two-phase transactions for when external co-ordination is required.
External queueing systems remain useful, providing canned functionality, proven performance, integration with other systems, options for horizontal scaling and federation, etc. Nonetheless, for simple cases you don't really need them anymore.
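For example, a worker can claim a batch of pending rows with SKIP LOCKED so that concurrent workers never block each other. This JDBC sketch assumes a hypothetical tasks(id, status, payload) table and placeholder credentials:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SkipLockedWorker {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/app", "app", "secret")) {
            conn.setAutoCommit(false);

            // Claim up to 10 pending tasks that no other session has locked.
            String claim = "SELECT id, payload FROM tasks "
                         + "WHERE status = 'pending' "
                         + "ORDER BY id LIMIT 10 "
                         + "FOR UPDATE SKIP LOCKED";
            try (PreparedStatement ps = conn.prepareStatement(claim);
                 ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    doWork(rs.getString("payload"));
                    try (PreparedStatement done = conn.prepareStatement(
                            "UPDATE tasks SET status = 'done' WHERE id = ?")) {
                        done.setLong(1, rs.getLong("id"));
                        done.executeUpdate();
                    }
                }
            }
            // Rows stay locked until this commit; a crash before commit releases them.
            conn.commit();
        }
    }

    private static void doWork(String payload) {
        System.out.println("processing " + payload);
    }
}
```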
Older versions
You don't need such tools, but using one may make life easier. Doing queueing in the database looks easy, but you'll discover in practice that high performance, reliable concurrent queuing is really hard to do right in a relational database.
That's why tools like PGQ exist.
You can get rid of polling in PostgreSQL by using LISTEN and NOTIFY, but that won't solve the problem of reliably handing out entries off the top of the queue to exactly one consumer while preserving highly concurrent operation and not blocking inserts. All the simple and obvious solutions you think will solve that problem actually don't in the real world, and tend to degenerate into less efficient versions of single-worker queue fetching.
If you don't need highly concurrent multi-worker queue fetches then using a single queue table in PostgreSQL is entirely reasonable.
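As a sketch of how LISTEN/NOTIFY removes the polling loop (the channel name and connection details are assumptions; this uses the pgjdbc driver's PGConnection extension): producers issue NOTIFY after inserting a task, and a worker blocks for the notification instead of repeatedly querying the table. As noted above, this only removes polling; it does not by itself hand each task to exactly one worker.

```java
import org.postgresql.PGConnection;
import org.postgresql.PGNotification;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class QueueListener {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/app", "app", "secret");
        try (Statement st = conn.createStatement()) {
            st.execute("LISTEN new_task"); // producers run: NOTIFY new_task
        }
        PGConnection pg = conn.unwrap(PGConnection.class);
        while (true) {
            // Block up to 10s for a notification instead of polling the table.
            PGNotification[] notes = pg.getNotifications(10_000);
            if (notes != null) {
                fetchAndProcessDueTasks(conn); // still needs a safe hand-out scheme
            }
        }
    }

    private static void fetchAndProcessDueTasks(Connection conn) {
        // Claim work here, e.g. with advisory locks or (on 9.5+) SKIP LOCKED.
    }
}
```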

Wait for transactional replication in ADO.NET or TSQL

My web app uses ADO.NET against SQL Server 2008. Database writes happen against a primary (publisher) database, but reads are load balanced across the primary and a secondary (subscriber) database. We use SQL Server's built-in transactional replication to keep the secondary up-to-date. Most of the time, the couple of seconds of latency is not a problem.
However, I do have a case where I'd like to block until the transaction is committed at the secondary site. Blocking for a few seconds is OK, but returning a stale page to the user is not. Is there any way in ADO.NET or TSQL to specify that I want to wait for the replication to complete? Or can I, from the publisher, check the replication status of the transaction without manually connecting to the secondary server?
[edit]
99.9% of the time, the data in the subscriber is "fresh enough". But there is one operation that invalidates it. I can't read from the publisher every time on the off chance that it's become invalid. If I can't solve this problem under transactional replication, can you suggest an alternate architecture?
There's no such solution for SQL Server, but here's how I've worked around it in other environments.
Use three separate connection strings in your application, and choose the right one based on the needs of your query:
Realtime - Points directly at the one master server. All writes go to this connection string, and only the most mission-critical reads go here.
Near-Realtime - Points at a load balanced pool of subscribers. No writes go here, only reads. Used for the vast majority of OLTP reads.
Delayed Reporting - In your environment right now, it's going to point to the same load-balanced pool of subscribers, but down the road you can use a technology like log shipping to have a pool of servers 8-24 hours behind. These scale out really well, but the data's far behind. It's great for reporting, search, long-term history, and other non-realtime needs.
If you design your app to use those 3 connection strings from the start, scaling is a lot easier, especially in the case you're experiencing.
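A minimal sketch of that three-connection-string idea (the names and connection strings here are placeholders, and the question's app uses ADO.NET, so treat this purely as an illustration of the routing pattern): each query declares the freshness tier it needs, and the data layer picks the connection accordingly.

```java
import java.util.EnumMap;
import java.util.Map;

public class ConnectionRouter {
    enum Freshness { REALTIME, NEAR_REALTIME, DELAYED_REPORTING }

    private final Map<Freshness, String> connectionStrings = new EnumMap<>(Freshness.class);

    public ConnectionRouter() {
        // Placeholder connection strings; in the question's setup these would be ADO.NET strings.
        connectionStrings.put(Freshness.REALTIME, "Server=publisher;Database=app;...");
        connectionStrings.put(Freshness.NEAR_REALTIME, "Server=subscriber-pool;Database=app;...");
        // Points at the same subscriber pool today; can move to log-shipped reporting servers later.
        connectionStrings.put(Freshness.DELAYED_REPORTING, "Server=subscriber-pool;Database=app;...");
    }

    /** Writes and must-be-fresh reads go to the publisher; everything else is load balanced. */
    public String connectionFor(Freshness freshness) {
        return connectionStrings.get(freshness);
    }
}
```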
You are describing a synchronous mirroring situation. Replication cannot, by definition, support your requirement. Replication must wait for a transaction to commit before reading it from the log and delivering it to the distributor, and from there to the subscriber, which means there is always a window of opportunity for data to be out of sync.
If you have a requirement that an operation read the authoritative copy of the data, then you should make that decision in the client and ensure you read from the publisher in that case.
While you can, in theory, validate whether a certain transaction was distributed to the subscriber or not, you should not base your design on it. Transactional replication makes no latency guarantee, by design, so you cannot rely on a 'perfect day' operation mode.