Why do we need message brokers like RabbitMQ over a database like PostgreSQL? - postgresql

I am new to message brokers like RabbitMQ which we can use to create tasks / message queues for a scheduling system like Celery.
Now, here is the question:
I can create a table in PostgreSQL which can be appended with new tasks and consumed by the consumer program like Celery.
Why on earth would I want to setup a whole new tech for this like RabbitMQ?
Now, I believe scaling cannot be the answer since our database like PostgreSQL can work in a distributed environment.
I googled which problems a database poses for this use case, and I found:
polling keeps the database busy and low performing
locking of the table -> again low performing
millions of rows of tasks -> again, polling is low performing
Now, how does RabbitMQ or any other message broker like it solve these problems?
Also, I found out that it follows the AMQP protocol. What's so great about that?
Can Redis also be used as a message broker? I find it more analogous to Memcached than RabbitMQ.
Please shed some light on this!

Rabbit's queues reside in memory and will therefore be much faster than implementing this in a database. A (good) dedicated message queue should also provide essential queuing-related features such as throttling/flow control and the ability to choose different routing algorithms, to name a couple (Rabbit provides these and more). Depending on the size of your project, you may also want the message-passing component separate from your database, so that if one component experiences heavy load, it need not hinder the other's operation.
As for the problems you mentioned:
polling keeping the database busy and low performing: Using RabbitMQ, producers can push updates to consumers, which is far more performant than polling. Data is simply sent to the consumer when it needs to be, eliminating the need for wasteful checks.
locking of the table -> again low performing: There is no table to lock :P
millions of rows of tasks -> again, polling is low performing: As mentioned above, RabbitMQ will operate faster as it resides in RAM, and it provides flow control. If needed, it can also use the disk to temporarily store messages if it runs out of RAM. After 2.0, Rabbit has significantly improved its RAM usage. Clustering options are also available.
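To make the push-versus-poll point concrete, here is a minimal in-process sketch. It uses Python's standard `queue.Queue` as a stand-in for a broker queue (an assumption for illustration only; it is not RabbitMQ's API). The consumer blocks until a message is handed to it, so there is no loop repeatedly checking a table.

```python
import queue
import threading


def consumer(q, results):
    """Handle messages as they are delivered; stop on the None sentinel."""
    while True:
        msg = q.get()  # blocks until a message arrives: no polling loop
        if msg is None:
            break
        results.append(msg.upper())


def run_demo(messages):
    """Push messages to a single consumer thread and return what it produced."""
    q = queue.Queue()
    results = []
    worker = threading.Thread(target=consumer, args=(q, results))
    worker.start()
    for m in messages:
        q.put(m)  # the producer hands the message over; the consumer wakes up
    q.put(None)   # tell the consumer to stop
    worker.join()
    return results
```

Calling `run_demo(["a", "b"])` pushes two messages through the queue. A real broker adds acknowledgements, persistence and routing on top of this basic idea.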
In regards to AMQP, I would say a really cool feature is the "exchange", and the ability for it to route to other exchanges. This gives you more flexibility and enables you to create a wide array of elaborate routing topologies, which can come in very handy when scaling. For a good example, see:
(source: springsource.com)
and: http://blog.springsource.org/2011/04/01/routing-topologies-for-performance-and-scalability-with-rabbitmq/
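As a rough illustration of what an exchange does, the sketch below models a topic exchange in plain Python. The class and function names are hypothetical and this is not RabbitMQ's client API; it only mimics the AMQP topic matching rules, where `*` matches exactly one dot-separated word and `#` matches zero or more words.

```python
def topic_matches(pattern, routing_key):
    """Return True if an AMQP-style topic pattern matches the routing key."""
    p = pattern.split(".")
    k = routing_key.split(".")

    def match(i, j):
        if i == len(p):
            return j == len(k)
        if p[i] == "#":
            # '#' may swallow zero or more words
            return any(match(i + 1, jj) for jj in range(j, len(k) + 1))
        if j == len(k):
            return False
        if p[i] == "*" or p[i] == k[j]:
            return match(i + 1, j + 1)
        return False

    return match(0, 0)


class TopicExchange:
    """Toy exchange: queues are plain lists bound with a topic pattern."""

    def __init__(self):
        self.bindings = []  # list of (pattern, queue) pairs

    def bind(self, pattern, q):
        self.bindings.append((pattern, q))

    def publish(self, routing_key, message):
        """Copy the message to every matching queue, at most once per queue."""
        delivered = set()
        for pattern, q in self.bindings:
            if id(q) not in delivered and topic_matches(pattern, routing_key):
                q.append(message)
                delivered.add(id(q))
        return len(delivered)
```

With this model, a queue bound with `logs.*` receives `logs.error` but not `logs.error.db`, while a queue bound with `logs.#` receives both. The producer only names a routing key and never needs to know which queues exist.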
Finally, in regards to Redis, yes, it can be used as a message broker, and can do well. However, RabbitMQ has more message queuing features than Redis, as RabbitMQ was built from the ground up to be a full-featured, enterprise-level dedicated message queue. Redis, on the other hand, was primarily created to be an in-memory key-value store (though it does much more than that now; it's even referred to as a Swiss army knife). Still, I've read/heard of many people achieving good results with Redis for smaller projects, but haven't heard much about it in larger applications.
Here is an example of Redis being used in a long-polling chat implementation: http://eflorenzano.com/blog/2011/02/16/technology-behind-convore/

PostgreSQL 9.5
PostgreSQL 9.5 incorporates SELECT ... FOR UPDATE ... SKIP LOCKED. This makes implementing working queuing systems a lot simpler and easier. You may no longer require an external queueing system since it's now simple to fetch 'n' rows that no other session has locked, and keep them locked until you commit confirmation that the work is done. It even works with two-phase transactions for when external co-ordination is required.
External queueing systems remain useful, providing canned functionality, proven performance, integration with other systems, options for horizontal scaling and federation, etc. Nonetheless, for simple cases you don't really need them anymore.
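A minimal sketch of a worker built on this feature is shown below. The `tasks` table, its columns and the function name are hypothetical, and the `%s` placeholders assume a DB-API driver such as psycopg2; only the `FOR UPDATE SKIP LOCKED` clause itself comes from PostgreSQL 9.5.

```python
# Hypothetical schema, for illustration only:
#   CREATE TABLE tasks (id bigserial PRIMARY KEY, payload text, done boolean DEFAULT false);

CLAIM_SQL = """
    SELECT id, payload
    FROM tasks
    WHERE NOT done
    ORDER BY id
    LIMIT %s
    FOR UPDATE SKIP LOCKED
"""
MARK_DONE_SQL = "UPDATE tasks SET done = true WHERE id = %s"


def process_batch(conn, handler, batch_size=10):
    """Claim up to batch_size rows no other session has locked, handle them, commit.

    conn is assumed to be a DB-API connection to PostgreSQL 9.5 or later.
    """
    cur = conn.cursor()
    try:
        cur.execute(CLAIM_SQL, (batch_size,))
        rows = cur.fetchall()
        for task_id, payload in rows:
            handler(payload)
            cur.execute(MARK_DONE_SQL, (task_id,))
        conn.commit()  # releases the row locks and records the work as done
        return len(rows)
    except Exception:
        conn.rollback()  # claimed rows become available to other workers again
        raise
    finally:
        cur.close()
```

Several workers can run `process_batch` concurrently: each one skips rows that another session currently holds locked instead of waiting on them, and a worker that fails rolls back so its rows can be picked up again.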
Older versions
You don't need such tools, but using one may make life easier. Doing queueing in the database looks easy, but you'll discover in practice that high performance, reliable concurrent queuing is really hard to do right in a relational database.
That's why tools like PGQ exist.
You can get rid of polling in PostgreSQL by using LISTEN and NOTIFY, but that won't solve the problem of reliably handing out entries off the top of the queue to exactly one consumer while preserving highly concurrent operation and not blocking inserts. All the simple and obvious solutions you think will solve that problem actually don't in the real world, and tend to degenerate into less efficient versions of single-worker queue fetching.
If you don't need highly concurrent multi-worker queue fetches then using a single queue table in PostgreSQL is entirely reasonable.

Related

How to Partition a Queue in a distributed system

This problem occurred to me a while ago; unfortunately, I could not find the answer I was looking for on the web. Here is the problem statement:
Consider a simple producer-consumer environment where we only have one
producer writing to a queue and one consumer reading from it. Now
since the objects written on the queue are quite large in size and our
available resources are not much on our current machine, we decided to
implement a distributed queue system where the data inside the queue
is partitioned among multiple nodes. It is important to us that the
total ordering is conserved while pushing and popping the data,
meaning that from the point of a user this distributed queue acts just
like a single unified queue.
Before giving a solution to this problem, we have to ask whether high availability or partition tolerance is more important to us. I believe both versions offer interesting challenges, and I thought that such a question must surely have been raised before. However, after searching for existing solutions, I could not find a complete and well-thought-out answer from an algorithmic or scientific point of view. Most of what I found were engineering and high-level approaches, leveraging tools like Kafka, RabbitMQ, Redis, etc.
So the problem remains and I would be thankful if you could share with me your designs, algorithms and thoughts on this problem or point me to some scientific journal or article etc that has already tackled such a problem.
This can be one of the ways in which the above can be achieved. Here the partitioning is achieved in the round-robin fashion.
To achieve high availability, you can have partition replicas.
Pros:-
By adding replicas system becomes highly available.
Multi-consumer groups can be implemented
Cons:-
The route table becomes a single point of failure; redundancy can be achieved by storing it in DynamoDB and using consistent reads.
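The round-robin idea can be sketched in a few lines. This is a single-process model with hypothetical names, where each partition is a local `deque` standing in for a remote node and the two counters play the role of the route table; replication and failure handling are left out.

```python
from collections import deque


class RoundRobinQueue:
    """FIFO queue whose items are spread over several partitions in turn."""

    def __init__(self, num_partitions):
        self.partitions = [deque() for _ in range(num_partitions)]
        self.pushed = 0  # total items ever pushed (route table: write position)
        self.popped = 0  # total items ever popped (route table: read position)

    def push(self, item):
        # The k-th item always goes to partition k mod N.
        self.partitions[self.pushed % len(self.partitions)].append(item)
        self.pushed += 1

    def pop(self):
        if self.popped == self.pushed:
            raise IndexError("pop from empty queue")
        # The k-th pop reads from partition k mod N, so global order is kept.
        item = self.partitions[self.popped % len(self.partitions)].popleft()
        self.popped += 1
        return item
```

Because the producer and the consumer visit the partitions in the same cyclic order, and each partition is itself FIFO, items come out in exactly the order they went in. The two counters are the only shared state, which is why their storage needs to be redundant and consistently readable.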

Real-time processing: Storm / flink vs standard application (java, c#...)

I am wondering about the choice of implementing an application processing events coming from Kafka, I have in mind two architecture patterns:
an application developed using the Apache Storm or Apache Flink framework that would process events consumed from Kafka
a Java application (or python, C#...), deployed X times (scalable depending on traffic), which would process events coming from Kafka
I find it difficult to see which of the scenarios is the most interesting.
Could someone help me with this topic?
It's hard to give some definitive advice with so little information available. So I leave my response vague until you provide more specific information:
Choosing a processing framework over a native implementation gives you the following advantages:
Parallel processing with (in theory) infinite scalability: If you ever expect that you cannot process all events in a single thread in a timely manner, you first need to scale up (more threads) and eventually scale out (more machines). A framework takes care of all synchronization between threads and machines, so you just need to write sequential code glued together with some high-level primitives (similar to LINQ in C#).
Fault tolerance: What happens when your code screws up (some edge case not implemented)? When you run out of resources? When the network (to Kinesis or other machines) temporarily breaks? A framework takes care of all these nasty little details.
In case of failure, when you restart the application, most frameworks give you some form of exactly-once processing: How do you avoid losing data? How do you avoid duplicates when reprocessing old data?
Managed state: If your application needs to remember things for a certain time (calculating sums/average or joining data), how do you ensure that the state is kept in sync with data in case of failure?
Advanced features: time triggers, complex event processing (=pattern matching on events), writing to different sinks (Kafka for low latency, s3 for batch processing)
Flexibility of storage: if you want to try out a different storage system, it's much easier to change the source/sink in an application written in a framework.
Integration in deployment platforms: If you want to scale to several machines, it's usually much easier to scale a platform that already offers related integration (at the time of writing that should be mostly Kubernetes). But all frameworks also support simple local setups where you just scale-up on one (bigger) machine.
Low-level optimizations: When using new engines with higher abstractions, it's possible that the frameworks generate code that is much more efficient than what you can implement yourself (with specific memory layout or serialized data processing).
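To give a feel for what the "exactly once" and "managed state" points involve when done by hand, here is a small sketch of checkpointing. It assumes events are numbers in a list and the state is a running total; the file layout and function names are made up for illustration. The key idea is that the read position and the state are saved together in one atomic step, so a restart neither skips nor double-counts an event.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the last saved (offset, total), or a fresh one if none exists."""
    if not os.path.exists(path):
        return {"offset": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def save_checkpoint(path, checkpoint):
    """Write offset and state together, then atomically swap the file in."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(checkpoint, f)
    os.replace(tmp, path)  # atomic rename: readers see old or new, never half


def run(events, path, crash_at=None):
    """Sum events starting from the checkpoint; optionally simulate a crash."""
    cp = load_checkpoint(path)
    for offset in range(cp["offset"], len(events)):
        if crash_at is not None and offset == crash_at:
            raise RuntimeError("simulated crash")
        cp = {"offset": offset + 1, "total": cp["total"] + events[offset]}
        save_checkpoint(path, cp)
    return cp["total"]
```

This only covers state kept inside the application. As soon as results are written to an external sink, or the work is spread over several machines, the bookkeeping grows quickly, and that is the part a framework takes off your hands.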
The big downsides are usually:
Complexity of the framework: you need to understand how the framework works from a user's perspective. However, you usually save time by not going into the details of writing a custom consumer/producer, so it's not as bad as it initially seems.
Flexibility in code: you cannot write arbitrary code anymore. Since the framework handles parallelism for you, you need to think in terms of chunks of data and adjust your algorithms accordingly. Standard SQL operations are usually directly supported though in one form or another.
Less control over resource usage: since the platform schedules the tasks across machines, you may end up with unfortunate assignments, and the platform may give you too few options to fix them. Note, though, that most applications suffer more from bad resource utilization caused by data skew and suboptimal algorithms.

Redis Streams for implementing a Messaging System (chat) app versus traditional approaches

I'm implementing a chat app, which will support both one-on-one conversation and Group conversations.
So far the direction was to use Redis Pub/Sub with PostgreSQL as the cold storage, and WebSocket being the transport.
Every user will fetch the history from postgresql upon launch (up until the timestamp of the WebSocket+redis connection), and then subscribe to channels that go by their own user_id.
However, having a round trip to a DBMS with each new message sounds a bit strange, while definitely doable and legit.
So I decided to examine other approaches. One possible approach was to use Kafka and eliminate the need for a DBMS altogether.
It sounds viable and comes with its own set of advantages.
But turns out there's a new kid on the block - Redis Streams.
From what I gather, it is actually quite similar to Kafka in this specific scenario (chat).
It has many nice features that sound very convenient for implementing a chat system.
And now I am trying to understand whether Streams + disk persistency is the wise way to go versus Kafka versus PostgreSQL+Redis pub/sub
The main aspects in consideration are:
Performance. Postgres and Kafka both operate on disk, meaning they are slower than in-memory operations in the case of Redis. On the other hand, the messages obviously must be persisted and available at all times, so Redis will be persisted to disk. Wouldn't that negate the whole in-memory performance gain?
And even if not, would the performance gain under peak load and with a large database be noticeable?
Memory / Costs. With redis these two are closely tied together. As a small startup, the efforts are focused on being ready to cope with sudden scale peaks (up to a million users), but at the same time - the costs should be minimized.
Is storing millions of messages in Streams going to be too memory-costly which in turn will translate to financially-costly?
Recovery, Reliability & Availability, Persistency. With Postgres, even a single instance can handle a big traffic load, but it can also offer master-slave setups and consistency. Can Redis match that? Also, with a DBMS I can be assured that the data is there to stay. Can I know that with Redis?
Scaling.

Questions about using Apache Kafka Streams to implement event sourcing microservices

Event sourcing means a 180 degree shift in the way many of us have been architecting and developing web applications, with lots of advantages but also many challenges.
Apache Kafka is an awesome platform that, through its Apache Kafka Streams API, is advertised as a tool that allows us to implement this paradigm through its many features (decoupling, fault tolerance, scalability...): https://www.confluent.io/blog/event-sourcing-cqrs-stream-processing-apache-kafka-whats-connection/
On the other hand there are some articles discouraging us from using it for event sourcing: https://medium.com/serialized-io/apache-kafka-is-not-for-event-sourcing-81735c3cf5c
These are my questions regarding Kafka Streams' suitability as an event sourcing platform:
The article above comes from Jesper Hammarbäck (who works for serialized.io, an event sourcing platform). I would like to get an answer to the main problems he brings up:
Loading current state. In my view with log compaction and state stores it's not a problem. Am I right?
Consistent writes.
When moving certain pieces of functionality into Kafka Streams I'm not sure if they do fit naturally:
Authentication & Security: Imagine your customers are stored in a state store generated from a customer-topic. Should we keep their passwords in the topic/store? It doesn't sound safe enough, does it? Then how are we supposed to manage this aspect of having customers on a state store and their passwords somewhere else? Any recommended good practice?
Queries: Interactive queries are a nice tool to generate queryable views of our data (by key). That's ok to get an entity by id, but what about complex queries (joins)? Do we need to generate state stores per query? For instance, one store for customers by id, another one for customers by state, another store for customers who purchased a product last year... It doesn't sound manageable. Another point is the lack of pagination: how can we handle big sets of data when querying the state stores? One more point: we can't do dynamic queries (like JPA criteria API) anymore. This leads to CQRS maybe? Complexity keeps growing this way...
Data growth: with databases we are used to having thousands and thousands of rows per table. Kafka Streams applications keep a local state store that will grow and grow over time. How scalable is that? How is that local storage kept (local disk/RAM)? If it's disk, we should provision applications with enough space; if it's RAM, enough memory.
Loading Current State: The mechanism described in the blog, re-creating the current state ad hoc for a single entity, would indeed be costly with Kafka. However, Kafka Streams follows the philosophy of keeping the current state for all objects in a KTable (that is distributed/sharded). Thus, it's never required to do this -- of course, it comes with certain memory costs.
Kafka Streams parallelizes based on different events. Thus, all interactions for a single event (processing, state updates) are performed by a single thread. Thus, I don't see why there should be inconsistent writes.
I am not sure what the exact requirement would be. In the current implementation, Kafka Streams does not offer any store-specific authentication or security features. There are several things one could do for security though: (a) encrypt the local disk: this might be the simplest thing to do to protect data. (b) encrypt messages within the business logic, before you put them into the store.
Interactive Queries offers limited support for many reasons (don't want to go into details) and it was never designed with the goal of supporting complex queries. The idea is eager computation of results that can be retrieved with simple lookups. As you pointed out, this is not very scalable (cost intensive) if you have a lot of different queries. To tackle this, it would make sense to load the data into a database, and let the DB do what it is built for. Kafka Streams alone is not the right tool for this atm -- however, there is no reason not to combine both.
By default, Kafka Streams uses RocksDB to keep local state (you can switch to in-memory stores, too). Thus, it's possible to write to disk and to use very large state. Of course, you need to provision your instances accordingly (cf: https://docs.confluent.io/current/streams/sizing.html). Besides this, Kafka Streams scales horizontally and is fully elastic. Thus, you can add new instances at any point in time, allowing you to hold terabytes of state if you have large disks and enough instances. Note that the number of input topic partitions limits the number of instances you can use (internally, Kafka Streams is a consumer group, and you cannot have more instances than partitions). If this is a concern, it's recommended to over-partition the input topics in the first place.
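The partition limit mentioned above can be illustrated with a toy assignment function. This is a simplified round-robin model with hypothetical names, not Kafka's actual assignor, but it shows why instances beyond the partition count end up with nothing to do.

```python
def assign_partitions(num_partitions, instances):
    """Spread partition numbers over instance names in round-robin order."""
    assignment = {name: [] for name in instances}
    for partition in range(num_partitions):
        owner = instances[partition % len(instances)]
        assignment[owner].append(partition)
    return assignment
```

With 2 partitions and 3 instances, the third instance gets an empty list and sits idle. With 6 partitions, the same 3 instances each own two, and there is room to grow to 6 instances later, which is the point of over-partitioning up front.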

Can a shared ready queue limit the scalability of a multiprocessor system?

Can a shared ready queue limit the scalability of a multiprocessor system?
Simply put, most definitely. Read on for some discussion.
Tuning a service is an art-form or requires benchmarking (and the space for the amount of concepts you need to benchmark is huge). I believe that it depends on factors such as the following (this is not exhaustive).
how much time an item picked up from the ready queue takes to process,
how many worker threads there are,
how many producers there are, and how often they produce, and
what kind of wait primitives you are using: spin-locks or kernel waits (the latter being slower).
So, if items are produced often, the number of threads is large, and the processing time is low, the data structure could be locked for large windows, thus causing thrashing.
Other factors may include the data structure used and how long it is locked for. E.g., if you use a linked list to manage such a queue, the add and remove operations take constant time. A priority queue (heap) takes a few more operations on average when items are added.
If your system is for business processing you could take this question out of the picture by just using:
A process-based architecture, just spawning multiple producer/consumer processes and using the file system for communication, or
A non-preemptive, cooperative threading programming language such as Stackless Python, Lua or Erlang.
Also note: synchronization primitives cause inter-processor cache-coherence traffic floods, which are not good, and therefore should be used sparingly.
The discussion could go on to fill a Ph.D dissertation :D
A per-CPU ready queue is a natural choice for the data structure. This is because most operating systems try to keep a process on the same CPU, for many reasons that you can look up. What does that imply? If a thread is ready and another CPU is idling, the OS will not quickly migrate the thread to that CPU. Load balancing kicks in only in the long run.
Had the situation been different, that is, had it not been a design goal to keep thread-CPU affinities and thread migration were frequent instead, then keeping separate per-CPU run queues would be costly.
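The per-CPU layout can be modelled roughly as follows. This is a single-threaded sketch with hypothetical names: a real scheduler would guard each queue with its own lock and use a far more refined balancing policy. It shows the two properties described above: a CPU only looks at its own queue when picking work, and tasks move between CPUs only during an occasional balancing pass.

```python
from collections import deque


class PerCpuRunQueues:
    """One ready queue per CPU, with explicit periodic load balancing."""

    def __init__(self, num_cpus):
        self.queues = [deque() for _ in range(num_cpus)]

    def enqueue(self, cpu, task):
        self.queues[cpu].append(task)

    def pick_next(self, cpu):
        """Take the next task from this CPU's own queue only (keeps affinity)."""
        if self.queues[cpu]:
            return self.queues[cpu].popleft()
        return None  # stays idle until the next balancing pass

    def balance(self):
        """Move tasks from the longest to the shortest queue until nearly even."""
        while True:
            busiest = max(self.queues, key=len)
            idlest = min(self.queues, key=len)
            if len(busiest) - len(idlest) <= 1:
                return
            idlest.append(busiest.pop())  # migrate the most recently queued task
```

In the common case, `pick_next` touches only one queue, so CPUs do not contend with each other the way they would on a single shared queue. The price is that a CPU can sit idle while another has a backlog until `balance` runs.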