How do I make sure I can keep the ingestion client running if I have heavy read operations on a table?

I am using InfluxDB line protocol to insert records into a table in QuestDB at a constant and high rate. I have multiple Postgres clients attached performing read operations; some are Grafana dashboards which do heavy aggregations across the table. It looks like when I refresh the dashboards, I'm hitting some issues:
... t.LineTcpConnectionContext [31] queue full, consider increasing queue size or number of writer jobs
Is there a way to make sure I don't kick the insert client out, or to increase the queue size as mentioned in the error?

If you have one client writing InfluxDB line protocol over TCP, it's possible to have a dedicated worker thread for this purpose. The config key for this is line.tcp.worker.count, which can be set in a configuration file or via an environment variable. Setting one dedicated thread in server.conf would look like the following:
line.tcp.worker.count=1
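The same value can also be supplied as an environment variable; QuestDB maps config keys to variables by upper-casing them, replacing dots with underscores, and adding a QDB_ prefix (worth verifying against the QuestDB docs for your version):

QDB_LINE_TCP_WORKER_COUNT=1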

Related

Handle replication lag in data pipeline

Given a pipeline of tasks that run in sequence: each task consumes data from the database, manipulates it, and writes its output back to the same database.
We are using AWS RDS Aurora, and in order to spread the load, the “reading phase” of each task is done against the read replica.
In some cases of high load, we reach a replication lag of 10-15 seconds. This means that by the time the next task consumes data, it gets wrong/missing data points.
We know this is not the “right” way to design such a pipeline, and it contradicts the idiom “Do not communicate by sharing memory; instead, share memory by communicating.”
Since it’s too much effort to change the design now, we came up with an alternative solution:
Create a service that checks the replication lag and exposes it to all tasks. If the value is greater than x, the task will fall back to reading from the RDS master node.
This is not optimal, and I would like to hear other solutions to bypass this issue.
It is worth mentioning that we are using Celery (& Python) to construct this workflow, and each task is unaware of the tasks that ran previously.
There will always be data which is inserted into the database but not yet visible, either because it wasn't committed yet, it was committed after your snapshot was started, or due to replication lag. The only real solution is to make your tasks robust to this inevitability.
Create a service that checks the replication lag and exposes it to all tasks. If the value is greater than x, the task will fall back to reading from the RDS master node.
You want to shed load from the master until the first sign of trouble, then you want to suddenly dump all the load back onto it?
Create a service that checks the replication lag and exposes it to all tasks. If the value is greater than x, the task will fall back to reading from the RDS master node.
Depending on the cause of your replication lag this might make things worse due to further increasing the load on the master node.
If your pipeline allows it, you could wait in Task A, after the write, until the data has propagated to the read replica.
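For instance, a minimal sketch of that wait in Python, assuming psycopg2; the DSN, table, and column names are placeholders, not from the question:

import time
import psycopg2

REPLICA_DSN = "host=replica.example.com dbname=app user=app"  # placeholder

def wait_for_replica(marker_id, timeout=30.0, poll_interval=0.5):
    """Poll the read replica until the row written by the previous task is visible."""
    deadline = time.monotonic() + timeout
    with psycopg2.connect(REPLICA_DSN) as conn:
        with conn.cursor() as cur:
            while time.monotonic() < deadline:
                cur.execute("SELECT 1 FROM pipeline_output WHERE id = %s", (marker_id,))
                if cur.fetchone() is not None:
                    return True
                time.sleep(poll_interval)
    return False  # timed out; caller can fall back to reading from the master

Task B would call wait_for_replica() with an id Task A is known to have written, before starting its read phase.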

How to keep track of Cassandra write successes while using Kafka in a cluster

When working in my cluster I have the constraint that my frontend cannot display a finished job until all of the job's different results have been added into Cassandra. These results are computed in their individual microservices and sent via Kafka to a Cassandra writer.
My question is whether there are any best practices for letting the frontend know when these writes have completed. Should I make another database entry for results, or is there some other smart way that would scale well?
Each job has about 100 different results written into it, and I have roughly 1,000 jobs/day.
I used Cassandra with Kafka for a UI backend in the past, and we would store a status field in each DB record, which would periodically get updated through a slew of Kafka Streams processors (there were easily more than 1,000 DB writes per day).
The UI itself was running a setInterval(refresh) JS function that would query the latest database state, then update the DOM accordingly.
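For the polling route, a minimal sketch of the check itself, assuming the DataStax Python driver; the contact point, keyspace, and job_results table are illustrative, while the expected count of 100 results per job comes from the question:

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("jobs")  # placeholder contact point/keyspace

def job_is_finished(job_id, expected_results=100):
    # COUNT within a single partition (job_id) is cheap enough at this scale
    row = session.execute(
        "SELECT COUNT(*) AS n FROM job_results WHERE job_id = %s", (job_id,)
    ).one()
    return row.n >= expected_results

The setInterval(refresh) loop would hit an endpoint that calls something like this, or simply reads the status field once the Kafka Streams processors have flipped it.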
Your other option is to push some websocket/SSE data into the UI from some other service that indicates "data is finished".

System architecture - ETL

We are in the process of designing an ETL process, where we’ll be getting a daily account file (maybe half a million records, could grow) from a client, and we’ll be loading that file into our database.
Our current process splits the file into smaller files and loads them into staging. Sometimes, if the process fails, we try to figure out how many records we have processed and then start again from that point. Is there any better alternative to this problem?
We are thinking about using Kafka. I’m pretty new to Kafka, so I would really appreciate some feedback on whether Kafka is the way to go, or whether we’re just over-engineering a simple ETL process where we just load the data into a staging table and finally into the destination table.
Apache Kafka® is a distributed streaming platform. What exactly does that mean?
A streaming platform has three key capabilities:
Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.
Store streams of records in a fault-tolerant durable way.
Process streams of records as they occur.
Kafka is generally used for two broad classes of applications:
Building real-time streaming data pipelines that reliably get data between systems or applications
Building real-time streaming applications that transform or react to the streams of data
https://kafka.apache.org/intro
If you encounter errors which make you check the last committed record in your staging database, and you need the system to manage this automatically, Kafka can help you ease the process.
Though Kafka is built to work with massive data loads spread across a cluster, you can certainly use it for smaller problems and utilize its queuing functionality and offset management, even with one broker (server) and a low number of partitions (level of parallelism).
If you don't anticipate any scale at all, I would suggest you to consider RabbitMQ.
RabbitMQ is a message-queueing software also known as a message broker or queue manager. Simply said: it is software where queues are defined, to which applications connect in order to transfer a message or messages.
https://www.cloudamqp.com/blog/2015-05-18-part1-rabbitmq-for-beginners-what-is-rabbitmq.html
“How to know if Apache Kafka is right for you” by Amit Rathi
https://link.medium.com/enGzNaNvT4
In case you choose Kafka:
When you receive a file, create a process which iterates over its lines and sends them to Kafka (Kafka producer).
Create another process which continuously receives events from Kafka (Kafka consumer) and writes them in mini-batches to the database (similar to your small files), as sketched below.
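A minimal sketch of those two processes with the kafka-python client (the topic name, bootstrap server, and write_batch_to_db loader are illustrative placeholders):

from kafka import KafkaConsumer, KafkaProducer

TOPIC = "account-file"  # placeholder topic name

def produce(path):
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    with open(path, "rb") as f:
        for line in f:
            producer.send(TOPIC, line.rstrip(b"\n"))
    producer.flush()  # make sure every line is on the broker before exiting

def consume(batch_size=1000):
    consumer = KafkaConsumer(TOPIC, bootstrap_servers="localhost:9092",
                             group_id="etl-loader", enable_auto_commit=False)
    batch = []
    for msg in consumer:
        batch.append(msg.value)
        if len(batch) >= batch_size:
            write_batch_to_db(batch)  # placeholder: your mini-batch insert
            consumer.commit()         # commit offsets only after a durable write
            batch.clear()

Committing offsets only after the batch is written is what gives you the "restart from the last processed record" behavior for free.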
Setup Kafka:
https://dzone.com/articles/kafka-setup
Kafka Consumer/Producer simple example:
http://www.stackframelayout.com/programowanie/kafka-simple-producer-consumer-example/
Don't assume importing data is as easy as dumping it in your database and having the computer handle all the processing work. As you've discovered, an automated load can have problems.
First, database ELT processes put wear on the hard drive. Do not stage the data into one table prior to inserting it into its native table. Your process should import the data only once, into its native table, to protect the hardware.
Second, you don't need third-party software to middle-man the work. You need control so you're not manually inspecting what was inserted. This means your process should first clean/transform the data prior to import. You want to prevent all problems before the load by cleaning, structuring, and even processing the data. The load itself should only be an SQL insert script. I have torn apart many T-SQL scripts where someone thought it convenient to integrate processing with database commands. Don't do it.
Here's how I manage imports from spreadsheet reports. Excel formulas are better than learning ETL tools like SSIS. I use cell formulas to validate whether the record is valid to go into our system. This result goes in its own column, and then if that column is true, a concatenation column displays an insert script.
=IF(J1, CONCATENATE("('", A1, "', ", B1, "),"), "")
If the column is false, the concat column shows nothing. This allows me to copy/paste the inserts into SSMS and conduct mass inserts via "insert into table values" scripts.
If this is actually updating existing records, as your comment appears to suggest, then you need to master the data, organizing what's changed in logs for your users.
Synchronization steps:
Log what is there before you update
Download and compare local vs remote copies for differences; you cannot compare the two without a) having them both in the same physical location or b) controlling the other system
Log what you're updating with, and timestamp when you're updating it
Save and close the logs
Only when 1-4 are done should you post an update to production
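A minimal sketch of steps 1 and 3-5 in Python (the fetch/apply_update helpers and the log format are placeholders; step 2's comparison depends on how you pull the remote copy):

import json
import time

def synchronized_update(record_id, new_values, fetch, apply_update, log_path):
    before = fetch(record_id)                         # 1. log what is there
    entry = {"id": record_id, "before": before,
             "after": new_values, "ts": time.time()}  # 3. what + when
    with open(log_path, "a") as f:                    # 4. save the log
        f.write(json.dumps(entry) + "\n")
    apply_update(record_id, new_values)               # 5. only now touch production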
My guide to synchronizing data sources and handling Creates/Updates/Deletes:
sync local files with server files

How do I manage connection pooling to PostgreSQL from sidekiq?

The problem: I have a Rails application that runs a few hundred Sidekiq background processes. They all connect to a PostgreSQL database which is not exactly happy about providing 250 connections - it can, but if all Sidekiq processes accidentally send queries to the DB at once, it crumbles.
Option 1: I have been thinking about adding PgBouncer in front of the DB, however I cannot currently use its transaction pooling mode, since I'm highly dependent upon setting the search_path at the beginning of each job to determine which "country" (PostgreSQL schema) to work on (apartment gem). In this case, I would have to use the session-based connection pooling mode. As far as I know, this would require me to disconnect the connections after each job to release them back into the pool, and that would be really costly performance-wise, wouldn't it? Am I missing out on something?
Option 2: Using application-layer connection pooling is of course also an option, however I'm not really sure how I would do that for PostgreSQL with Sidekiq.
Option 3 something I have not thought of?
Option 1: You're correct, session pooling would require you to drop and reconnect, and that adds overhead. How costly depends on the access pattern, i.e. what fraction the connection/TCP handshake etc. is of the total work done, and what sort of latency you need. Definitely worth benchmarking, but if the connections are short-lived then the overhead will be really noticeable.
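For reference, a minimal pgbouncer.ini sketch of that session-pooling setup (the host, database name, and pool sizes are illustrative):

[databases]
app_production = host=127.0.0.1 port=5432 dbname=app_production

[pgbouncer]
pool_mode = session        ; transaction mode would break the per-job search_path
max_client_conn = 250      ; one slot per Sidekiq connection
default_pool_size = 20     ; actual server connections handed out per database/user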
Option 2/3: You could rate limit or throttle your Sidekiq jobs. There are a few projects tackling this...
Queue limits
Sidekiq Limit Fetch: Restrict number of workers which are able to run specified queues simultaneously. You can pause queues and resize queue distribution dynamically. Also tracks number of active workers per queue. Supports global mode (multiple sidekiq processes). There is an additional blocking queue mode.
Sidekiq Throttler: Sidekiq::Throttler is a middleware for Sidekiq that adds the ability to rate limit job execution on a per-worker basis.
sidekiq-rate-limiter: Redis backed, per worker rate limits for job processing.
Sidekiq::Throttled: Concurrency and threshold throttling.
I got the above from here
https://github.com/mperham/sidekiq/wiki/Related-Projects
If your application must have a connection per process, and you're unable to break it up so that more threads can share a connection, then it's PgBouncer or application-based connection pooling. Connection pooling is in effect going to either throttle or limit your app in some way in order to save the DB.
Sidekiq should only require one connection for each worker thread. If you are setting your concurrency to a reasonable value, say 10-25, I don't think you should be using 250 simultaneous database connections. How many worker processes are you running, and what is their concurrency?
Also, you can see on that page that even if you have a high concurrency setting, you can still create a connection pool shared by the threads within that process.
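As a sketch of that sizing advice (values are illustrative; check the Sidekiq wiki for your version), the idea is one database connection per worker thread:

# config/sidekiq.yml
:concurrency: 10

# config/database.yml
production:
  pool: 10   # at least as large as Sidekiq's concurrency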

Is it possible to use a Cassandra table as a basic queue

Is it possible to use a table in Cassandra as a queue? I don't think the strategy I use in MySQL works, i.e. given this table:
create table message_queue(id integer, message varchar(4000), retries int, sending boolean);
We have a transaction that marks the row as "sending", tries to send, and then either deletes the row, or increments the retries count. The transaction ensures that only one server will be attempting to process an item from the message_queue at any one time.
There is an article on DataStax that describes the pitfalls and how to get around them, however I'm not sure what the impact of having lots of tombstones lying around is - how long do they stay around for?
Don't do this. Cassandra is a terrible choice as a queue backend unless you are very, very careful. You can read more of the reasons in Jonathan Ellis's blog post "Cassandra anti-patterns: Queues and queue-like datasets" (which might be the post you're alluding to). MySQL is also not a great choice for backing a queue; use a real queue product like RabbitMQ, it's great and very easy to use.
The problem with using Cassandra as the storage for a queue is this: every time you delete a message you write a tombstone for that message. Every time you query for the next message Cassandra will have to trawl through those tombstones and deleted messages and try to determine the few that have not been deleted. With any kind of throughput the number of read values versus the number of actual live messages will be hundreds of thousands to one.
Tuning GC grace and other parameters will not help, because they only apply to how long tombstones hang around after a compaction, and even if you dedicated the CPUs to only running compactions you would still have dead-to-live ratios of tens of thousands or more. And even with a GC grace of zero, tombstones will hang around after compactions in some cases.
There are ways to mitigate these effects, and they are outlined in Jonathan's post, but here's a summary (I don't write this to encourage you to use Cassandra as a queue backend, but because it explains a bit more about how Cassandra works, and should help you understand why it's a bad fit for the problem):
To avoid the tombstone problem you cannot keep using the same queue, because it will fill up with tombstones quicker than compactions can get rid of them and your performance will run straight into a brick wall. If you add a column to the primary key that is deterministic and depends on time, you can avoid some of the performance problems, since fewer tombstones have time to build up and Cassandra will be able to completely remove old rows and all their tombstones.
Using a single row per queue also creates a hotspot. A single node will have to handle that queue, and the rest of the nodes will be idle. You might have lots of queues, but chances are that one of them will see much more traffic than the others, and that means you get a hotspot. Shard the queues over multiple nodes by adding a second column to the primary key. It can be a hash of the message (for example crc32(message) % 60 would create 60 shards; don't use too small a number - see the sketch below). When you want to find the next message you read from all of the shards and pick one of the results, ignoring the others. Ideally you find a way to combine this with something that depends on time, so that you fix that problem too while you're at it.
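A tiny sketch of that shard computation in Python (the shard count and names are illustrative):

import zlib

NUM_SHARDS = 60  # the "% 60" example from above; don't use too small a number

def shard_for(message: bytes) -> int:
    # deterministic shard, stored as a second primary key column on write
    return zlib.crc32(message) % NUM_SHARDS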
If you sort your messages by time of arrival (for example with a TIMEUUID clustering key) and can somehow keep track of the newest message that has been delivered, you can do a query to find all messages after that message. That would mean less trawling through tombstones for Cassandra, but it is no panacea.
Then there's the issue of acknowledgements. I'm not sure if they matter to you, but it looks like you have some kind of locking mechanism in your schema (I'm thinking of the retries and sending columns). This will not work. Until Cassandra 2.0 and its compare-and-swap features there is no way to make that work correctly. To implement a lock you need to read the value of the column, check if it's not locked, then write that it should now be locked. Even with consistency level ALL, another application node can do the same operations at the same time, and both end up thinking that they locked the message. With CAS in Cassandra 2.0 it will be possible to do this atomically, but at the cost of performance.
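For illustration, the CAS (lightweight transaction) version of that lock with the DataStax Python driver, reusing the question's schema; the contact point and keyspace are placeholders, and as noted this trades performance for correctness:

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("queues")  # placeholder

def try_claim(message_id):
    # Cassandra 2.0+ compare-and-swap: only one node's UPDATE is applied
    result = session.execute(
        "UPDATE message_queue SET sending = true WHERE id = %s IF sending = false",
        (message_id,),
    )
    return result.was_applied  # True only for the winner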
There are a couple more answers here on Stack Overflow about Cassandra and queues; read them (start with this one: Table with heavy writes and some reads in Cassandra. Primary key searches taking 30 seconds).
The grace period can be configured. By default it is 10 days:
gc_grace_seconds
(Default: 864000 [10 days]) Specifies the time to wait before garbage collecting tombstones (deletion markers). The default value allows a great deal of time for consistency to be achieved prior to deletion. In many deployments this interval can be reduced, and in a single-node cluster it can be safely set to zero. When using CLI, use gc_grace instead of gc_grace_seconds.
Taken from the documentation
On a different note, I do not think that implementing a queue pattern in Cassandra is very useful. To prevent your workers from processing one entry twice, you need to enforce "ALL" read consistency, which defeats the purpose of distributed database systems.
I highly recommend looking at specialized systems like messaging systems which support the queue pattern natively. Take a look at RabbitMQ for instance. You will be up and running in no time.
Theo's answer about not using Cassandra for queues is spot on.
Just wanted to add that we have been using Redis sorted sets for our queues and it has been working pretty well. Some of our queues have tens of millions of elements and are accessed hundreds of times per second.
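For completeness, a small sketch of that Redis sorted-set queue pattern with redis-py (the key name is illustrative; ZPOPMIN needs Redis >= 5.0):

import time
import redis

r = redis.Redis()

def enqueue(message: str):
    r.zadd("job_queue", {message: time.time()})  # score = arrival time

def dequeue():
    popped = r.zpopmin("job_queue")  # atomically pop the oldest element
    return popped[0][0] if popped else None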