Postgres Replication Slots: Checking Lag

I'm trying to detect whether the three logical replication slots on my AWS RDS Aurora Postgres 11.9 instance are backing up. I'm using the wal2json plugin to read from them continuously. Two of the slots are consumed by Python processes; the third is consumed by a Kafka Connect consumer.
I'm using the query below, but I'm getting odd results. It says two of my slots are several GB behind, even in the middle of the night when we have very little load. Am I misinterpreting what the query is saying?
SELECT redo_lsn, slot_name, restart_lsn,
       round((redo_lsn - restart_lsn) / 1024 / 1024 / 1024, 2) AS GB_behind
FROM pg_control_checkpoint(), pg_replication_slots;
Things I've checked:
I've checked that the consumers are still running.
I have also looked at the logs, and rows are coming off the database within 0-2 seconds of being inserted, so it doesn't appear that I'm lagging behind.
I've performed an end-to-end test and the data is making it through my pipeline in a few seconds, so it is definitely consuming data relatively fast.
Both of the slots I'm using for my Python processes show the same value for GB_behind, currently 12.40, even though the two slots are on different logical databases with dramatically different load (one has ~1000x higher load).
The third replication slot is read by a different program (Kafka Connect). It shows 0 GB_behind.
There is just no way, even at peak load, that my workloads could generate 12.4 GB of data in a few seconds (not even in a few minutes). Am I misinterpreting something? Is there a better way to check how far a replication slot is behind?
Thanks much!
Here is a small snippet of my code (Python 3.6) in case it helps; I've been using it for a while and data has been flowing:
def consume(msg):
    print(msg.payload)
    try:
        kinesis_client.put_record(StreamName=STREAM_NAME, Data=msg.payload, PartitionKey=partition_key)
    except:
        logger.exception('PG ETL: Failed to send load to kinesis. Likely too large.')

with con.cursor() as cur:
    cur.start_replication(slot_name=replication_slot, options={'pretty-print': 1}, decode=True)
    cur.consume_stream(consume)

It turned out I wasn't calling send_feedback in my consume function. I was consuming the records, but I was never telling the Postgres replication slot that I had consumed them, so the slot's position never advanced.
Here is my complete consume function in case others are interested:
def consume(msg):
    print(msg.payload)
    try:
        kinesis_client.put_record(StreamName=STREAM_NAME, Data=msg.payload, PartitionKey=partition_key)
    except:
        logger.exception('PG ETL: Failed to send load to kinesis. Likely too large.')
    # Acknowledge the message so the slot's flush position advances.
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

with con.cursor() as cur:
    cur.start_replication(slot_name=replication_slot, options={'pretty-print': 1}, decode=True)
    cur.consume_stream(consume)
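Since the underlying issue was that the slot's confirmed position never advanced, a more direct way to monitor lag going forward is to compare each slot's confirmed_flush_lsn (which moves when the consumer sends feedback) against the current WAL position. A minimal sketch, assuming psycopg2 and the standard Postgres 10+ functions; connection details are placeholders, and while I believe these functions are available on Aurora Postgres 11, double-check on your instance:

# Sketch of a per-slot lag check; connection parameters are placeholders.
# confirmed_flush_lsn only advances once the consumer acknowledges data
# via send_feedback(), so it reflects what has actually been consumed.
import psycopg2

LAG_QUERY = """
    SELECT slot_name,
           pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS bytes_behind,
           pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)         AS bytes_retained
    FROM pg_replication_slots
    WHERE slot_type = 'logical';
"""

with psycopg2.connect(host='my-aurora-endpoint', dbname='mydb', user='monitor') as conn:
    with conn.cursor() as cur:
        cur.execute(LAG_QUERY)
        for slot_name, bytes_behind, bytes_retained in cur.fetchall():
            print(slot_name, bytes_behind, bytes_retained)

Note that restart_lsn can legitimately trail confirmed_flush_lsn (it marks the oldest WAL the slot may still need), which is why the redo_lsn - restart_lsn query can report gigabytes of "lag" even when the consumer is keeping up.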

Related

PostgreSQL ANALYZE statistics & Replication

On my primary I ran VACUUM and then ANALYZE on all databases. When I check pg_stat_user_tables, the last_analyze column shows a current timestamp, which is great.
When I check my replication instance, there are no values in the last_analyze column. I was assuming this timestamp would also eventually populate? Is this known behaviour?
The reason I ask is that after that VACUUM/ANALYZE on the primary, I'm running into some extremely slow queries on the replication instance. I ran a query with EXPLAIN prior to the VACUUM/ANALYZE and it ran in 5 seconds; now it's taking 65 seconds, and the plan shows it isn't using several indexes that it should be.
PostgreSQL has two different statistics systems. One records data about the distribution of values in the columns; this one is transactional and propagates to the replica via the WAL.
The other records data about turnover on the tables, including when the last vacuum/analyze was done. This system is used to decide when to schedule a new vacuum/analyze (to prevent the first system from getting too far out of date). It is not transactional and does not propagate to the replica.
So the replica has the latest column value distribution statistics (as soon as the WAL replays, anyway), but it doesn't know how recent they are.
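A quick way to see the two systems side by side is to compare pg_stats (the column-distribution statistics) and pg_stat_user_tables (the activity counters, including last_analyze) on the primary and the replica. A small sketch, with the table name and connection strings made up for illustration:

# Sketch: compare the two statistics systems on primary vs. replica.
# Table name and connection strings are placeholders.
import psycopg2

CHECKS = {
    # Column-distribution stats (pg_stats / pg_statistic): transactional,
    # shipped through the WAL, so the replica should return rows here.
    "distribution_stats": """
        SELECT attname, n_distinct
        FROM pg_stats
        WHERE schemaname = 'public' AND tablename = 'mytable';
    """,
    # Activity counters (pg_stat_user_tables): not WAL-logged, so
    # last_analyze stays NULL on the replica even after ANALYZE on the primary.
    "last_analyze": """
        SELECT relname, last_vacuum, last_analyze
        FROM pg_stat_user_tables
        WHERE relname = 'mytable';
    """,
}

for dsn in ("host=primary-host dbname=mydb", "host=replica-host dbname=mydb"):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        print(dsn)
        for label, sql in CHECKS.items():
            cur.execute(sql)
            print(" ", label, cur.fetchall())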

Loading data to Postgres RDS is still slow after tuning parameters

We have created an RDS Postgres instance (m4.xlarge) with 200 GB of storage (Provisioned IOPS). We are trying to upload data from our company data mart into 23 tables in RDS using DataStage. However, the uploads are quite slow: it takes about 6 hours to load 400K records.
Then I started tuning the following parameters according to Best Practices for Working with PostgreSQL:
autovacuum 0
checkpoint_completion_target 0.9
checkpoint_timeout 3600
maintenance_work_mem {DBInstanceClassMemory/16384}
max_wal_size 3145728
synchronous_commit off
Other than these, I also turned off Multi-AZ and backups. SSL is still enabled, though I'm not sure that changes anything. However, after all these changes there is still not much improvement. DataStage is already uploading data in parallel with ~12 threads, and write IOPS is around 40/sec. Is this value normal? Is there anything else I can do to speed up the data transfer?
In PostgreSQL, each INSERT statement costs at least one full network round trip, i.e. the latency between the database and the machine the data is being loaded from.
In AWS you have many options to improve performance.
For starters, you can copy your raw data onto an EC2 instance and import from there; however, you will likely not be able to use your DataStage tool unless it can be installed directly on that EC2 instance.
You can configure DataStage to use batch processing, where each INSERT statement actually contains many rows; generally, the more rows per statement, the faster (see the sketch below).
Disable data compression and make sure you've done everything you can to minimize latency between the two endpoints.
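DataStage configuration aside, the effect of batching is easy to demonstrate from any client. A sketch with psycopg2, where the table, columns, and connection string are made up for illustration:

# Sketch: one multi-row INSERT per round trip instead of one row per round trip.
# Table name, columns, and connection string are made up for illustration.
import psycopg2
from psycopg2.extras import execute_values

rows = [(i, "name-%d" % i) for i in range(400000)]  # stand-in for the data mart extract

with psycopg2.connect("host=my-rds-endpoint dbname=mydb user=loader") as conn:
    with conn.cursor() as cur:
        # page_size controls how many rows go into each INSERT ... VALUES statement,
        # so 400K rows become ~400 statements instead of 400K round trips.
        execute_values(
            cur,
            "INSERT INTO target_table (id, name) VALUES %s",
            rows,
            page_size=1000,
        )
    conn.commit()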

Large number of connections to kdb

I have a grid with over 10,000 workers, and I'm using qpython to append data to kdb. Currently, with 1,000 workers, about 40 of them fail to connect and send data on the first try, and top shows q at 100% CPU when that happens. As I scale to 10k workers, the problem will only escalate. The volume of data is only about 100 MB. I've tried running extra slaves, but kdb tells me I can't use them with the -P option, which I'm guessing I need for qpython. Any ideas how to scale to support 10k workers? My current idea is to write a server in between that buffers write requests and passes them to kdb; is there a better solution?
It amazes me that you're willing to dedicate 10,000 CPUs to Python but only a single one to kdb.
Simply run more kdb cores (q processes on other ports), and then have another process receive the updates from the ingestion cores. The tickerplant (u.q) is a good model for this.
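To make that concrete, here is a minimal sketch of how the workers could be spread across several ingestion processes, assuming qpython; the port list, worker id, and retry policy are made up for illustration, and the actual publishing call depends on your schema:

# Sketch (not the tickerplant itself): spread workers across several kdb
# ingestion processes and retry the initial connection with backoff.
# Ports, host name, and worker_id are assumptions for illustration.
import time
from qpython import qconnection

KDB_PORTS = [5010, 5011, 5012, 5013]   # one ingestion q process per port (assumed)

def open_connection(worker_id, host='kdb-host', retries=5, delay=0.5):
    """Pick a port by worker id and retry the handshake with simple backoff."""
    port = KDB_PORTS[worker_id % len(KDB_PORTS)]
    for attempt in range(retries):
        try:
            q = qconnection.QConnection(host=host, port=port)
            q.open()
            return q
        except Exception:
            time.sleep(delay * (2 ** attempt))
    raise RuntimeError('worker %d: could not reach kdb on port %d' % (worker_id, port))

# Usage: each worker sends its rows to its assigned ingestion process
# (q.sendSync(...) / q.sendAsync(...), then q.close()), and the ingestion
# processes forward to a single aggregating process, tickerplant-style.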

Spark Cassandra Connector has bad read performance, produces a lot of disk writes by kworker

Some followups at the bottom
I have a test installation of Spark and Cassandra where I have 6 nodes with 128GiB of RAM and 16 CPU cores each. Each node runs Spark and Cassandra. I set up my keyspace with the SimpleStrategy and a replication factor of 3 (i.e., fairly standard).
My table is very simple, like this:
create table if not exists mykeyspace.values (
  channel_id timeuuid,
  day int,
  time bigint,
  value double,
  primary key ((channel_id, day), time)
) with clustering order by (time asc)
time is simply a unix timestamp in nanoseconds (the measuring devices the values come from are that precise and this precision is wanted), day is this timestamp in days (i.e., days since 1970-01-01).
I now inserted about 200 GiB of values for about 400 channels and tested a very simple thing - calculate the 10-minute average of every channel:
sc.
  cassandraTable("mykeyspace", "values").
  map(r => (r.getLong("time"), r.getUUID("channel_id"), r.getDouble("value"))).
  // key by (10-minute bucket, channel), value is (sum, count)
  map(t => (t._1 / 600L / 1000000000L, t._2) -> (t._3, 1.0)).
  reduceByKey((a, b) => (a._1 + b._1) -> (a._2 + b._2)).
  // back to (bucket start in ns, channel, average)
  map(t => (t._1._1 * 600L * 1000000000L, t._1._2, t._2._1 / t._2._2))
When I now do this calculation, even without saving the result (i.e., using a simple count()), it takes a VERY long time and read performance is very poor.
When I do top on the nodes, Cassandra's java process takes about 800% CPU, which is OK because this is about half the load the node can take; the other half is taken by Spark.
However, I noticed a strange thing:
When I run iotop I expect to see a lot of disk read, but I see a lot of disk WRITE instead, all of which comes from kworker.
When I do iostat -x -t 10, I also see a lot of writes going on.
Swap is disabled.
When I run a similar calculation directly on the CSV files the data came from, which are stored in HDFS and loaded via sc.newAPIHadoopFile with a custom input format, the process finishes much faster (the calculation takes about an hour with Cassandra but about 5 minutes with files from HDFS).
So where can I start troubleshooting and tuning?
Followup 1
With the help of RussS (see comments) I discovered that logging was set to DEBUG. I disabled this, set logging to ERROR, and also disabled GC logging, but this did not change anything at all.
I also tried keyBy as the very same user pointed out, but this also did not change anything.
I also tried running the calculation locally, once from .NET and once from Scala, and there the database is accessed as expected, i.e., no writes.
Followup 2
I think I got it. For one thing, I didn't see the forest for the trees: the hour I stated earlier for 200 GiB still works out to about 56 MiB/s of throughput. Since the hardware I run my installation on is far from optimal (it is a high-performance server which runs Microsoft Hyper-V, which in turn runs the nodes as VMs, and the hard disks of this machine are quite slow), this is indeed the throughput I should expect. Since the host is just one machine with one RAID array, and the disks of the nodes are virtual HDDs on it, I can't expect the performance to magically go through the roof.
I also tried running Spark standalone, which improves the performance a bit (I now get about 75 MiB/s), and the constant writes are gone as well - I only get occasional spikes, which I expect are due to shuffling.
As for the CSV files being much faster: the raw data in CSV form is only about 50 GiB, my custom FileInputFormat reads it line by line, and it uses a very fast string-to-double parser which only understands the US number format but is faster than Java's parseDouble or Scala's toDouble. With this special tweaking I get about 170 MiB/s in YARN mode.
So I suppose I should, first of all, improve my CQL queries to limit the data that gets read, and try to tweak some YARN settings.
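As a sketch of the "limit what gets read" part, here is roughly what I have in mind using the connector's DataFrame API from PySpark; the host name, day values, and bucketing constant are assumptions, and whether a given filter is actually pushed down to Cassandra can be checked with explain():

# Sketch: read only selected partitions and compute 10-minute averages.
# Host name and the list of days are made up; keyspace/table match the schema above.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("channel-averages")
         .config("spark.cassandra.connection.host", "cassandra-node-1")
         .getOrCreate())

days_wanted = [18000, 18001, 18002]   # partition-key values to read (assumed)

values = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="mykeyspace", table="values")
          .load()
          # An IN/equality filter on a partition-key column can be pushed down,
          # so Cassandra only scans the requested partitions.
          .filter(F.col("day").isin(days_wanted)))

ten_minutes_ns = 600 * 1000000000
averages = (values
            .withColumn("bucket", (F.col("time") / ten_minutes_ns).cast("long"))
            .groupBy("channel_id", "bucket")
            .agg(F.avg("value").alias("avg_value")))

averages.count()   # same kind of action as before, just over less data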

Do NoSQL datacenter aware features enable fast reads and writes when nodes are distributed across high-latency connections?

We have a data system in which writes and reads can be made in a couple of geographic locations which have high network latency between them (crossing a few continents, but not this slow). We can live with 'last write wins' conflict resolution, especially since edits can't be meaningfully merged.
I'd ideally like to use a distributed system that allows fast, local reads and writes, and copes with the replication and write propagation over the slow connection in the background. Do the datacenter-aware features in e.g. Voldemort or Cassandra deliver this?
It's either this, or we roll our own, probably based on collecting writes using something like rsync and sorting out the conflict resolution ourselves.
You should be able to get the behavior you're looking for using Voldemort. (I can't speak to Cassandra, but I imagine it's similarly possible there.)
The key settings in the configuration will be:
replication-factor — This is the total number of times the data is stored. Each put or delete operation must eventually hit this many nodes. A replication factor of n means it can be possible to tolerate up to n - 1 node failures without data loss.
required-reads — The least number of reads that can succeed without throwing an exception.
required-writes — The least number of writes that can succeed without the client getting back an exception.
So for your situation, the replication factor would be set to whatever number makes sense for your redundancy requirements, while both required-reads and required-writes would be set to 1. Reads and writes would return quickly, with a concomitant risk of stale or lost data, and the data would only be replicated to the other nodes afterwards.
I have no experience with Voldemort, so I can only comment on Cassandra.
You can deploy Cassandra to multiple datacenters with an inter-DC latency higher than a few milliseconds (see http://spyced.blogspot.com/2010/04/cassandra-fact-vs-fiction.html).
To ensure fast local reads, you can configure the cluster to replicate your data to a certain number of nodes in each datacenter (see "NetworkTopologyStrategy"). For example, you can specify that there should always be two replicas in each datacenter. That way, even if you lose a node in a datacenter, you can still read your data locally.
Write requests can be sent to any node in a Cassandra cluster. So for fast writes, your clients would always speak to a local node. The node receiving the request (the "coordinator") will replicate the data to other nodes (in other datacenters) in the background. If nodes are down, the write request will still succeed and the coordinator will replicate the data to the failed nodes at a later time ("hinted handoff").
Conflict resolution is based on a client-supplied timestamp.
If you need more than eventual consistency, Cassandra offers several consistency options (including datacenter-aware options).
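To make the Cassandra side concrete, here is a small sketch with the DataStax Python driver; the datacenter names, host, keyspace, and table are placeholders. The keyspace is replicated into both datacenters with NetworkTopologyStrategy, and reads/writes use LOCAL_ONE so they only wait for a replica in the coordinator's local datacenter, while cross-DC replication happens in the background:

# Sketch with the DataStax Python driver; DC names, host, keyspace, and table
# are placeholders. LOCAL_ONE only waits for one replica in the local datacenter.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1"])          # a node in the local datacenter (assumed)
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS geo_data
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'dc_europe': 2,
        'dc_asia': 2
    }
""")

session.execute("""
    CREATE TABLE IF NOT EXISTS geo_data.items (
        id text PRIMARY KEY,
        payload text
    )
""")

write = SimpleStatement(
    "INSERT INTO geo_data.items (id, payload) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.LOCAL_ONE,
)
session.execute(write, ("item-1", "hello"))

read = SimpleStatement(
    "SELECT payload FROM geo_data.items WHERE id = %s",
    consistency_level=ConsistencyLevel.LOCAL_ONE,
)
print(session.execute(read, ("item-1",)).one())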