How can we monitor latency between processes in kdb? - kdb

What is the best way/tool to calculate the latency in my current kdb setup?
This is how the data flows:
linehandler (LH) pub -> TP -> RDB
Example 1: LH published at 22:00:00, TP received at 22:00:10 and RDB received at 22:00:20
Example 2: LH published at 22:00:00, TP received at 22:00:10 and RDB received at 22:00:50
So the latency table would look like this:
tableName | Total Latency (RDB-LH) | (TP-LH)  | (RDB-TP)
tab1      | 00:00:20               | 00:00:10 | 00:00:10
tab2      | 00:00:50               | 00:00:10 | 00:00:40
Updates arrive from LH to TP both periodically and irregularly.
I am not sure of the best way to capture the latency between the processes.
Is there a tool that already captures it, or do I need to set it up from scratch?
Note: We cannot introduce a new column, because the whole prod setup and the downstream subscriber systems would break.

It sounds like you have everything you need to monitor it with the time columns you have available.
If you are using ITRS/Geneos, you can have scripts query the RDB to check the latency between those time columns, or have the RDB periodically write the latency to a log and monitor its output. Any other monitoring system could be substituted here if it supports bash scripting and/or log tailing.
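As a rough illustration of the arithmetic only (in practice this would be a q query against the RDB), the sketch below builds the latency table from the two examples in the question. It is written in Python, the function and field names are hypothetical, and it assumes the LH, TP and RDB timestamps are already available for each update.

```python
from datetime import datetime, timedelta


def parse(hms):
    """Parse an HH:MM:SS string into a datetime (date part is irrelevant here)."""
    return datetime.strptime(hms, "%H:%M:%S")


def latency_row(table, lh_time, tp_time, rdb_time):
    """Return one row of the latency table for a single update."""
    return {
        "tableName": table,
        "total": rdb_time - lh_time,   # RDB - LH
        "tp_lh": tp_time - lh_time,    # TP - LH
        "rdb_tp": rdb_time - tp_time,  # RDB - TP
    }


# The two examples from the question.
rows = [
    latency_row("tab1", parse("22:00:00"), parse("22:00:10"), parse("22:00:20")),
    latency_row("tab2", parse("22:00:00"), parse("22:00:10"), parse("22:00:50")),
]

for r in rows:
    print(r["tableName"], r["total"], r["tp_lh"], r["rdb_tp"])
```

A monitoring script could run the equivalent query on a schedule and alert when any of the three differences exceeds a threshold.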
However, looking at RDB latency may not be very useful. A user query that takes 30 seconds or more will cause the next few TP updates to lag when compared against the TP time. If you think the RDB is underperforming, you should test the upd functionality in a UAT environment under high load to verify that it performs acceptably.
Also be wary of whether -t is set on the TP for batching. Typically this can be 1/5 second, but you should check what value it has and account for it in the latency statistics.
Usage logging on the RDB may also be useful: if you are seeing significant gaps between TP and RDB, you should review what users are doing on the process. User queries are the likely cause of latency, although stopping users from querying the RDB would defeat the purpose of having one.
The other option is to measure latency outside of the RDB in a dedicated process, such as TorQ's monitoring/data quality engine processes, which come out of the box and help keep your RDB leaner: https://www.aquaq.co.uk/torq-data-quality-system/
*Disclaimer: I work for AquaQ

Related

Automatic vacuum of table "cloudsqladmin.public.heartbeat"

We're experiencing constant outages in our back-end that seem to correlate with peaks of high CPU usage on our Cloud SQL Postgres instance (v9.6).
Looking at cloudsql.googleapis.com/postgres.log, those high CPU peaks also seem to correlate with the database running an automatic vacuum of table cloudsqladmin.public.heartbeat.
We haven't found any documentation on what this table is or why autovacuum runs on it so often (our own tables don't seem to be affected by it).
Is this normal? Should we tune the values for the autovacuum? Thanks in advance.
Looking at your graphs, there is no correlation between the CPU and the cloudsqladmin.public.heartbeat autovacuum.
Let's start with what the cloudsqladmin.public.heartbeat table is. It is a table used by the Cloud SQL High Availability process, which is better explained here:
Each second, the primary instance writes to a system database as a
heartbeat signal.
So the table is used internally to keep track of your instance's health. The autovacuum is triggered based on the doc David shared.
Now, if the Vacuum process generated the CPU spike, you would see the spike every minute/second.
So, straight answers to your questions:
Is this normal? Yes, the autovacuum and the cloudsqladmin.public.heartbeat table are completely normal from a Cloud SQL internal perspective, and they should not affect the instance in any way.
Should we tune the values for the autovacuum? No need for that. As mentioned, this process is not the one affecting the instance's CPU. You can hide the log entries containing "cloudsqladmin.public.heartbeat" and analyze the remaining ones from the time the spike occurred.
It is also worth looking at the backup processes that were triggered (there could be one at the same time) under Cloud SQL > Instance Details > Backups, but of course that's a different topic from the one described here :) .
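Hiding the heartbeat entries, as suggested above, can be sketched as a simple filter. This is a minimal Python illustration that assumes the log has been exported as plain text lines; the function name is hypothetical and the sample lines are made-up placeholders, not real Cloud SQL output.

```python
def drop_heartbeat_lines(lines, needle="cloudsqladmin.public.heartbeat"):
    """Return only the log lines that do not mention the heartbeat table."""
    return [line for line in lines if needle not in line]


# Placeholder lines for illustration only.
sample = [
    'automatic vacuum of table "cloudsqladmin.public.heartbeat"',
    "placeholder entry for some other event",
]

remaining = drop_heartbeat_lines(sample)
for line in remaining:
    print(line)
```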
Here's a recommendation that seems very relevant to your situation: https://www.netiq.com/documentation/cloud-manager-2-5/ncm-install/data/vacuum.html

Most efficient memory type for kdb+

I am currently configuring a server that will run a kdb+ tickerplant with several subscription processes. Is there an optimal physical memory type for realtime kdb data?
Check out the type sizes at http://code.kx.com/q/ref/card/#datatypes
The answer depends on what you mean by "efficient". By far the largest hit you take in latency is memory allocation, so the less you have to allocate the better. That means smaller types.
But of course you have to weigh that up against your use cases.
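To make the trade-off concrete, here is a back-of-the-envelope sketch in Python that compares the raw data size of two hypothetical schemas using the per-element sizes from the q datatypes reference. The schemas, column names and row count are assumptions for illustration, and the estimate ignores list overhead, allocator rounding and symbol storage.

```python
# Bytes per element for some kdb+ types, per the q datatypes reference.
TYPE_SIZE = {
    "boolean": 1, "byte": 1, "char": 1,
    "short": 2,
    "int": 4, "real": 4, "date": 4, "time": 4,
    "long": 8, "float": 8, "timestamp": 8,
}


def table_bytes(schema, rows):
    """Estimate raw column data size: sum of element size * row count."""
    return sum(TYPE_SIZE[t] * rows for t in schema.values())


# Hypothetical schemas: the same three columns with wide and narrow types.
wide = {"time": "timestamp", "price": "float", "size": "long"}
narrow = {"time": "time", "price": "real", "size": "int"}

rows = 10_000_000
wide_total = table_bytes(wide, rows)
narrow_total = table_bytes(narrow, rows)
print(wide_total, narrow_total)
```

The narrow schema halves the data to allocate, at the cost of millisecond rather than nanosecond time resolution and reduced numeric precision, which is the kind of use-case trade-off mentioned above.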
For your realtime data, always make sure the tickerplant inserts the time column so that the sorted attribute (`s#) is maintained on the time column for efficient querying.
The tickerplant itself publishes on a timer. The longer the timer, the smaller the hit on CPU, but then the TP is collecting data for a while before publishing. Again, weigh this up against your use cases. BTW, make sure your tickerplant is writing the log file to a fast local disk so as to decrease pub delay and iowait.
If you're operating under high load from multiple sources, consider OS tweaks too, like tcp quickack ( http://www.techrepublic.com/article/take-advantage-of-tcp-ip-options-to-optimize-data-transmission/). There are similar tweaks for memory allocation and disk i/o.

Informix to Postgres, continuous data replication algorithm

The master server is Informix, version varies from 9.40 to the latest, database is unlogged by design that can't be changed. Slave server is the latest PostgreSQL. Master and slave are separate machines, network latency is unpredictable. Master schema is statically defined, well known and does not change, so it's only the data that needs to be replicated. In the master, there are three types of tables:
1. Numeric data tables, usually one date column, one time column and 15-300 int columns keyed by 2-3 primary keys. The data is never changed, only added once in a set interval (15, 30, or 60 minutes) and deleted when the retention point is reached. The replication data set can be up to 80,000 rows but is usually in the range of hundreds. This data needs to be replicated one way, master to slave. There are about 30 tables of this type, and they need to be replicated all at once and as fast as possible, typically in under one minute after a new interval set has been committed to the master.
2. Mixed data tables, with date, time, int, and string types, 30-100 columns, again 2-3 primary keys. This data is also never changed; it is added continuously and deleted when the retention point is reached. The data set is up to 100,000 rows per hour. One-way replication is needed, master to slave. There are a few tables like that, usually fewer than 5.
3. Mixed data tables, with int and string types, fewer than 10 columns, 2-3 primary keys. The data largely stays intact, with occasional additions, edits or deletions. The usual replication set size is unpredictable, but will probably be in the low hundreds of rows. This data needs to be replicated both ways, as fast as possible. There are a few tables of this type, and they need to be synched independently.
I've been looking for an existing tool that could do what I need, but it looks like there is none that is open source. I'm probably going to write one for my needs, and I'm looking for advice from DB gurus on how to approach this task.
In my estimate, there's probably no single algorithm that would cover all the use cases so I may be in fact looking for two or three algorithms. Here's what I found so far:
Fire trigger on master changes, record row OIDs (does Informix have them?) to temp table, dump the changed rows to a file, transfer it and load up. Question: how to buffer the trigger? The master DB is unlogged (no transactions), so trigger will fire upon each INSERT. Additional strain on the master, not good.
Add a cron job on the slave that will pull the latest date/time keys from the master and, if the data is newer, pull it. Problem: although the update interval is defined, in reality it's based on the data source clock (not the master DB clock), which is guaranteed to vary from the slave server clock. Moreover, there can be several data sources, each with varying clocks, and the data needs to be replicated ASAP. The only way I see here is to constantly poll the master from the slave, hoping that by the time the poll comes in, the data is all committed (no transactions, remember?). Kludgy, slow, not good.
Add Informix as a foreign data wrapper in Postgres and run queries directly instead of bothering with replication. Pros: simplicity. Cons: the Informix connector seems to be in alpha stage, and the whole approach is an unknown factor at best.
I've been researching this topic for some time, and it seems that the core of the problem is the lack of transactions on the master side. If the master DB was logged, it would be much easier to replicate it, but without transactions the task suddenly becomes much more complicated. For one, how do I ensure that there are no dupes? Another one, how to avoid update loops in type 3 tables? Considering all that, how to make replication as fast-reacting as possible? I mean the delay between data update and sync start here, data transfer is another topic altogether.
Any input is appreciated.
If you can't change the master in any significant way you are going to have a heck of a time with any sort of replication. Your basic problem is that you have no real way to handle replicating changes in real time without tracking which changes have been replicated, and if you can't change the master, you can't add that. So the short answer is that replication is not a solution which can work for you. Given some of Informix's other features I would think twice about going about this as continuous replication.
This leads to other approaches. The big unknown factor is that the network may not be reliable enough to just link the databases. This could lead to anything from transactions hanging while waiting for data over a high-latency connection to all kinds of other problems. You might be able to get this to work with an ODBC FDW and an Informix provider, or with DBI-Link and DBD::Informix, but this strikes me as a problem in your current environment. You could, however, use these in a cron job to periodically populate a second PostgreSQL server closer to your own location, so I would not write the approach off entirely.
One way or another, it seems to me you need to get a copy of the data to your PostgreSQL server. You may want to do an ETL job to import the data periodically. You may want to use a secondary PostgreSQL server and FDWs or DBI-Link to pull in the data. But this is not likely to be real-time or continuous.
The tl;dr is that your environment isn't really set up to do this. For my money I would recommend an ETL approach and accept that your slave will not be in sync with the master.
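A periodic ETL pull of this kind can be sketched as follows. This is a minimal illustration, not a tested replication tool: it uses Python's sqlite3 with two in-memory databases standing in for Informix and PostgreSQL, and the table, column and function names are hypothetical. Each run re-reads an overlapping window from the master and relies on the primary key to discard rows already copied, so late-arriving rows are picked up without creating duplicates. In PostgreSQL the equivalent of `INSERT OR IGNORE` would be `INSERT ... ON CONFLICT DO NOTHING`.

```python
import sqlite3

SCHEMA = """CREATE TABLE readings (
    ts TEXT, source INTEGER, value INTEGER,
    PRIMARY KEY (ts, source))"""


def count(db):
    """Number of rows currently in the readings table."""
    return db.execute("SELECT COUNT(*) FROM readings").fetchone()[0]


def pull(master, slave, since):
    """Copy rows with ts >= since from master to slave, skipping known keys.

    Returns the number of rows that were actually new on the slave.
    """
    rows = master.execute(
        "SELECT ts, source, value FROM readings WHERE ts >= ?", (since,)
    ).fetchall()
    before = count(slave)
    slave.executemany("INSERT OR IGNORE INTO readings VALUES (?, ?, ?)", rows)
    slave.commit()
    return count(slave) - before


# Stand-ins for the two servers (assumption: real code would use DB drivers).
master = sqlite3.connect(":memory:")
slave = sqlite3.connect(":memory:")
master.execute(SCHEMA)
slave.execute(SCHEMA)

master.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [("2024-01-01 10:00", 1, 5), ("2024-01-01 10:00", 2, 7)],
)
first = pull(master, slave, "2024-01-01 00:00")

# A late row for the old interval plus a row for the next interval.
master.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [("2024-01-01 10:00", 3, 9), ("2024-01-01 10:15", 1, 6)],
)
second = pull(master, slave, "2024-01-01 00:00")
print(first, second, count(slave))
```

The size of the overlap window trades extra reads on the master against tolerance for late writes; this sketch does not handle deletions or the two-way tables.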

NServiceBus: How to stop distributor acting as a processing bottleneck (reduces rate 65%)

We have an event processing system that will process events sent directly from the source to handler process at 200 eps (events per second). The queues and message sends are transactional. Adding the NSB distributor between the event generator and the handler process reduces this rate from 200 eps to 70 eps. The disk usage and CPU on the distributor box become significantly higher as well.
Seen with commercial build of NServiceBus, version 2.6.0.1505.
Has anyone else seen this behaviour or have any advice?
One thing you can play with is where MSDTC is located. You can have your workers use the same MSDTC as the distributor, therefore downgrading the level of the transaction and speeding up commits. I would recommend if you do this that you cluster MSDTC to protect against failures.
Assuming you are operating on a DB, you could shard your databases to work on different sets of data. You could also move the DB(s) closer to the workers (onto the same machine).
I would also check the settings of your DB provider and MSMQ, as there are a few things to tweak there in terms of timeouts and such. Note that there is a trade-off when applying certain settings, but it sounds like you'd prefer the quickest throughput.
There are lots of other system-level things to check; I'll assume you've been through all those items (network/disk/RAM/etc).

Slony-I replication CPU usage

I have recently had to install Slony (version 2.0.2) at work. Everything works fine; however, my boss would like to lower the CPU usage on slave nodes during replication. Searching on the net does not reveal any blatantly obvious answers to this. Any suggestions that would help reduce CPU usage (or spread the update out over a longer period) would be very much appreciated!
Have you looked into general PostgreSQL tuning here? The server can waste a lot of CPU cycles doing redundant work if it's not given enough resources to work with, and the default config is extremely small. Tuning Your PostgreSQL Server is a useful guide here; shared_buffers and checkpoint_segments are the two parameters you might get some significant improvement from on a slave (many of the rest only really help with improving query time).
Magnus might be right, this could very well just be a symptom of the fact that your database has very high traffic. Slony effectively multiplies the resource usage of any given DML operation: not only is data CRUD'ed to the replication master, but every time that happens, a Slony trigger (think of it as a change listener) generates an identical transaction and forwards it to the Slon process, which runs it on other members of the cluster.
However, there are two other possible explanations/solutions to this issue:
A possible solution might be to run the slon processes on a separate machine from your database hosts. Even if you have a single-master/single-slave replication scheme, it is advantageous in terms of stability, role-segregation, and performance (that’s you) to run the slon replication daemons on a physically different set of hardware (on the same LAN segment, ideally). There is nothing about Slony that says it has to run on the same machine as a given database host, so putting it in a different location (think “traffic controller”) might relieve some of the resource load on your database hosts. This is also a good idea in terms of both machine stability and scalability.
There's also a chance that this is only a temporary problem caused by the fact that you recently started using Slony. When you first subscribe a new node to a replication set, that node (and, to some extent, its parent) experiences VERY heavy CPU load (and possibly disk load as well) during the subscription process. I'm not sure how it works under the covers, but, depending on how much data was already on the subscribed node, Slony will check the master’s data against every single piece of data present on the slave in the replicated tables, and copy data down to the slave if it is missing or different. These are potentially CPU-intensive operations. Especially in large databases, the process of subscription can take a very long time (it took over a day for me, but our database is over 20GB), during which CPU load will be very high.
A simple way to see what Slony is up to is to use pgAdmin’s Server Status viewer, which, while limited, will give you some useful info here. If there are a lot of “prepare table for replication” or “cleanup table after replication” operations in progress on the node that has a high CPU load, it’s probably because a subscription isn’t complete. pgAdmin’s status viewer isn’t too informative, however; there are more reliable ways of checking subscription progress using Slony directly. Section 4.7.6.4 in the Slony log-analysis documentation might help with that, as would reading the doc for SUBSCRIBE SET (pay special attention to the boxed warning message and the "Dangerous/Unintuitive Behavior" section).
A simple yet definitive hack to tell whether a set is still in the process of subscription is to run a MERGE SET and try to merge it with another set (empty or not). MERGE SET will fail with a "subscriptions in progress" error if subscription is still running. However, that hack won't work on Slony 2.1; MERGE SET will just wait until subscriptions are finished.
The best way to reduce the CPU usage would be to put less data into the database :-)
Other than that, you can experiment with sync_interval. It may be what you're looking for.