clickhouse: SYSTEM STOP MERGES does not work - merge

Actually, I want to use SYSTEM STOP MERGES TABLE to stop merges and then migrate some parts to another shard. My operations are as follows:
1. Import the data:
clickhouse-client -q "insert into DTDB.lineorder_demo FORMAT CSV" < lineorder.tbl;
2. Immediately after the import completes, I execute:
SYSTEM STOP MERGES DTDB.lineorder_demo
But within 15 minutes, the parts still merge and their count decreases.
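To confirm whether merges are really still running, a minimal check (assuming the same database and table names as above) is:

-- Hedged sketch: watch the active part count before and after SYSTEM STOP MERGES
SELECT count() AS active_parts
FROM system.parts
WHERE database = 'DTDB' AND table = 'lineorder_demo' AND active;

-- Any merge currently in progress for this table shows up here
SELECT * FROM system.merges
WHERE database = 'DTDB' AND table = 'lineorder_demo';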

Related

Execute Multiple SQL Insert Statements in parallel in Snowflake

I have a question about how it works when several SQL statements are executed in parallel in Snowflake.
For example, if I execute 10 insert statements on 10 different tables with the same base table - will the tables be loaded in parallel?
Since COPY and INSERT statements only write new partitions, they can run in parallel with other COPY or INSERT statements.
https://docs.snowflake.com/en/sql-reference/transactions.html#transaction-commands-and-functions states that "Most INSERT and COPY statements write only new partitions. Those statements often can run in parallel with other INSERT and COPY operations,..."
I assume that statements cannot run in parallel when they want to insert into the same micro partition. Is that correct or is there another explanation why locks on INSERTs can happen?
I execute 10 insert statements on 10 different tables
with the same base table - will the tables be loaded in parallel?
YES!
Look for multi-table INSERT in Snowflake: https://docs.snowflake.com/en/sql-reference/sql/insert-multi-table.html
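A hedged sketch of what a multi-table INSERT looks like (base_table, t1, t2, and the columns are invented for illustration, not from the question):

-- One statement loads several target tables from the same SELECT
INSERT ALL
  INTO t1 (id, val) VALUES (id, val)
  INTO t2 (id, val) VALUES (id, val)
SELECT id, val FROM base_table;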
You can execute queries in parallel by just adding a ">" symbol after each semicolon.
For example:
The statement below will submit all the listed queries to Snowflake in parallel. Note that it will not exit if an error is encountered in any of the queries.
snowsql -o log_level=DEBUG -o exit_on_error=true -q "select 1;>select * from SNOWSQLTABLE;>select 2;>select 3;>insert into TABLE values (1);>select * from SNOWLTABLE;>select 5;"
The statement below will run the queries one at a time and exit if any error is found.
snowsql -o log_level=DEBUG -o exit_on_error=true -q "select 1;select * from SNOWSQLTABLE;select 2;select 3;insert into SNOQSQLTABLE values (1);select * from SNOWLTABLE;select 5;"
Concurrency in Snowflake is managed with either multiple warehouses (compute resources) or enabling multi-clustering on a warehouse (one virtual warehouse with more than one cluster of servers).
https://docs.snowflake.com/en/user-guide/warehouses-multicluster.html
I'm working with a customer today that runs millions of SQL commands a day; they have many different warehouses, and most of these warehouses are set to multi-cluster "auto-scale" mode.
Specifically, for your question, it sounds like you have ten sessions connected, running inserts into ten tables by querying a single base table. I'd probably begin testing this with one virtual warehouse, configured with a minimum of one cluster and a maximum of three or four, and then run tests and review the results.
The size of the warehouse I would use would mostly be determined by how large the query is (the SELECT portion); you can start with something like a Medium and review the performance and query plans of the inserts to see whether that is the appropriate size.
When reviewing the plans, check for queuing time to see whether three or four clusters isn't enough; it probably will be fine.
Your query history will also indicate which "cluster_number" your query ran on within the virtual warehouse. This is one way to check how many clusters were running (the maximum cluster_number); another is to view the Warehouses tab in the web UI or to execute the "show warehouses;" command.
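As a rough illustration of that starting point (the warehouse name and settings below are illustrative, not something from the question):

-- Hedged sketch: a Medium warehouse that auto-scales between 1 and 4 clusters
CREATE WAREHOUSE IF NOT EXISTS insert_test_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4
  SCALING_POLICY = 'STANDARD';

-- Check how many clusters are running and which cluster each query landed on
SHOW WAREHOUSES;
SELECT query_id, warehouse_name, cluster_number
FROM TABLE(information_schema.query_history())
ORDER BY start_time DESC;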
Some additional links that might help you:
https://www.snowflake.com/blog/auto-scale-snowflake-major-leap-forward-massively-concurrent-enterprise-applications/
https://community.snowflake.com/s/article/Putting-Snowflake-s-Automatic-Concurrency-Scaling-to-the-Test
https://support.snowflake.net/s/question/0D50Z00009T2QTXSA3/what-is-the-difference-in-scale-out-vs-scale-up-why-is-scale-out-for-concurrency-and-scale-up-is-for-large-queries-

How can I automatically maintain a dump of modified rows in PostGreSql

So, I have a PostgreSQL DB. For some chosen tables in that DB I want to maintain a plain dump of the rows when they are modified. Note that this dump is not a recovery or backup dump; it is just a file which will hold the incremental rows. That is, whenever a row is inserted or updated, I want it appended to this file, or to a file in a folder. The idea is to load that folder into something like Hive periodically, so that I can run queries to check previous states of certain rows and columns. Now, these are very high-transaction tables and the dump does not need to be real time; it can be in batches, every hour. I want to avoid a trigger firing hundreds of times every minute. I am looking for something off the shelf, already available in PostgreSQL. I did some research, but everything is related to PostgreSQL backup, which is not the exact use case.
I have read some links like https://clarkdave.net/2015/02/historical-records-with-postgresql-and-temporal-tables-and-sql-2011/ and "Implementing history of PostgreSQL table", etc., but these are based on insert/update triggers and create the history table in PostgreSQL itself. I want to avoid both: I cannot keep the history in PostgreSQL, as it will soon be huge, and I do not want to keep writing to files through a trigger firing constantly.
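For reference, the kind of batched, hourly extraction described above might look roughly like the following; the table name, the updated_at timestamp column, and the output path are all assumptions, not something the question specifies:

-- Hypothetical hourly batch export of recently modified rows
COPY (
  SELECT *
  FROM my_table
  WHERE updated_at >= now() - interval '1 hour'
) TO '/var/dumps/my_table_incremental.csv' WITH (FORMAT csv);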

Bidirectional Replication Design: best way to script and execute unmatched row on Source DB to multiple subscriber DBs, sequentially or concurrently?

Thank you for any help or suggestions offered.
I am trying to build my own multi-master replication on PostgreSQL 10 in Windows, for a situation which cannot use any of the current 3rd-party tools for PG multi-master replication, and which can also involve another DB platform in a subscriber group (Sybase ADS). I have the following logic to create bidirectional replication, partially inspired by Bucardo's logic, between 1 publisher and 2 subscribers:
When an INSERT, UPDATE, or DELETE is made on the source table, a trigger on the source table adds a row to a meta table on the source DB that will act as a replication transaction to be performed on the 2 subscriber DBs which subscribe to it.
A NOTIFY signal will be sent to a service, or a script written in Python or some other scripting language will monitor for changes in the meta table or trigger execution, and will be able to do a table compare or script the statement to run on each subscriber database.
***I believe that triggers on the subscribers will need to be paused to keep them from pushing their received statements on to their own subscribers, i.e. if node A and node B both subscribe to each other's table A, then an update to node A's table A should replicate to node B's table A without then replicating back to table A in a bidirectional "ping-pong storm".
There will be a final compare between tables and the transaction will be closed. Re-enable triggers on subscribers if they were paused/disabled when pushing transactions from the step 2 addendum.
This will hopefully be able to be done bidirectionally, in order of timestamp, in FIFO order, unless I can figure out a way to create child processes to run the synchronizations concurrently.
For this, I am trying to figure out the best way to set up the service logic (essentially step 2 above, which has apparently been done using a daemon in Linux, but I have to work in Windows, making it run as, or resemble, a service/agent), or to come up with a reasonably easy and efficient design to send the source DB's statements to the subscriber DBs.
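A rough sketch of what the step 1 trigger could look like in PostgreSQL (the meta table, the id primary key column, and the function name are invented for illustration, not part of the actual design):

-- Hypothetical change-log/meta table on the source DB
CREATE TABLE replication_meta (
    id           bigserial PRIMARY KEY,
    source_table text        NOT NULL,
    operation    text        NOT NULL,   -- 'INSERT', 'UPDATE', or 'DELETE'
    row_pk       text        NOT NULL,   -- primary key of the changed row
    changed_at   timestamptz NOT NULL DEFAULT now()
);

-- Hypothetical trigger function: record the change and raise a NOTIFY for the service
CREATE OR REPLACE FUNCTION log_replication_change() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO replication_meta (source_table, operation, row_pk)
        VALUES (TG_TABLE_NAME, TG_OP, OLD.id::text);
    ELSE
        INSERT INTO replication_meta (source_table, operation, row_pk)
        VALUES (TG_TABLE_NAME, TG_OP, NEW.id::text);
    END IF;
    PERFORM pg_notify('replication_channel', TG_TABLE_NAME);
    RETURN NULL;   -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_log_replication_change
AFTER INSERT OR UPDATE OR DELETE ON table_a   -- table_a is a placeholder
FOR EACH ROW EXECUTE PROCEDURE log_replication_change();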
Does anyone see that this plan is faulty or may not work?
Disclaimer: I don't know anything about Postgresql but have done plenty of custom replication.
The main problem with bidirectional replication is merge issues.
If the same key is used in both systems with different attributes, which one gets to push their change? If you nominate a master it's easier. Then the slave just gets overwritten every time.
How much latency can you handle? It's much easier to take the 'notify' part out and just have a five-minute Windows Task Scheduler job that inspects log tables and pushes data around.
In other words, this kind of pattern:
A change occurs in a table. A database trigger on that table notes the change and writes the PK of the changed row to a change log table. A ReplicationBatch column in the log table is set to NULL by default.
A Windows scheduled task inspects all change log tables to find all changes that happened since the last run and 'reserves' these records by setting their ReplicationBatch column to a replication batch number,
i.e. you run an UPDATE LogTable SET ReplicationBatch = BatchNumber WHERE ReplicationBatch IS NULL
All records that have been marked are then replicated:
you run a SELECT * FROM LogTable WHERE ReplicationBatch = BatchNumber to get the records to be processed.
When complete, the reserved records are marked as complete, so the next time around only subsequent changes are replicated. This completion flag might be in the log table or it might be in a ReplicationBatch number table.
The main point is that you need to reserve records for replication, so that as you are replicating them out, additional log records can be added from the source without messing up the batch.
Then periodically you clear out the log tables.
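A minimal sketch of that reserve/replicate/complete cycle, with invented table and column names:

-- 1. Reserve all unbatched changes under a new batch number (42 is just an example)
UPDATE LogTable
SET    ReplicationBatch = 42
WHERE  ReplicationBatch IS NULL;

-- 2. Read back exactly the reserved records and replay them on each subscriber
SELECT *
FROM   LogTable
WHERE  ReplicationBatch = 42;

-- 3. Mark the batch as complete so the next run only picks up later changes
UPDATE LogTable
SET    ReplicationComplete = TRUE
WHERE  ReplicationBatch = 42;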

Postgresql / INSERT performance fall when doing a single SELECT during a batch of INSERTs

Context:
Using PostgreSQL (9.6), for a custom synchronisation project, we have an agent that makes a lot of INSERTs between a database_1 and a database_2 when syncing data.
For example: DB2 is down for 5 minutes and there are 40,000 new rows in DB1, so when DB2 is up again, all 40,000 rows are immediately synced from DB1 to DB2.
All this works great.
Problem/Fact:
During the synchronisation, the INSERT rate is around 1,000 rows/second.
However, when we do a simple SELECT count(*) FROM table during the sync (in the middle of these thousands of INSERTs), we notice that the INSERT rate falls to a few dozen per second (instead of ~1,000 per second).
Question:
Is there any reason why a SELECT operation (made inside pgAdmin, by a process other than the syncing process) would slow down the batch of INSERTs?
Any locking or internal reason that might explain this?
Or should I provide more information? How can I debug more?
Hints:
Logs are fully activated, and all the INSERTs always take around 0.700 ms (before the slowdown and likewise during the slowdown); that doesn't change.
INSERTs are currently performed one row at a time.
(I'll be happy to provide more information)
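One way to debug further would be to look for lock waits while the slowdown is happening; a generic sketch using the catalog views (nothing here is specific to this setup):

-- Sessions that are active and what, if anything, they are waiting on (9.6+)
SELECT pid, state, wait_event_type, wait_event, query
FROM   pg_stat_activity
WHERE  state <> 'idle';

-- Ungranted lock requests and the relations involved
SELECT l.pid, l.mode, l.granted, c.relname
FROM   pg_locks l
LEFT JOIN pg_class c ON c.oid = l.relation
WHERE  NOT l.granted;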

is it possible to fork a mysqldump of data?

I am restoring a MySQL database with Perl on a remote server with about 30 million records. It's taking more than 2 days, and looking at my network connections I am not fully utilizing my uplink bandwidth. I will need to do this at least once per week. Is there a way to fork a mysqldump (I'm using Perl) so that I can take full advantage of my bandwidth? (I don't mind if I'm choked off for a bit... I just need to get this done faster.)
Can't you upload the whole dump to the remote server and start the restore there?
A restore of a mysqldump is just the execution of a long series of commands that would restore your database from scratch. If the execution path for that is: 1) send command, 2) remote system executes command, 3) remote system replies that the command is complete, 4) send next command, then you are spending most of your time waiting on network latency.
I do know that most SQL hosts will allow you to upload a dump file specifically to avoid the kinds of restore time that you're talking about. The company that takes my money each month even has a web-based form that you can use to restore a database from a file that has been uploaded via sftp. Poke around your hosting service's documentation. They should have something similar. If nothing else (and you're comfortable on the command line) you can upload it directly to your account and do it from a shell there.
mk-parallel-dump and mk-parallel-restore are designed to do what you want, but in my testing mk-parallel-dump was actually slower than plain old mysqldump. Your mileage may vary.
(I would guess the biggest factor would be the number of spindles your data files reside on, which in my case, 1, was not especially conducive to parallelization.)
First caveat: mk-parallel-* writes a bunch of files, and figuring out when it's safe to start sending them (and when you're done receiving them) may be a little tricky. I believe that's left as an exercise for the reader, sorry.
Second caveat: mk-parallel-dump is specifically advertised as not being for backups. Because "At the time of this release there is a bug that prevents --lock-tables from working correctly," it's really only useful for databases that you know will not change, e.g., a slave that you can STOP SLAVE on with no repercussions, and then START SLAVE once mk-parallel-dump is done.
I think a better solution than parallelizing a dump may be this:
If you're doing your mysqldump on a weekly basis, you can just do it once (dumping with --single-transaction (which you should be doing anyway) and --master-data=n) and then start a slave that connects over an ssh tunnel to the remote master, so the slave is continually updated. The disadvantage is that if you want to clone a local copy (perhaps to make a backup) you will need enough disk to keep an extra copy around. The advantage is that a week's worth of (query-based) replication log is probably quite a bit smaller than resending the data, and also it arrives gradually so you don't clog your pipe.
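A hedged sketch of wiring up that slave from the coordinates recorded in the dump (the host, port, credentials, and binlog file/position below are placeholders; the real values come from the --master-data line in your dump and from your ssh tunnel):

-- Run on the slave, after loading the dump
CHANGE MASTER TO
  MASTER_HOST = '127.0.0.1',          -- local end of the ssh tunnel
  MASTER_PORT = 3307,                 -- tunnel port forwarded to the master's 3306
  MASTER_USER = 'repl',
  MASTER_PASSWORD = 'secret',
  MASTER_LOG_FILE = 'mysql-bin.000123',
  MASTER_LOG_POS = 4;
START SLAVE;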
How big is your database in total? What kind of tables are you using?
A big risk with backups using mysqldump has to do with table locking, and updates to tables during the backup process.
The mysqldump backup process basically works as follows:
For each table {
Lock table as Read-Only
Dump table to disk
Unlock table
}
The danger is that if you run an INSERT/UPDATE/DELETE query that affects multiple tables while your backup is running, your backup may not capture the results of your query properly. This is a very real risk when your backup takes hours to complete and you're dealing with an active database. Imagine: your code runs a series of queries that update tables A, B, and C. The backup process currently has table B locked.
The update to A will not be captured, as this table was already backed up.
The update to B will not be captured, as the table is currently locked for writing.
The update to C will be captured, because the backup has not reached C yet.
This is an easy way to destroy referential integrity in your database.
Your backup process needs to be atomic, and transactional. If you can't shut down the entire database to writes during the backup process, you're risking disaster.
Also, there must be something wrong here. At a previous company, we were running nightly backups of a 450 GB MySQL DB (the largest table had 150M rows), and it took less than 6 hours for the backup to complete.
Two thoughts:
Do you have a slave database? Run the backup from there: stop replication (preventing the read/write risk), run the backup, then restart replication (see the sketch below).
Are your tables using InnoDB? Consider investing in InnoDB Hot Backup, which solves this problem, as the backup process leverages the journaling that is part of the InnoDB storage engine.
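A minimal sketch of that slave-side backup window (the backup step itself is whatever tool you use, shown here only as a comment):

-- Run on the slave
STOP SLAVE;    -- freeze replication so the data set stays consistent during the backup
-- ... run mysqldump (or your backup tool of choice) against this slave here ...
START SLAVE;   -- resume replication; the slave catches up from the master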