Execute Multiple SQL Insert Statements in Parallel in Snowflake

I have a question about how it works when several SQL statements are executed in parallel in Snowflake.
For example, if I execute 10 insert statements on 10 different tables with the same base table - will the tables be loaded in parallel?

Since COPY and INSERT statements only write new partitions, they can run in parallel with other COPY or INSERT statements.
https://docs.snowflake.com/en/sql-reference/transactions.html

"https://docs.snowflake.com/en/sql-reference/transactions.html#transaction-commands-and-functions" states that "Most INSERT and COPY statements write only new partitions. Those statements often can run in parallel with other INSERT and COPY operations,..."
I assume that statements cannot run in parallel when they want to insert into the same micro partition. Is that correct or is there another explanation why locks on INSERTs can happen?

"I execute 10 insert statements on 10 different tables with the same base table - will the tables be loaded in parallel?"
YES!
Also look at multi-table INSERT in Snowflake: https://docs.snowflake.com/en/sql-reference/sql/insert-multi-table.html
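For illustration, here is a minimal sketch of an unconditional multi-table insert; the base table and target names (base_table, t1, t2) and their columns are placeholders, not from the question:

-- One statement, one scan of the base table, several target tables loaded
INSERT ALL
    INTO t1 (c1) VALUES (col_a)
    INTO t2 (c1) VALUES (col_b)
SELECT col_a, col_b
FROM base_table;

This is a single statement that loads several tables from one pass over the base table, rather than ten independent INSERT statements running concurrently.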

You can execute queries in parallel simply by adding a ">" symbol after each statement.
For example:
The statement below will submit all of the listed queries to Snowflake in parallel. It will not exit, though, if an error is encountered in any of the queries.
snowsql -o log_level=DEBUG -o exit_on_error=true -q "select 1;>select * from SNOWSQLTABLE;>select 2;>select 3;>insert into SNOWSQLTABLE values (1);>select * from SNOWSQLTABLE;>select 5;"
The statement below will run the queries one at a time and exit if any error is found.
snowsql -o log_level=DEBUG -o exit_on_error=true -q "select 1;select * from SNOWSQLTABLE;select 2;select 3;insert into SNOWSQLTABLE values (1);select * from SNOWSQLTABLE;select 5;"

Concurrency in Snowflake is managed either with multiple warehouses (compute resources) or by enabling multi-clustering on a warehouse (one virtual warehouse with more than one cluster of servers).
https://docs.snowflake.com/en/user-guide/warehouses-multicluster.html
I'm working with a customer today that runs millions of SQL commands a day; they have many different warehouses, and most of them are set to multi-cluster "auto-scale" mode.
Specifically, for your question, it sounds like you have ten sessions connected, each running an insert into one of ten tables by querying a single base table. I'd probably begin testing this with one virtual warehouse, configured with a minimum of one cluster and a maximum of three or four, and then run tests and review the results.
The size of the warehouse would mostly be determined by how large the query is (the SELECT portion). You could start with something like a Medium and review the performance and query plans of the inserts to see whether that size is appropriate.
When reviewing the plans, check the queuing time to see whether three or four clusters is enough; it probably will be fine.
Your query history will also indicate which "cluster_number" your query ran on within the virtual warehouse. This is one way to check how many clusters were running (the maximum cluster_number); another is to view the Warehouses tab in the web UI or to execute the "show warehouses;" command.
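To make that concrete, here is a sketch of the kind of multi-cluster warehouse described above and of checking cluster_number afterwards; the warehouse name and exact sizing are assumptions for illustration:

-- A Medium warehouse that can scale out to four clusters under concurrency
CREATE WAREHOUSE IF NOT EXISTS load_wh
    WAREHOUSE_SIZE    = 'MEDIUM'
    MIN_CLUSTER_COUNT = 1
    MAX_CLUSTER_COUNT = 4
    SCALING_POLICY    = 'STANDARD'
    AUTO_SUSPEND      = 300
    AUTO_RESUME       = TRUE;

-- Afterwards, see which cluster each insert ran on and whether scale-out occurred
-- (ACCOUNT_USAGE has some latency; the INFORMATION_SCHEMA QUERY_HISTORY table function is more immediate)
SELECT query_text, warehouse_name, cluster_number, total_elapsed_time
FROM snowflake.account_usage.query_history
WHERE warehouse_name = 'LOAD_WH'
ORDER BY start_time DESC;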
Some additional links that might help you:
https://www.snowflake.com/blog/auto-scale-snowflake-major-leap-forward-massively-concurrent-enterprise-applications/
https://community.snowflake.com/s/article/Putting-Snowflake-s-Automatic-Concurrency-Scaling-to-the-Test
https://support.snowflake.net/s/question/0D50Z00009T2QTXSA3/what-is-the-difference-in-scale-out-vs-scale-up-why-is-scale-out-for-concurrency-and-scale-up-is-for-large-queries-

Related

Upserting and maintaining a Postgres table using Apache Airflow

I'm working on an ETL process that requires me to pull data from one Postgres table and update data in another Postgres table in a separate environment (same column names). Currently, I am running the Python job on a Windows EC2 instance, using the pangres upsert library to update existing rows and insert new rows.
However, my organization wants me to move the Python ETL script to Managed Apache Airflow on AWS.
I have been learning DAGs, and most of the tutorials and articles are about querying data from a Postgres table using hooks or operators.
However, I am looking to understand how to update existing table A incrementally (i.e., upsert) using new records from table B (and ignore/overwrite existing matching rows).
Any chunk of code (DAG) that explains how to perform this simple task would be extremely helpful.
In Apache Airflow, operations are done using operators. You can package any Python code into an operator, but your best bet is always to use a pre-existing open source operator if one already exists. There is an operator for Postgres (https://airflow.apache.org/docs/apache-airflow-providers-postgres/stable/operators/postgres_operator_howto_guide.html).
It will be hard to provide a complete example of what you should write for your situation, but it sounds like the best approach here is to take any SQL present in your Python ETL script and use it with the Postgres operator. The documentation I linked should be a good example.
They demonstrate inserting data, reading data, and even creating a table as a pre-requisite step. Just like how in a Python script, lines execute one at a time, in a DAG, operators execute in a particular order, depending on how they're wired up, like in their example:
create_pet_table >> populate_pet_table >> get_all_pets >> get_birth_date
In their example, populating the pet table won't happen until the create pet table step succeeds, etc.
Since your use case is about copying new data from one table to another, a few tips I can give you:
Use a scheduled DAG to copy the data over in batches. Airflow isn't meant to be used as a streaming system for many small pieces of data.
Use the "logical date" of the DAG run (https://airflow.apache.org/docs/apache-airflow/stable/dag-run.html) in your DAG to know the interval of data that run should process. This works well for your requirement that only new data should be copied over during each run. It also gives you repeatable runs, so if you need to fix code you can re-run each run (one batch at a time) after pushing your fix.

DB2 Tables Not Loading when run in Batch

I have been working on a reporting database in DB2 for a month or so, and I have it set up to a pretty decent degree of what I want. I am, however, noticing small inconsistencies that I have not been able to work out.
Less important, but still annoying:
1) Users claim it takes two login attempts to connect; the first always fails, the second succeeds. (Is there a recommendation for what to check for this?)
More importantly:
2) Whenever I want to refresh the data (which will be nightly), I have a script that drops and then recreates all of the tables. There are 66 tables, each ranging from tens of records to just under 100,000 records. The data is not massive, and it takes about 2 minutes to load all 66 tables.
The issue is that once it says it has completed, there are usually at least 3-4 tables that did not load any data. So the table is dropped and then created, but is empty. The log shows that the command completed successfully, and if I run the statements independently the tables populate just fine.
If it helps, 95% of the commands are just CAST functions.
While I am sure I am not doing it the recommended way, is there a reason why a number of my tables are not populating? Are the commands executing too fast? Should I delay the CREATE after the DROP?
(This is DB2 Express-C 11.1 on Windows 2012 R2, The source DB is remote)
Example of my SQL:
DROP TABLE TEST.TIMESHEET;
CREATE TABLE TEST.TIMESHEET AS (
SELECT NAME00, CAST(TIMESHEET_ID AS INTEGER(34))TIMESHEET_ID ....
.. (for 5-50 more columns)
FROM REMOTE_DB.TIMESHEET
)WITH DATA;
It is possible to configure DB2 to tolerate certain SQL errors in nested table expressions.
https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.5.0/com.ibm.data.fluidquery.doc/topics/iiyfqetnint.html
When the federated server encounters an allowable error, the server allows the error and continues processing the remainder of the query rather than returning an error for the entire query. The result set that the federated server returns can be a partial or an empty result.
However, I assume that your REMOTE_DB.TIMESHEET is simply a nickname, and not a view with nested table expressions, and so any errors when pulling data from the source should be surfaced by DB2. Taking a look at the db2diag.log is likely the way to go - you might even be hitting a Db2 issue.
It might be useful to change your script to TRUNCATE and INSERT into your local tables and see if that helps avoid the issue.
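A minimal sketch of that, assuming TEST.TIMESHEET is created once with a compatible definition and then kept in place (the column list is abbreviated, as in your example):

-- Empty the local table and reload it instead of DROP/CREATE
-- (in Db2, TRUNCATE must be the first statement in its transaction)
TRUNCATE TABLE TEST.TIMESHEET IMMEDIATE;
INSERT INTO TEST.TIMESHEET
SELECT NAME00, CAST(TIMESHEET_ID AS INTEGER) AS TIMESHEET_ID
       -- ... remaining columns as in the original script
FROM REMOTE_DB.TIMESHEET;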
As you say, you are maybe not doing things the most efficient way. You could consider using cache tables to take a periodic copy of your remote data: https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.5.0/com.ibm.data.fluidquery.doc/topics/iiyvfed_tuning_cachetbls.html

Determine locks during process

I have a huge process (a program using ActiveRecord) which locks different tables for an amount of time.
Now I want to check all my locks during the process: which tables are locked and for how long. I could use the Activity Monitor, but I need more information.
Is there a tool like SQL Server Profiler that lists all locks taken during a process? Or is there a log table somewhere that I can check?
Further Information:
There is a process in our program which uses half of the tables in our database: it creates new rows, updates existing rows, selects information, and so on. The process currently runs only at night. Now they want to run it during the day, and I have to evaluate the feasibility of that request. I have already checked the source code, but I also want to check the database for long-held locks, table locks and the like, just to be sure. The idea is to start the process in our test environment and collect all the lock information. But I don't see all locks in the Activity Monitor, and I can't watch the Activity Monitor for an hour.
There are many DMVs that will help you gather lock stats. Run this query at whatever frequency you need through a SQL Agent job and log the output to a table for later analysis.
--This shows all the locks involved in each session
SELECT resource_type, resource_associated_entity_id,
       request_status, request_mode, request_session_id,
       resource_description
FROM sys.dm_tran_locks
WHERE resource_database_id = DB_ID();
--You can also use the sys.dm_exec_requests DMV to look at blocking and wait types
SELECT status, wait_type, last_wait_type, txt.text
FROM sys.dm_exec_requests ec
CROSS APPLY sys.dm_exec_sql_text(ec.sql_handle) txt;
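For the "log this to a table" part, one possible sketch (the table name dbo.lock_snapshot and its layout are assumptions):

-- One-time setup: a table to hold periodic lock snapshots
CREATE TABLE dbo.lock_snapshot (
    captured_at                   datetime2    NOT NULL DEFAULT SYSUTCDATETIME(),
    resource_type                 nvarchar(60),
    resource_associated_entity_id bigint,
    request_status                nvarchar(60),
    request_mode                  nvarchar(60),
    request_session_id            int,
    resource_description          nvarchar(256)
);

-- Run this from the scheduled SQL Agent job while the process is running
INSERT INTO dbo.lock_snapshot (resource_type, resource_associated_entity_id,
                               request_status, request_mode, request_session_id,
                               resource_description)
SELECT resource_type, resource_associated_entity_id,
       request_status, request_mode, request_session_id,
       resource_description
FROM sys.dm_tran_locks
WHERE resource_database_id = DB_ID();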

Redshift "INSERT INTO" blocked during a separate COPY

I have been playing with Redshift recently, and found an odd (or maybe not so odd) behavior. When a COPY (from S3) is in progress, if I do INSERT INTO in a completely different table in a different schema, the INSERT INTO query takes way too much time. When nothing else is running on the redshift cluster, the INSERT INTO query finishes within 3-5 minutes. But, when a COPY is in progress, the same INSERT INTO query takes 1-2 hours.
Looking at the Redshift dashboard, the odd thing is that read throughput is close to zero. Given that my INSERT INTO query contains a select, I would imagine that the read throughput would be higher. So, it feels like the COPY query is blocking all other writes. I have checked the LOCKs (STV_LOCKS) table and there is no conflict between LOCKS for COPY and INSERT INTO. Is it possible that the COPY query blocks all other writes?
Thanks in advance
You need to check the parameter group configuration for your cluster in the AWS console -> Workload Management Configuration.
Check the concurrency setting. By default it is 5; you can increase the value (the maximum is 50). This controls how many queries can run concurrently in a queue. When a COPY command is running it occupies some of those slots, so there may be no slot left for the INSERT INTO query. Increase the concurrency and check again.
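If it helps, one way to see the current slot count per WLM queue from SQL (a sketch; user-defined queues start at service class 6):

-- num_query_tasks is the concurrency (slot count) configured for each WLM queue
SELECT service_class, num_query_tasks
FROM stv_wlm_service_class_config
WHERE service_class >= 6;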
Hope this helps

How to run multiple transactions concurrently in PostgreSQL

I want to do some basic experiments on PostgreSQL, for example to generate deadlocks, to create non-repeatable reads, etc. But I could not figure out how to run multiple transactions at once to see such behavior.
Can anyone give me some ideas?
Open more than one psql session, one terminal per session.
If you're on Windows you can do that by launching psql via the Start menu multiple times. On other platforms open a couple of new terminals or terminal tabs and start psql in each.
I routinely do this when I'm examining locking and concurrency issues, used in answers like:
https://stackoverflow.com/a/12456645/398670
https://stackoverflow.com/a/12831041/398670
... probably more. A useful trick when you want to set up a race condition is to open a third psql session and BEGIN; LOCK TABLE the_table_to_race_on;. Then run statements in your other sessions; they'll block on the lock. ROLLBACK the transaction holding the table lock and the other sessions will race. It's not perfect, since it doesn't simulate offset-start-time concurrency, but it's still very helpful.
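A concrete sketch of that trick (the table name is a placeholder):

-- Session 3: take the lock; LOCK TABLE defaults to ACCESS EXCLUSIVE mode
BEGIN;
LOCK TABLE the_table_to_race_on;

-- Sessions 1 and 2: run the statements you want to race; they block on the lock

-- Session 3: release the lock and the blocked sessions proceed at (nearly) the same time
ROLLBACK;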
Other alternatives are outlined in this later answer on a similar topic.
pgbench is probably the best solution in your case. It allows you to test complex database resource contention, deadlocks, and multi-client, multi-threaded access.
To get deadlocks you can simply write a script like this (bench_script.sql):
-- Two or more pgbench clients holding SHARE locks and then inserting will deadlock
BEGIN;
LOCK TABLE schm.tbl IN SHARE MODE;
SELECT count(*) FROM schm.tbl;
INSERT INTO schm.tbl VALUES (1 + 9999 * random(), 'test descr');
END;
and pass it to pgbench with the -f parameter.
For more detailed pgbench usage I would recommend reading the official manual for PostgreSQL pgbench
and getting acquainted with my pgbench question that was resolved recently here.
Craig Ringer provided a way to open multiple transactions manually; if you find that inconvenient, you can use pgbench to run multiple transactions at once.