Improve performance of deletes on a table variable - tsql

I have seen performance tweaks for deletes on normal tables in T-SQL.
But are there any performance tweaks that can be applied to deletes on table variables?
EDIT
Here's an example:
The plot thickens: UserExclusionsEvaluate is actually a CTE, but I'm going to try to optimise around the table variable first (if possible). The CTE itself runs very quickly; it's just the delete that is slow.
DELETE FROM @UsersCriteria
FROM @UsersCriteria UsersCriteria
WHERE UserId IN (SELECT UserID FROM UserExclusionsEvaluate WHERE PushRuleExclusionMet = 1)
In its current incarnation, @UsersCriteria is:
DECLARE @UsersCriteria TABLE
(
UserId int primary key,
PushRuleUserCriteriaType int
)
I've tried @UsersCriteria without the primary key, and experimented with clustered and non-clustered indexes.
It's also possible the problem is with the IN; I've tried a JOIN on a subquery as well.
EDIT:
GOOD NEWS!
After lots of playing with the SQL, including moving the subquery into a chained CTE, attempting table hints, etc.
A simple change from a table variable to a temp table dramatically improved the performance.
Which is really interesting, as the delete ran fine by itself and the subquery (on the CTE) ran fine by itself, but mixing the two ran mega slow.
I'm guessing the optimiser can't kick in when a CTE is used in a subquery, or maybe when that is mixed with the delete.

Not really.
Unless you define a PK in the DECLARE, which may help: there are no statistics for table variables, and the table is assumed to have only 1 row.

Well, there is a limited amount you can do. However, if you have a large data set in the table variable and need better performance, you should be using a temp table instead.
You could also do the deletes in batches (say 1,000 at a time).
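A minimal sketch of the batching idea, assuming the temp-table form of the question's table and assuming UserExclusionsEvaluate can be queried like a view or table (in the question it is a CTE, which would have to be restated inside the loop); needs SQL Server 2005+ for DELETE TOP:
DECLARE @rows int
SET @rows = 1
WHILE @rows > 0
BEGIN
    -- Delete at most 1,000 matching rows per iteration to keep each batch small
    DELETE TOP (1000) FROM #UsersCriteria
    WHERE UserId IN (SELECT UserID FROM UserExclusionsEvaluate WHERE PushRuleExclusionMet = 1)
    SET @rows = @@ROWCOUNT
END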
Otherwise, show us your delete statement and we'll see if we spot anything that can be improved.

NO.
Table variables are transient and, apart from constraints declared inline (such as a primary key), can't be indexed. They have no statistics.
They are not intended to store very large amounts of data.
If you have a table variable that is big enough to give you performance problems when you delete from it, you're using them in an unintended way. Put that data into a #Temp table or a real table so you have more control.

Related

Postgres parallel/efficient load huge amount of data psycopg

I want to load many rows from a CSV file.
The files contain data like "article_name, article_time, start_time, end_time".
There is a constraint on the table: for the same article name, I don't insert a new row if the new article_time falls in an existing range [start_time, end_time] for the same article.
I.e. don't insert row y if there exists a row x with article_name_y = article_name_x and article_time_y inside the range [start_time_x, end_time_x].
I tried with psycopg by selecting the existing article names and checking manually whether there is an overlap: it took too long.
I tried again with psycopg, this time by setting an 'exclude using...' constraint and trying to insert with "on conflict do nothing" specified (so that it does not fail), but it was still too slow.
I tried the same thing again, this time inserting many values on each call to execute (psycopg): it got a little better (1M rows processed in almost 10 minutes), but still not as fast as it needs to be for the amount of data I have (500M+ rows).
I tried to parallelize by calling the same script many times on different files, but the timing didn't get any better, I guess because of the locks on the table each time we want to write something.
Is there any way to create a lock only on rows containing the same article_name, and not a lock on the whole table?
Could you please help with any idea to make this parallelizable and/or more time efficient?
Lots of thanks, folks.
Your idea with the exclusion constraint and INSERT ... ON CONFLICT is good.
You could improve the speed as follows:
Do it all in a single transaction.
Like Vao Tsun suggested, maybe COPY the data into a staging table first and do it all with a single SQL statement (sketched after this list).
Remove all indexes except the exclusion constraint from the table where you modify data and re-create them when you are done.
Speed up insertion by disabling autovacuum and raising max_wal_size (or checkpoint_segments on older PostgreSQL versions) while you load the data.
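A minimal sketch of the staging-table approach under assumed names: an articles table that already carries the exclusion constraint, and a placeholder CSV path (from psql you could use \copy instead of server-side COPY):
BEGIN;
-- Load the raw CSV into an unconstrained staging table first
CREATE TEMP TABLE staging (
    article_name text,
    article_time timestamp,
    start_time   timestamp,
    end_time     timestamp
);
COPY staging FROM '/path/to/file.csv' WITH (FORMAT csv);
-- One set-based insert; rows violating the exclusion constraint are skipped
INSERT INTO articles (article_name, article_time, start_time, end_time)
SELECT article_name, article_time, start_time, end_time
FROM staging
ON CONFLICT DO NOTHING;
COMMIT;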

Redshift query a daily-generated table

I am looking for a way to create a Redshift query that will retrieve data from a table that is generated daily. Tables in our cluster are of the form:
event_table_2016_06_14
event_table_2016_06_13
.. and so on.
I have tried writing a query that appends the current date to the table name, but this does not seem to work correctly (invalid operation):
SELECT * FROM concat('event_table_', to_char(getdate(),'YYYY_MM_DD'))
Any suggestions on how this can be performed are greatly appreciated!
I have tried writing a query that appends the current date to the table name, but this does not seem to work correctly (invalid operation):
Redshift does not support that. But you most likely won't need it.
Try the following (expanding on the answer from @ketan):
Create your main table with appropriate (for joins) DIST key, and COMPOUND or simple SORT KEY on timestamp column, and proper compression on columns.
Daily, create a temp table (use CREATE TABLE ... LIKE - this will preserve DIST/SORT keys), load it with daily data, VACUUM SORT.
Copy sorted temp table into main table using ALTER TABLE APPEND - this will copy the data sorted, and will reduce VACUUM on the main table. You may still need VACUUM SORT after that.
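A minimal sketch of that daily flow; the table names, S3 path, and IAM role are placeholders, and a regular staging table is used here rather than a temporary one:
-- Staging table inherits DIST and SORT keys from the main table
CREATE TABLE event_staging (LIKE event_table);
COPY event_staging
FROM 's3://my-bucket/events/2016-06-14/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
FORMAT AS CSV;
-- Sort the staged data so the append stays sorted
VACUUM SORT ONLY event_staging;
-- Move (not copy) the staged rows into the main table
ALTER TABLE event_table APPEND FROM event_staging;
DROP TABLE event_staging;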
After that, query your main table normally, probably giving it a range on the timestamp. Redshift is optimised for these scenarios, and 99% of the time you don't need to optimise table scans yourself - even on tables with billions of rows, scans take milliseconds to a few seconds. You may need to optimise elsewhere, but that's the second step.
To get insight in the performance of scans, use STL_QUERY system table to find your query ID, and then use STL_SCAN (or SVL_QUERY_SUMMARY) table to see how fast the scan was.
Your example is actually the main use case for ALTER TABLE APPEND.
I am assuming that you are creating a new table everyday.
What you can do is:
Create a view on top of the event_table_* tables (see the sketch below). Query your data using this view.
Whenever you create or drop a table, update the view.
If you want, you can avoid #2: instead of creating a new table every day, create empty tables for the next 1-2 years. So there is no need to update the view every day. However, do remember that there is an upper limit of 9,900 tables in Redshift.
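A minimal sketch of the view from the first step, assuming just the two daily tables from the question exist (extend the UNION ALL as tables are added or dropped):
CREATE OR REPLACE VIEW event_table_all AS
SELECT * FROM event_table_2016_06_13
UNION ALL
SELECT * FROM event_table_2016_06_14;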
Edit: If you always need to query today's table (instead of all tables, as I assumed originally), I don't think you can do that without updating your view.
However, you can modify your design to have just one table, with date as sort-key. So, whenever your table is queried with some date, all disk blocks that don't have that date will be skipped. That'll be as efficient as having time-series tables.
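A minimal sketch of that single-table design; the column names and DISTSTYLE are assumptions:
CREATE TABLE event_table (
    event_date date,
    event_type varchar(50),
    payload    varchar(1000)
)
DISTSTYLE EVEN
SORTKEY (event_date);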

How to solve log slowness with or without NoSql

I am having a problem with log searching speed and disk size.
The log is extremely big: it has about 220 million rows, takes about 25 GB of disk space, and some selects take several minutes to run.
How does it work?
The log is saved in the database using SQL Anywhere, currently version 9, and it will soon be migrated to 11 (we tried 12, but due to some driver problems we went back to 11).
The log consists of two tables (names changed to English so people here can understand):
LogTable
Id, DateTime, User, Url, Action and TableName.
Action is what the user did: insert/delete/update.
TableName is which table in the database was affected.
LogTableFields
Id, LogTable_Id, FieldName, NewValue, OldValue.
LogTable_Id is foreign key from LogTable.
FieldName is the field of the table from DB.
It is important to note that NewValue and OldValue are of type varchar, because every kind of field from the other tables (datetime, int, etc.) is recorded in them.
Why it was made this way?
Because we must record everything important. The system was made for an institutional department of traffic (I'm not sure that's the proper name in English, but you get an idea of what this is about), and sometimes they demand some kind of random report.
Until now, we have made our reports simply by running some SQL selects. However, they take several minutes to complete, even when filtered by datetime. That isn't an issue worth complaining about when reports aren't requested often.
But they are demanding more and more reports, so it has become necessary to create a feature in the software with a nice report. As we never know their needs in advance, we must go back to the log and unbury the data.
Some of the information requested is only in the log (e.g. which user improperly gave someone access to a vehicle).
Some ideas suggested until now:
Idea 1:
I did some research and was told to work with NoSQL, using CouchDB. But from the little I have read, I feel NoSQL isn't a solution for my problem; I can't argue why, because I have no experience with it.
Idea 2:
Separate the Log Tables physically from the Database or from the machine.
Idea 3: Create a mirror of every table with a version field to keep history.
I'd like a macro optimization or architecture change if needed.
This seems like a pretty standard audit table. I'm not sure you need to go to a NoSQL solution for this. 220 million rows will be comfortably handled by most RDBMSs.
It seems that the biggest problem is the table structure. Generally you flatten the table to improve logging speed and normalize it to improve reporting speed. As you can see these are conflicting.
If you were using something like MS SQL, you could build a single flat table for logging performance, then build a simple Analysis Services cube on top of it.
Another option would be to just optimize for reporting assuming you could maintain sufficient logging throughput. To do that, you may want to create a structure like this:
create table LogTable (
LogTableID int identity(1,1),
TableName varchar(100),
Url varchar(200)
)
create table LogUser (
LogUserID int identity(1,1),
UserName varchar(100)
)
create table LogField (
LogFieldID int identity(1,1),
FieldName varchar(100)
)
create table LogData (
LogDataID bigint identity(1,1),
LogDate datetime,
LogTableID int references LogTable(LogTableID),
LogFieldID int references LogField(LogFieldID),
LogUserID int references LogUser(LogUserID),
Action char(1), -- U = update, I = insert, D = delete
OldValue varchar(100),
NewValue varchar(100)
)
This should still be fast enough to log data quickly, but provide enough performance for reporting. Index design is also important, generally done in order of increasing cardinality, so something like LogData(LogTableID, LogFieldID, LogDate). You can also get fancy with partitioning to allow for parallelized queries.
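For instance, a hypothetical index following that column order (the index name is made up):
-- Low-cardinality columns first, the date last, matching the suggestion above
CREATE INDEX IX_LogData_Table_Field_Date
    ON LogData (LogTableID, LogFieldID, LogDate);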
Adding proper indices is going to be the biggest improvement you can make. You don't mention having any indices, so I assume you don't have any. That would make it very slow.
For example, limiting your query to a particular range of DateTime doesn't help at all unless you have an index on DateTime. Without an index, the database still needs to touch nearly all 25GB of data to find the few rows that are in the right time range. But with an index, it could quickly identify the few rows that are in the time range you care about.
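A hypothetical example using the LogTable columns from the question (the index name is made up):
-- Lets a date-range filter seek instead of scanning the whole table
CREATE INDEX idx_logtable_datetime ON LogTable ("DateTime");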
In general, you should always ask your database what plan it is using to execute a query that is taking too long. I'm not particularly familiar with Sql Anywhere, but I know it has a Plan Viewer that can do this. You want to identify big sequential scans and put indices on those fields instead.
I doubt you would see a measurable improvement from breaking up the table and using integer foreign keys. To the extent that your queries touch many columns, you'll just end up joining all those tables back together anyway.

PostgreSQL: Loading data into Star Schema efficiently

Imagine a table with the following structure on PostgreSQL 9.0:
create table raw_fact_table (text varchar(1000));
For the sake of simplification I only mention one text column, in reality it has a dozen. This table has 10 billion rows and each column has lots of duplicates. The table is created from a flat file (csv) using COPY FROM.
To increase performance I want to convert to the following star schema structure:
create table dimension_table (id int, text varchar(1000));
The fact table would then be replaced with a fact table like the following:
create table fact_table (dimension_table_id int);
My current method is to essentially run the following query to create the dimension table:
Create table dimension_table (id int, text varchar(1000), primary key(id));
then to fill the dimension table I use:
insert into dimension_table (select null, text from raw_fact_table group by text);
Afterwards I need to run the following query:
select id into fact_table from dimension inner join raw_fact_table on (dimension.text = raw_fact_table.text);
Just imagine the horrible performance I get by comparing all strings to all other strings several times.
On MySQL I could run a stored procedure during the COPY FROM. This could create a hash of a string and all subsequent string comparison is done on the hash instead of the long raw string. This does not seem to be possible on PostgreSQL, what do I do then?
Sample data would be a CSV file containing something like this (I use quotes also around integers and doubles):
"lots and lots of text";"3";"1";"2.4";"lots of text";"blabla"
"sometext";"30";"10";"1.0";"lots of text";"blabla"
"somemoretext";"30";"10";"1.0";"lots of text";"fooooooo"
Just imagine the horrible performance I get by comparing all strings to all other strings several times.
When you've been doing this a while, you stop imagining performance, and you start measuring it. "Premature optimization is the root of all evil."
What does "billion" mean to you? To me, in the USA, it means 1,000,000,000 (or 1e9). If that's also true for you, you're probably looking at between 1 and 7 terabytes of data.
My current method is to essentially run the following query to create the dimension table:
Create table dimension_table (id int, text varchar(1000), primary key(id));
How are you gonna fit 10 billion rows into a table that uses an integer for a primary key? Let's even say that half the rows are duplicates. How does that arithmetic work when you do it?
Don't imagine. Read first. Then test.
Read Data Warehousing with PostgreSQL. I suspect these presentation slides will give you some ideas.
Also read Populating a Database, and consider which suggestions to implement.
Test with a million (1e6) rows, following a "divide and conquer" process. That is, don't try to load a million at a time; write a procedure that breaks it up into smaller chunks. Run
EXPLAIN <sql statement>
You've said you estimate at least 99% duplicate rows. Broadly speaking, there are two ways to get rid of the dupes
Inside a database, not necessarily the same platform you use for production.
Outside a database, in the filesystem, not necessarily the same filesystem you use for production.
If you still have the text files that you loaded, I'd consider first trying outside the database. This awk one-liner will output unique lines from each file. It's relatively economical, in that it makes only one pass over the data.
awk '!arr[$0]++' file_with_dupes > file_without_dupes
If you really have 99% dupes, by the end of this process you should have reduced your 1 to 7 terabytes down to about 50 gigs. And, having done that, you can also number each unique line and create a tab-delimited file before copying it into the data warehouse. That's another one-liner:
awk '{printf("%d\t%s\n", NR, $0);}' file_without_dupes > tab_delimited_file
If you have to do this under Windows, I'd use Cygwin.
If you have to do this in a database, I'd try to avoid using your production database or your production server. But maybe I'm being too cautious. Moving several terabytes around is an expensive thing to do.
But I'd test
SELECT DISTINCT ...
before using GROUP BY. I might be able to do some tests on a large data set for you, but probably not this week. (I don't usually work with terabyte-sized files. It's kind of interesting, if you can wait.)
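For example, both of these forms return the unique text values from the question's raw_fact_table; the point is to EXPLAIN and time both on your own data before choosing:
-- Same result set, potentially different plans
SELECT DISTINCT text FROM raw_fact_table;
SELECT text FROM raw_fact_table GROUP BY text;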
Just two questions:
- Is it necessary to convert your data in 1 or 2 steps?
- May we modify the table while converting?
Running simpler queries may improve your performance (and the server load while doing it).
One approach would be:
generate dimension_table (if I understand it correctly, you don't have performance problems with this), maybe with an additional temporary boolean field...
repeat: choose one previously unselected entry from dimension_table, select every row from raw_fact_table containing it and insert them into fact_table. Mark the dimension_table record as done, and move on to the next... You can write this as a stored procedure, and it can convert your data in the background, using minimal resources...
Or another (probably better):
create fact_table with EVERY record from raw_fact_table AND a dimension_id column (so it includes both dimension_text and dimension_id columns)
create dimension_table
create an after insert trigger for fact_table which:
searches for the dimension_text in dimension_table
if not found, creates a new record in dimension_table
updates dimension_id to this id
in a simple loop, insert every record from raw_fact_table into fact_table (a sketch of this trigger approach follows)
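A hypothetical sketch of that trigger approach; all names are assumptions, dimension_table.id is assumed to be a serial, and a BEFORE INSERT trigger is used so the row can be modified before it is stored:
-- fact_table is assumed to temporarily carry the raw text alongside the id
CREATE OR REPLACE FUNCTION resolve_dimension() RETURNS trigger AS $$
DECLARE
    dim_id int;
BEGIN
    -- Look the text up in the dimension table
    SELECT id INTO dim_id FROM dimension_table WHERE text = NEW.dimension_text;
    IF dim_id IS NULL THEN
        -- Not found: create the dimension row and reuse its generated id
        INSERT INTO dimension_table (text) VALUES (NEW.dimension_text)
        RETURNING id INTO dim_id;
    END IF;
    NEW.dimension_id := dim_id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER fact_table_resolve_dimension
BEFORE INSERT ON fact_table
FOR EACH ROW EXECUTE PROCEDURE resolve_dimension();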
You are omitting some details there at the end, but I don't see that there necessarily is a problem. It is not in evidence that all strings are actually compared to all other strings. If you do a join, PostgreSQL could very well pick a smarter join algorithm, such as a hash join, which might give you the same hashing that you are implementing yourself in your MySQL solution. (Again, your details are hazy on that.)
-- add unique index
CREATE UNIQUE INDEX uidx ON dimension_table USING hash(text);
-- for a case-insensitive match, use hash(upper(text))
Try both hash(text) and btree(text) to see which one is faster.
I can see several ways of solving your problem.
There is md5 function in PostgreSql
md5(string) Calculates the MD5 hash of string, returning the result in hexadecimal
insert into dimension_table (select null, md5(text), text from raw_fact_table group by text)
add an md5 field to raw_fact_table as well
select id into fact_table from dimension inner join raw_fact_table on (dimension.md5 = raw_fact_table.md5);
Indexes on the MD5 field might help as well.
Or you can calculate MD5 on the fly while loading the data.
For example our ETL tool Advanced ETL processor can do it for you.
Plus, it can load data into multiple tables at the same time.
There are a number of online tutorials available on our web site.
For example, this one demonstrates loading a slowly changing dimension:
http://www.dbsoftlab.com/online-tutorials/advanced-etl-processor/advanced-etl-processor-working-with-slow-changing-dimension-part-2.html

Best use of indices on temporary tables in T-SQL

If you're creating a temporary table within a stored procedure and want to add an index or two on it, to improve the performance of any additional statements made against it, what is the best approach? Sybase says this:
"the table must contain data when the index is created. If you create the temporary table and create the index on an empty table, Adaptive Server does not create column statistics such as histograms and densities. If you insert data rows after creating the index, the optimizer has incomplete statistics."
but recently a colleague mentioned that if I create the temp table and indices in a different stored procedure to the one which actually uses the temporary table, then the Adaptive Server optimiser will be able to make use of them.
On the whole, I'm not a big fan of wrapper procedures that add little value, so I've not actually got around to testing this, but I thought I'd put the question out there, to see if anyone had any other approaches or advice?
A few thoughts:
If your temporary table is so big that you have to index it, then is there a better way to solve the problem?
You can force it to use the index (if you are sure that the index is the correct way to access the table) by giving an optimiser hint, of the form:
SELECT *
FROM #table (index idIndex)
WHERE id = @id
If you are interested in performance tips in general, I've answered a couple of other questions about that at some length here:
Favourite performance tuning tricks
How do you optimize tables for specific queries?
What's the problem with adding the indexes after you put data into the temp table?
One thing you need to be mindful of is the visibility of the index to other instances of the procedure that might be running at the same time.
I like to add a guid to these kinds of temp tables (and to the indexes), to make sure there is never a conflict. The other benefit of this approach is that you could simply make the temp table a real table.
Also, make sure that you will need to query the data in these temp tables more than once during the running of the stored procedure, otherwise the cost of index creation will outweigh the benefit to the select.
In Sybase, if you create a temp table and then use it in one proc, the plan for the select is built using an estimate of 100 rows in the table (the plan is built when the procedure starts, before the tables are populated). This can result in the temp table being table-scanned, since it is only "100 rows". Calling another proc causes Sybase to build the plan for the select with the actual number of rows, which allows the optimizer to pick a better index to use. I have seen significant improvements using this approach, but test on your database, as sometimes there is no difference.
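A hypothetical sketch of that two-procedure pattern (T-SQL-style syntax; all table, column, and procedure names are made up, and the procs rely on deferred name resolution for the temp table):
-- Inner proc: compiled at execution time, when #WorkTable already contains
-- real rows, so the optimizer sees actual row counts instead of the 100-row guess
CREATE PROCEDURE InnerSelectProc AS
BEGIN
    SELECT * FROM #WorkTable WHERE Id = 42
END
GO
-- Outer proc: creates and populates the temp table, then calls the inner proc
CREATE PROCEDURE OuterLoadProc AS
BEGIN
    CREATE TABLE #WorkTable (Id int PRIMARY KEY, Val int)
    INSERT INTO #WorkTable (Id, Val)
    SELECT Id, Val FROM SourceTable
    EXEC InnerSelectProc
END
GO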