How do I find load history of a Redshift table?

I loaded data that isn't passing QA, and need to go back to a previous day where the data matched QA results. Is there a system table that I can query to see dates where the table was previously loaded or copied from S3?
I'm mostly interested in finding the date information of previous loads.

SELECT Substring(querytxt, 5, Regexp_instr(querytxt, ' FROM') - 5),
       querytxt
FROM   stl_query s1
JOIN   stl_load_commits s2 ON s1.query = s2.query
WHERE  Upper(s1.querytxt) LIKE '%COPY%'
AND    Lower(s1.querytxt) LIKE '%s3://%'
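If you specifically need the dates of the previous loads, a variant along these lines should do it (a sketch, untested; it relies on the stl_query.starttime and stl_load_commits.curtime timestamp columns, and keep in mind that the STL system tables only keep a few days of history, so older loads will not show up here):
-- one row per COPY from S3, with the time the load started and committed
SELECT s1.query,
       s1.starttime,
       Max(s2.curtime) AS commit_time,
       s1.querytxt
FROM   stl_query s1
JOIN   stl_load_commits s2 ON s1.query = s2.query
WHERE  Upper(s1.querytxt) LIKE '%COPY%'
AND    Lower(s1.querytxt) LIKE '%s3://%'
GROUP  BY s1.query, s1.starttime, s1.querytxt
ORDER  BY s1.starttime DESC;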

You can use these system tables: STL_LOAD_COMMITS records every load (COPY), and STL_LOAD_ERRORS records load errors.
There is no direct way to roll back a load. What we need to do is copy the data into some _QA_passed table before each fresh load, and rename the tables if the new load has issues.
The other way: if the rows carry a load date, you can use a DELETE query to remove the fresh data that is not good, and then run VACUUM to free up the space.
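A minimal sketch of that second approach, assuming the table is called my_table and carries a load_date column (both names hypothetical):
-- remove the rows that came from the bad load, identified by their load date
DELETE FROM my_table
WHERE load_date = '2016-01-15';   -- hypothetical date of the bad load

-- reclaim the space freed by the delete and re-sort the table
VACUUM my_table;

-- refresh the planner statistics after the large delete
ANALYZE my_table;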

Related

Talend - Insert/Update versus table-file-table

I have been using insert/update to update or insert a table in mysql from sql server. The job is set up as a cronjob. The job runs every 8 hours. The number of records in the source table is around 400000. Every 8 hours around 100 records might get updated or inserted.
I run the job in such a way that, at the source level, I only take the rows modified between the last run and the current run.
I have observed that just to update / insert 100 rows the time taken is 30 minutes.
However, another way would be to dump all 400000 rows to a file, truncate the destination table, and insert all of those records over again, at every job run.
So, may I know why insert/update takes so much time?
Thanks
Rathi
As you said, you run the job in such a way that, at the source level, you only take the rows modified between the last run and the current run.
So just insert all of these modified rows into a temp table.
Take the minimum modified date from the temp table (or use the same criteria you use to extract only the modified rows from the source) and delete the matching rows from the destination table.
Then you can insert all the rows from the temp table into the final table, as sketched below.
Let me know if you have any questions.
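A rough sketch of the delete-and-insert part on the MySQL side, assuming the Talend job has already written the modified rows into a staging table called stage_rows and that the destination table target_table has a primary key id (all names hypothetical):
-- remove the old versions of the rows that were modified at the source
DELETE t
FROM   target_table AS t
JOIN   stage_rows   AS s ON s.id = t.id;

-- re-insert the fresh versions from the staging table
INSERT INTO target_table
SELECT * FROM stage_rows;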
Without knowing how your database is configured, it's hard to tell the exact reason, but I'd say the updates are slow because you don't have an index on your target table.
Try adding an index on your insert/update key column; it will speed things up.
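For example, assuming the destination is a MySQL table called target_table and the insert/update key column is id (both names hypothetical):
-- index the key column so each update/lookup no longer scans the whole table
CREATE INDEX idx_target_id ON target_table (id);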
Also, are you doing a commit after each insert? If so, disable autocommit and only commit on success, like this: tMysqlOutput -- OnComponentOk -- tMysqlCommit.

IBM DB2 9.7 archiving specific tables and columns

I would like to periodically, e.g. once a year, archive a set of table rows from our DB2 9.7 database based on some criteria. So e.g. once a year, archive all EMPLOYEE rows that have a creation date older than 1 year ago?
By archive I mean that the data is moved out of the DB schema and stored in some other location, in a retrievable format. Is this possible to do?
System doesn't need to access archived data
If your program doesn't need to access the archived data, then I would suggest this:
create an export script (reference here), e.g.:
echo '===================== export started ';
values current time;
-- maybe ixf format would be better?
export to tablename.del of del
select * from tablename
where creation_date < (current date - 1 year)
;
echo '===================== export finished ';
create delete db2 script, e.g.:
echo '===================== delete started ';
values current time;
delete from tablename
where creation_date < (current date - 1 year)
;
commit;
echo '===================== delete finished ';
write a batch script that calls everything and copies the newly exported file to a safe location. The script must ensure that the delete is not run until the data has been stored safely:
db2 connect to db user xx using xxx
db2 -s -vtf export.sql
7z a safe-location-<date-time>.7z tablename.del
if no errors have occurred so far (check the return codes of the previous commands):
db2 -s -vtf delete.sql
register batch script as a cron job to do this automatically
Again, since deleting is a very sensitive operation, I would suggest having more than one backup mechanism to ensure that no data is lost (e.g. let the delete use a different time frame than the export, such as only deleting rows older than 1.5 years).
System should access archived data
If you need your system to access archived data, then I would suggest one of the following methods:
export / import to other db or table / delete
stored procedure which does select + insert to other db or table / delete - for example, you can adapt the stored procedure in answer no. 3 to this question
do table partitioning - reference here
Sure, why not? One fairly straightforward way is to write a stored procedure that basically would:
extract all records of a given table you wish to archive into a temp table,
insert those temp records into the archive table,
delete from the given table where the primary key is IN the temp table
If you wanted only a subset of columns to go into your archive, you could extract from a view containing just those columns, as long as you still capture the primary key in your temp table.
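A plain-SQL sketch of those three steps, which could then be wrapped in a stored procedure or a script. The names employee, employee_archive, emp_id and creation_date are assumptions, employee_archive must already exist with a compatible layout, and the declared temporary table needs a USER TEMPORARY tablespace:
-- 1. stage the keys of the rows to archive in a temp table
DECLARE GLOBAL TEMPORARY TABLE session.emp_to_archive (emp_id INTEGER)
    ON COMMIT PRESERVE ROWS NOT LOGGED;

INSERT INTO session.emp_to_archive (emp_id)
SELECT emp_id
FROM   employee
WHERE  creation_date < (CURRENT DATE - 1 YEAR);

-- 2. copy the staged rows into the archive table
INSERT INTO employee_archive
SELECT e.*
FROM   employee e
JOIN   session.emp_to_archive t ON t.emp_id = e.emp_id;

-- 3. remove them from the live table
DELETE FROM employee
WHERE  emp_id IN (SELECT emp_id FROM session.emp_to_archive);

COMMIT;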

Postgresql, query results to new table

Windows/NET/ODBC
I would like to get query results into a new table in some handy way, so that I can see them through a data adapter, but I can't find a way to do it.
There are not many examples around at a beginner's level on this.
I don't know whether it should be temporary or not, but after looking at the results the table is no longer needed, so I can delete it by hand or it can be dropped automatically.
This is what I try:
mCmd = New OdbcCommand("CREATE TEMP TABLE temp1 ON COMMIT DROP AS " & _
"SELECT dtbl_id, name, mystr, myint, myouble FROM " & myTable & " " & _
"WHERE myFlag='1' ORDER BY dtbl_id", mCon)
n = mCmd.ExecuteNonQuery
This runs without error, and in n I get the correct number of matched rows!!
But in pgAdmin I can't see that table anywhere, no matter whether I look inside the open transaction or after the transaction is closed.
Second, should I define the columns of the temp1 table first, or can they be created automatically based on the query results (that would be nice!)?
Please give a minimal example, based on the code above, of what to do to get a new table filled with the query results.
A shorter way to do the same thing your current code does is with CREATE TEMPORARY TABLE AS SELECT ... . See the entry for CREATE TABLE AS in the manual.
Temporary tables are not visible outside the session ("connection") that created them; they're intended as a temporary location for data that the session will use in later queries. If you want a created table to be accessible from other sessions, don't use a TEMPORARY table.
Maybe you want UNLOGGED (9.1 or newer) for data that's generated and doesn't need to be durable, but must be visible to other sessions?
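For example, a non-temporary variant of the query from the question (a sketch; the column names are copied from the question, mytable stands for the real table name, and UNLOGGED needs PostgreSQL 9.1 or newer):
-- same query as in the question, but into an UNLOGGED table
-- that stays visible to other sessions such as pgAdmin
CREATE UNLOGGED TABLE results1 AS
SELECT dtbl_id, name, mystr, myint, myouble
FROM   mytable
WHERE  myFlag = '1'
ORDER  BY dtbl_id;

-- drop it by hand once you are done with it
DROP TABLE results1;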
See related: Is there a way to access temporary tables of other sessions in PostgreSQL?

syntax for COPY in postgresql

INSERT INTO contacts_lists (contact_id, list_id)
SELECT contact_id, 67544
FROM plain_contacts
Here I want to use the COPY command instead of INSERT to reduce the time it takes to insert the values. I fetched the data using a SELECT. How can I insert it into a table using the COPY command in PostgreSQL? Could you please give an example, or any other suggestion for reducing the time it takes to insert the values?
As your rows are already in the database (because you apparently can SELECT them), then using COPY will not increase the speed in any way.
To be able to use COPY you would first have to write the values to a text file, which is then read back into the database. But if you can SELECT them, writing to a text file is a completely unnecessary step and will slow down your insert, not speed it up.
Your statement is as fast as it gets. The only thing that might speed it up (apart from buying a faster harddisk) is to remove any potential index on contact_lists that contains the column contact_id or list_id and re-create the index once the insert is finished.
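A sketch of that last suggestion, assuming there is an index on contacts_lists (contact_id) (the index name is hypothetical):
-- drop the index before the bulk insert ...
DROP INDEX IF EXISTS contacts_lists_contact_id_idx;

INSERT INTO contacts_lists (contact_id, list_id)
SELECT contact_id, 67544
FROM plain_contacts;

-- ... and recreate it once the insert has finished
CREATE INDEX contacts_lists_contact_id_idx
    ON contacts_lists (contact_id);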
You can find the syntax described in many places, I'm sure. One of those is this wiki article.
It looks like it would basically be:
COPY (SELECT contact_id, 67544 FROM plain_contacts) TO 'some_file'
And
COPY contacts_lists (contact_id, list_id) FROM 'some_file'
But I'm just reading from the resources that Google turned up. Give it a try and post back if you need help with a specific problem.

Faster way to transfer table data from linked server

After much fiddling, I've managed to install the right ODBC driver and have successfully created a linked server on SQL Server 2008, by which I can access my PostgreSQL db from SQL server.
I'm copying all of the data from some of the tables in the PgSQL DB into SQL Server using merge statements that take the following form:
with mbRemote as
(
select
*
from
openquery(someLinkedDb,'select * from someTable')
)
merge into someTable mbLocal
using mbRemote on mbLocal.id=mbRemote.id
when matched
/*edit*/
/*clause below really speeds things up when many rows are unchanged*/
/*can you think of anything else?*/
and not (mbLocal.field1=mbRemote.field1
and mbLocal.field2=mbRemote.field2
and mbLocal.field3=mbRemote.field3
and mbLocal.field4=mbRemote.field4)
/*end edit*/
then
update
set
mbLocal.field1=mbRemote.field1,
mbLocal.field2=mbRemote.field2,
mbLocal.field3=mbRemote.field3,
mbLocal.field4=mbRemote.field4
when not matched then
insert
(
id,
field1,
field2,
field3,
field4
)
values
(
mbRemote.id,
mbRemote.field1,
mbRemote.field2,
mbRemote.field3,
mbRemote.field4
)
WHEN NOT MATCHED BY SOURCE then delete;
After this statement completes, the local (SQL Server) copy is fully in sync with the remote (PgSQL server).
A few questions about this approach:
is it sane?
it strikes me that an update will be run over all fields in local rows that haven't necessarily changed. The only prerequisite is that the local and remote id fields match. Is there a more fine-grained approach, or a way of constraining the merge statement to only update rows that have actually changed?
That looks like a reasonable method if you're not able or willing to use a tool like SSIS.
You could add in a check on the when matched line to check if changes have occurred, something like:
when matched and mbLocal.field1 <> mbRemote.field1 then
This may be unwieldy if you have more than a couple of columns to check, so you could add a check column (like LastUpdatedDate, for example) to make this easier.
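As a sketch, the when matched part of the MERGE from the question could then look something like this (LastUpdatedDate is a hypothetical column that both sides would have to maintain):
when matched
/*only touch rows whose change marker differs*/
and mbLocal.LastUpdatedDate <> mbRemote.LastUpdatedDate
then
update
set
mbLocal.field1=mbRemote.field1,
mbLocal.field2=mbRemote.field2,
mbLocal.field3=mbRemote.field3,
mbLocal.field4=mbRemote.field4,
mbLocal.LastUpdatedDate=mbRemote.LastUpdatedDate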