Cloudant/Db2 - How to determine if a database table row was read from? - db2

I have two databases - Cloudant and IBM Db2. I have a table in each of these databases that holds static data that is only read from and never updated. These were created a long time ago and I'm not sure if they are used today, so I wish to do a clean-up.
I want to determine if these tables, or rows from these tables, are still being read from.
Is there a way to record the read timestamp (or at least know if it is simply accessed like a dirty bit) on a row of the table when it is read from?
OR
Record the read timestamp of the entire table (if any record from it is accessed)?

Db2 has the SYSCAT.TABLES.LASTUSED system catalog column, which records when the table as a whole was last used by a DML statement.
There is no way to track read access to each individual table row.
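A minimal sketch of how the LASTUSED check could be automated, assuming the catalog rows have already been fetched with whatever Db2 driver you use. The schema name MYSCHEMA, the function name stale_tables and the sample rows are hypothetical; the sketch also assumes that a table with no recorded use shows the date 0001-01-01.

```python
from datetime import date, timedelta

# Catalog query to run against Db2 (MYSCHEMA is a placeholder schema name).
LASTUSED_QUERY = """
SELECT TABSCHEMA, TABNAME, LASTUSED
FROM SYSCAT.TABLES
WHERE TABSCHEMA = 'MYSCHEMA' AND TYPE = 'T'
"""

# Assumed value of LASTUSED when no use has been recorded yet.
NEVER_USED = date(1, 1, 1)


def stale_tables(rows, today, max_idle_days):
    """Return schema-qualified names of tables not used within max_idle_days."""
    cutoff = today - timedelta(days=max_idle_days)
    return [
        f"{schema}.{name}"
        for schema, name, lastused in rows
        if lastused < cutoff
    ]


# Hypothetical rows, shaped like the result of LASTUSED_QUERY.
sample_rows = [
    ("MYSCHEMA", "COUNTRY_CODES", date(2024, 5, 20)),
    ("MYSCHEMA", "OLD_LOOKUP", date(2019, 3, 1)),
    ("MYSCHEMA", "NEVER_TOUCHED", NEVER_USED),
]

stale = stale_tables(sample_rows, today=date(2024, 6, 1), max_idle_days=365)
print(stale)
```

Because LASTUSED is a date for the whole table, this only narrows down clean-up candidates; it says nothing about which rows were read.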

Related

Cannot truncate the tables in the landing area after transferring data

I have 2 schemas with exactly the same 15 tables in Postgres. Data is inserted into the first schema every 3 hours, then the data needs to be transferred into the second schema.
After that, the tables of the first schema need to be truncated (the landing area needs to be empty). I wrote a trigger that transfers the data into the second schema after each insert into the first schema.
But how should the tables of the first schema be truncated?
I have already searched and tried two ways, but neither of them works.
1. I put the TRUNCATE command after all the (insert into... conflict on...) statements in the same trigger function that transfers data from the first schema into the second schema. This doesn't work.
2. I made another trigger that fires after insert into (or update of) the tables of the second schema. This also doesn't work. Why doesn't this one work? If the data is already inserted into second_schema, it should be possible to truncate first_schema. It is not in use by an active query.
Error->cannot TRUNCATE "table1" because it is being used by active queries in this session
What should I do? I need to transfer data to the second schema after every insert into the first schema, and then truncate all 15 tables of the first schema.
The tables of the first schema should be empty for each new insert.

Dropping tables in a particular schema after X number of days from table creation date

I have a schema specifically for temporary tables in Redshift. As creating a lot of tables eventually takes up a lot of space, I would like to know the following:
Is there a way to automate deletion of tables in that schema X days (let's say 30 days) after each table's creation date?
Any articles on the above question I can refer to?
Thanks.
You could start with the question "Is there any way to find table creation date in redshift?"
You can first collect the output to a temporary table and then run something that DROPs tables whose age is over your threshold, or you can do it in one step.
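As a rough sketch of the second step, the snippet below turns a list of (table name, creation time) pairs into DROP statements for everything older than the threshold. It assumes the creation times were already collected as described in the linked question; the schema name temp_schema, the function name drop_statements and the sample tables are hypothetical.

```python
from datetime import datetime, timedelta


def drop_statements(tables, now, max_age_days, schema="temp_schema"):
    """Build DROP statements for tables created more than max_age_days ago.

    tables is an iterable of (table_name, created_at) pairs.
    """
    cutoff = now - timedelta(days=max_age_days)
    return [
        f'DROP TABLE IF EXISTS "{schema}"."{name}";'
        for name, created_at in tables
        if created_at < cutoff
    ]


# Hypothetical output of the "find table creation date" step.
collected = [
    ("tmp_report_a", datetime(2024, 1, 1)),
    ("tmp_report_b", datetime(2024, 3, 20)),
]

statements = drop_statements(collected, now=datetime(2024, 4, 1), max_age_days=30)
for statement in statements:
    print(statement)
```

A scheduled job could run the generated statements against the cluster; reviewing the list before executing it is a cheap safeguard.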

Redshift time-series table loading questions

Redshift documentation identifies time-series tables as a best practice:
http://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-time-series-tables.html
However, it doesn't address any of the following issues:
how many tables within a union-all view is reasonable - hundreds? (unanswered)
any method of writing to the union-all view and having redshift direct those inserts to the correct underlying tables? (Answer: no)
most effective method of loading underlying tables? Perhaps using firehose to insert into a staging table then periodically inserting those rows into appropriate table within union-all view? (unanswered)
any way to enable redshift to eliminate some underlying partitions (tables) when querying the union-all view if their date range is outside of a query's criteria? (Answer: No)
can redshift support dropping old tables, adding new tables and rebuilding union-all view within a transaction? (unanswered)
My situation:
100 million rows added daily, which will grow to 500 million in 3 years
12 month retention desired
Estimated 99% of all queries will hit the most recent 1-7 days
Data is written to existing table via kinesis firehose to s3 which then triggers a copy to redshift table.
My proposed solution:
Create a year of daily tables with a union all view, along with a dist_key of sensor_id (100,000+ unique values) and a sort_key of (timestamp, sensor_id).
Have firehose load into staging table
Create separate process that once an hour queries staging table to discover dates of data within table, then performs an insert into 'appropriate table' select * from staging table where timestamp = table's timestamp.
This hourly writer can probably wrap a table rename, multiple insert-selects, and table recreate in a transaction to be invisible to firehose.
Once a month drop old tables, create next month of tables, and rebuild view.
This union-all view maintenance can probably be wrapped in a transaction to avoid impacts to users.
Once a night run the vacuum analyzer.
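The hourly routing step in the proposed solution could look roughly like the sketch below. It uses an in-memory SQLite database purely as a stand-in so that the example runs anywhere; it is not Redshift code. The table names staging and readings_YYYYMMDD, the column layout and the function name route_staging are assumptions, and the staging-table rename mentioned above is left out.

```python
import sqlite3


def route_staging(conn):
    """Copy staging rows into their per-day tables, then empty staging."""
    days = [
        row[0]
        for row in conn.execute(
            "SELECT DISTINCT substr(ts, 1, 10) FROM staging ORDER BY 1"
        )
    ]
    with conn:  # commits on success, rolls back if any statement fails
        for day in days:
            table = "readings_" + day.replace("-", "")
            conn.execute(
                f"INSERT INTO {table} "
                "SELECT sensor_id, ts, value FROM staging "
                "WHERE substr(ts, 1, 10) = ?",
                (day,),
            )
        conn.execute("DELETE FROM staging")
    return days


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (sensor_id INTEGER, ts TEXT, value REAL)")
# The daily tables are assumed to exist already (created a month at a time).
for suffix in ("20240101", "20240102"):
    conn.execute(
        f"CREATE TABLE readings_{suffix} (sensor_id INTEGER, ts TEXT, value REAL)"
    )
conn.executemany(
    "INSERT INTO staging VALUES (?, ?, ?)",
    [
        (1, "2024-01-01 23:59:00", 0.5),
        (2, "2024-01-02 00:01:00", 0.7),
        (1, "2024-01-02 08:00:00", 0.9),
    ],
)
conn.commit()

routed_days = route_staging(conn)
print(routed_days)
```

Discovering the dates first keeps the number of insert-selects proportional to the days actually present in staging, which is normally one or two per hourly run.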
EDITS: added notes identifying which issues have been answered, and added some detail to the proposed solution.
Your proposed process sounds quite good! While I can't answer all your questions, here is some information:
Any method of writing to the union-all view and having redshift direct those inserts to the correct underlying tables?
Views are read-only. It is not possible to write to a view, nor is it possible to insert data while expecting Redshift to send it to an appropriate table (eg a specific table for the given day).
Any way to enable redshift to eliminate some underlying partitions (tables) when querying the union-all view if their date range is outside of a query's criteria?
Redshift will not exclude specific tables from the query, but it will avoid reading particular disk blocks through the use of Zone Maps. Each block of data written to disk is associated with a specific table and column. The block has a Zone Map, which indicates the minimum and maximum values of that field stored within the block.
If a query includes a WHERE clause, Redshift can skip blocks that do not contain relevant data. This is particularly powerful when used on the SORTKEY column, since similar ranges of data are grouped together.
Given that you are using a date as the SORTKEY, Redshift will read very few disk blocks if the query includes a WHERE clause based on that column. This is very similar to the idea of skipping tables, but it actually skips reading disk blocks.
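To make the block-skipping idea concrete, here is a toy model of zone maps in plain Python. It is only an illustration of the principle, not how Redshift is implemented: the Block class, the block size of 10 and the integer timestamps are all invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Block:
    min_ts: int  # smallest value stored in the block (the zone map)
    max_ts: int  # largest value stored in the block (the zone map)
    rows: list


def build_blocks(sorted_values, block_size):
    """Split values that are already sorted into fixed-size blocks."""
    blocks = []
    for start in range(0, len(sorted_values), block_size):
        chunk = sorted_values[start:start + block_size]
        blocks.append(Block(min(chunk), max(chunk), chunk))
    return blocks


def scan(blocks, lo, hi):
    """Return rows with lo <= ts <= hi and the number of blocks actually read."""
    blocks_read = 0
    hits = []
    for block in blocks:
        if block.max_ts < lo or block.min_ts > hi:
            continue  # the min/max range proves no row in this block can match
        blocks_read += 1
        hits.extend(v for v in block.rows if lo <= v <= hi)
    return hits, blocks_read


timestamps = list(range(100))  # sorted, like data stored in SORTKEY order
blocks = build_blocks(timestamps, block_size=10)
hits, blocks_read = scan(blocks, lo=93, hi=99)
print(len(blocks), blocks_read, hits)
```

With sorted data, a query for the most recent values touches one block out of ten here; if the same values were stored in random order, nearly every block's range would overlap the query and little could be skipped.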

Implications of using ADD COLUMN on large dataset

Docs for Redshift say:
ALTER TABLE locks the table for reads and writes until the operation completes.
My question is:
Say I have a table with 500 million rows and I want to add a column. This sounds like a heavy operation that could lock the table for a long time - yes? Or is it actually a quick operation since Redshift is a columnar db? Or does it depend on whether the column is nullable / has a default value?
I find that adding (and dropping) columns is a very fast operation even on tables with many billions of rows, regardless of whether there is a default value or it's just NULL.
As you suggest, I believe this is a feature of it being a columnar database, so the rest of the table is undisturbed. It simply creates empty (or nearly empty) column blocks for the new column on each node.
I added an integer column with a default to a table of around 65M rows in Redshift recently and it took about a second to process. This was on a dw2.large (SSD type) single node cluster.
Just remember you can only add a column to the end (right) of the table; you have to use temporary tables etc. if you want to insert a column somewhere in the middle.
Personally, I have found that rebuilding the table works best.
I do it in the following way:
Create a new table, N_OLD_TABLE
Define the datatype/compression encoding in the new table
Insert data into N_OLD_TABLE (old_columns) select (old_columns) from OLD_TABLE
Rename OLD_TABLE to OLD_TABLE_BKP
Rename N_OLD_TABLE to OLD_TABLE
This is a much faster process. It doesn't block any table, and you always have a backup of the old table in case anything goes wrong.
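The rebuild-and-rename steps can be sketched as follows. An in-memory SQLite database stands in for Redshift so the example is runnable; compression encodings do not exist there and are omitted. The columns id, name and status are hypothetical, with status playing the role of a new column placed in the middle of the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE old_table (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO old_table VALUES (?, ?)", [(1, "a"), (2, "b")])

# Create the new table with the extra column in the desired position.
conn.execute(
    "CREATE TABLE n_old_table (id INTEGER, status TEXT DEFAULT 'new', name TEXT)"
)

# Copy only the columns that existed in the old table.
conn.execute("INSERT INTO n_old_table (id, name) SELECT id, name FROM old_table")

# Keep the original as a backup, then move the new table into its place.
conn.execute("ALTER TABLE old_table RENAME TO old_table_bkp")
conn.execute("ALTER TABLE n_old_table RENAME TO old_table")
conn.commit()

rebuilt = conn.execute(
    "SELECT id, status, name FROM old_table ORDER BY id"
).fetchall()
backup = conn.execute("SELECT id, name FROM old_table_bkp ORDER BY id").fetchall()
print(rebuilt)
print(backup)
```

Note that rows written to the old table between the copy and the renames would only exist in the backup, so writers should be paused or the steps wrapped in a transaction where the database supports it.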

Insert data from staging table into multiple, related tables?

I'm working on an application that imports data from Access to SQL Server 2008. Currently, I'm using a stored procedure to import the data individually by record. I can't go with a bulk insert or anything like that because the data is inserted into two related tables...I have a bunch of fields that go into the Account table (first name, last name, etc.) and three fields that will each have a record in an Insurance table, linked back to the Account table by the auto-incrementing AccountID that's selected with SCOPE_IDENTITY in the stored procedure.
Performance isn't very good due to the number of round trips to the database from the application. For this and some other reasons I'm planning to instead use a staging table and import the data from there. Reading up on my options for approaching this, a cursor that executes the same insert stored procedure on the data in the staging table would make sense. However it appears that cursors are evil incarnate and should be avoided.
Is there any way to insert data into one table, retrieve the auto-generated IDs, then insert data for the same records into another table using the corresponding ID, in a set-based operation? Or is a cursor my only option here?
Look at the OUTPUT clause. You should be able to add it to your INSERT statement to do what you want.
BTW, if you need to output columns into the second table that weren't inserted into the first one, then use MERGE instead of INSERT (as suggested in the comment to the original question) as its OUTPUT clause supports referencing other columns from the source table(s). Otherwise, keeping it with an INSERT is more straightforward, and it does give you access to the inserted identity column.
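The sketch below shows the set-based shape of such an import using an in-memory SQLite database so that it runs anywhere. It is not the SQL Server OUTPUT clause: as a stand-in, it carries a staging key into the Account table and recovers the generated IDs with a join, whereas in SQL Server OUTPUT (or MERGE with OUTPUT) can return that mapping directly. All table and column names here (staging_id, ins1 to ins3, provider) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE staging (
        staging_id INTEGER PRIMARY KEY,
        first_name TEXT, last_name TEXT,
        ins1 TEXT, ins2 TEXT, ins3 TEXT
    );
    CREATE TABLE account (
        account_id INTEGER PRIMARY KEY AUTOINCREMENT,
        staging_id INTEGER,  -- carried over only to map rows back to staging
        first_name TEXT, last_name TEXT
    );
    CREATE TABLE insurance (
        insurance_id INTEGER PRIMARY KEY AUTOINCREMENT,
        account_id INTEGER,
        provider TEXT
    );
    INSERT INTO staging VALUES
        (10, 'Ann', 'Lee', 'A1', 'A2', 'A3'),
        (20, 'Bob', 'Ray', 'B1', 'B2', 'B3');
    """
)

with conn:  # commits on success, rolls back on error
    # One statement inserts every account; IDs are generated by the database.
    conn.execute(
        "INSERT INTO account (staging_id, first_name, last_name) "
        "SELECT staging_id, first_name, last_name FROM staging"
    )
    # One statement inserts three insurance rows per account, no cursor needed.
    conn.execute(
        """
        INSERT INTO insurance (account_id, provider)
        SELECT a.account_id, s.ins1
          FROM staging s JOIN account a ON a.staging_id = s.staging_id
        UNION ALL
        SELECT a.account_id, s.ins2
          FROM staging s JOIN account a ON a.staging_id = s.staging_id
        UNION ALL
        SELECT a.account_id, s.ins3
          FROM staging s JOIN account a ON a.staging_id = s.staging_id
        """
    )

per_account = conn.execute(
    "SELECT a.first_name, COUNT(*) FROM insurance i "
    "JOIN account a ON a.account_id = i.account_id "
    "GROUP BY a.account_id ORDER BY a.account_id"
).fetchall()
print(per_account)
```

Either way, the whole import becomes two statements regardless of how many records are in the staging table, which removes the per-record round trips.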
I have been experimenting with inserting multiple records into related tables using data binding, so try this!
Hopefully this is helpful. Follow the link "How to insert record into related tables" for more information.