Storing time series data in redshift - amazon-redshift

I'm thinking through a Redshift schema to store time series data minute by minute. When thinking about the problem in other column-oriented databases (like Cassandra), I was going to store the record to track as the row, and each column would be a period of time. This isn't possible in Redshift because of the 1,600-column maximum. What would be the proper way to store this?

I would just store the measurement as a single column, with one row per timestamp:
CREATE TABLE time_series_data (
    data_timestamp timestamp SORTKEY DISTKEY,
    data_measurement float
);

Related

Dropping tables in a particular schema after X number of days from table creation date

I have a schema specifically for temporary tables in Redshift. Eventually, since creating a lot of tables takes a lot of space, I would like to know the following:
Is there a way to automate deletion of tables in that schema after X days (let's say 30 days) from the table's creation date?
Any articles on the above question that I can refer to?
Thanks.
You could start with the question "Is there any way to find table creation date in redshift?"
You can first collect the output into a temporary table and then run something that DROPs tables older than your threshold, or you can do it in one step.
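A minimal sketch of that approach, assuming your Redshift release exposes the table creation time through PG_CLASS_INFO (relcreationtime); the schema name temp_tables and the 30-day threshold are placeholders:
-- Generate DROP statements for tables older than 30 days; review the output,
-- then run the generated statements (or have a scheduled job execute them)
SELECT 'DROP TABLE temp_tables.' || c.relname || ';' AS drop_stmt
FROM pg_class_info c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'temp_tables'
  AND c.relcreationtime < dateadd(day, -30, getdate());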

Cloudant/Db2 - How to determine if a database table row was read from?

I have two databases - Cloudant and IBM Db2. I have a table in each of these databases that holds static data that is only read from and never updated. These were created a long time ago and I'm not sure if they are used today, so I wish to do a clean-up.
I want to determine whether these tables, or rows from these tables, are still being read from.
Is there a way to record the read timestamp (or at least know if it is simply accessed like a dirty bit) on a row of the table when it is read from?
OR
Record the read timestamp of the entire table (if any record from it is accessed)?
Db2 has the SYSCAT.TABLES.LASTUSED system catalog column, which records when the whole table was last used by DML statements.
There is no way to track each table row read access.
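A minimal check for Db2 for LUW, with a placeholder schema name (LASTUSED is updated asynchronously, so treat it as approximate):
-- Show the date each table in the schema was last touched by DML
SELECT TABSCHEMA, TABNAME, LASTUSED
FROM SYSCAT.TABLES
WHERE TABSCHEMA = 'MYSCHEMA'
ORDER BY LASTUSED;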

How to create multiple column partitions for postgresql db, one for time range and one for specific sensor ID?

I have one application to store and query time series data from multiple sensors. Sensor readings from multiple months need to be stored, and we also need to add more and more sensors in the future. So I need to consider scalability in two dimensions: time and sensor ID. We are using a PostgreSQL db for data storage. In addition, to simplify the data query layer design, we want to use one table name to query all the data.
In order to improve query efficiency and scalability, I am considering using partitions for this use case, and I want to create the partitions based on two columns: RANGE for the event time of the readings, and VALUE for the sensor ID. So under the partitioned table, I want to get sub-tables such as sensor_readings_1week_Oct_2020_ID1, sensor_readings_2week_Oct_2020_ID1, sensor_readings_1week_Oct_2020_ID2. I know PostgreSQL supports multi-column partitioning, but in most examples I can only see RANGE used for all the columns. One example is below. How can I create multi-column partitions, one for the time RANGE and another based on the specific sensor ID? Thanks!
CREATE TABLE p1 PARTITION OF tbl_range
    FOR VALUES FROM (1, 110, 50) TO (20, 200, 200);
Or are there better solutions besides partitions for this use case?
Two-level partitioning is a good solution for my use case. It improves the efficiency a lot.
CREATE TABLE sensor_readings (
    id bigserial NOT NULL,
    create_date_time timestamp NULL DEFAULT now(),
    value int8 NULL
) PARTITION BY LIST (id);

CREATE TABLE sensor_readings_id1
    PARTITION OF sensor_readings
    FOR VALUES IN (111) PARTITION BY RANGE (create_date_time);

CREATE TABLE sensor_readings_id1_wk1
    PARTITION OF sensor_readings_id1
    FOR VALUES FROM ('2020-10-01 00:00:00') TO ('2020-10-20 00:00:00');
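As a usage sketch against the definitions above: rows inserted through the parent table land in the matching leaf partition, and queries against the parent are pruned on both keys.
-- Routed automatically to sensor_readings_id1_wk1 (id 111, timestamp in range)
INSERT INTO sensor_readings (id, create_date_time, value)
VALUES (111, '2020-10-05 08:00:00', 42);

-- Query the parent table; PostgreSQL prunes partitions by sensor ID and time range
SELECT *
FROM sensor_readings
WHERE id = 111
  AND create_date_time >= '2020-10-01'
  AND create_date_time <  '2020-10-10';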

Redshift time-series table loading questions

Redshift documentation identifies time-series tables as a best practice:
http://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-time-series-tables.html
However, it doesn't address any of the following issues:
how many tables within a union-all view is reasonable - hundreds? (unanswered)
any method of writing to the union-all view and having Redshift direct those inserts to the correct underlying tables? (Answer: no)
most effective method of loading the underlying tables? Perhaps using Firehose to insert into a staging table, then periodically inserting those rows into the appropriate table within the union-all view? (unanswered)
any way to enable Redshift to eliminate some underlying partitions (tables) when querying the union-all view if their date range is outside of a query's criteria? (Answer: no)
can Redshift support dropping old tables, adding new tables, and rebuilding the union-all view within a transaction? (unanswered)
My situation:
100 million rows added daily, which will grow to 500 million in 3 years
12 month retention desired
Estimated 99% of all queries will hit the most recent 1-7 days
Data is written to the existing table via Kinesis Firehose to S3, which then triggers a COPY into the Redshift table.
My proposed solution:
Create a year of daily tables with a UNION ALL view, along with a DISTKEY of sensor_id (100,000+ unique values) and a SORTKEY of (timestamp, sensor_id).
Have Firehose load into a staging table.
Create a separate process that, once an hour, queries the staging table to discover which dates are present, then performs an INSERT INTO <appropriate table> SELECT * FROM the staging table WHERE the timestamp matches that table's date.
This hourly writer can probably wrap a table rename, multiple insert-selects, and a table recreate in a transaction so that it is invisible to Firehose.
Once a month, drop old tables, create the next month of tables, and rebuild the view.
This union-all view maintenance can probably be wrapped in a transaction to avoid impacts to users (a sketch follows this list).
Once a night, run VACUUM and ANALYZE.
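A minimal sketch of that monthly maintenance step, with hypothetical table and view names (events_YYYYMMDD for the daily tables, events_all for the UNION ALL view); the view is rebuilt before the old table is dropped so the drop does not hit a view dependency:
BEGIN;

-- Create the next period's table with the same structure (copies DISTKEY/SORTKEY)
CREATE TABLE events_20180201 (LIKE events_20180131);

-- Rebuild the view without the table that is about to be dropped
CREATE OR REPLACE VIEW events_all AS
          SELECT * FROM events_20170202
UNION ALL SELECT * FROM events_20170203
-- ... one SELECT per remaining daily table ...
UNION ALL SELECT * FROM events_20180201;

-- The old table no longer has a dependent view, so it can be dropped
DROP TABLE events_20170201;

COMMIT;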
EDITS: added notes identifying which issues have been answered, and added some detail to the proposed solution.
Your proposed process sounds quite good! While I can't answer all your questions, here is some information:
Any method of writing to the union-all view and having redshift direct those inserts to the correct underlying tables?
Views are read-only. It is not possible to write to a view, nor is it possible to insert data while expecting Redshift to send it to an appropriate table (e.g. a specific table for the given day).
Any way to enable redshift to eliminate some underlying partitions (tables) when querying the union-all view if their date range is outside of a query's criteria?
Redshift will not exclude specific tables from the query, but it will avoid reading particular disk blocks through the use of Zone Maps. Each block of data written to disk is associated with a specific table and column. The block has a Zone Map, which indicates the minimum and maximum values of that field stored within the block.
If a query includes a WHERE clause, Redshift can skip blocks that do not contain relevant data. This is particularly powerful when used on the SORTKEY column, since similar ranges of data are grouped together.
Given that you are using a date as the SORTKEY, Redshift will read very few disk blocks if the query includes a WHERE clause based on that column. This is very similar to the idea of skipping tables, but it actually skips reading disk blocks.
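As a rough illustration, assuming the proposed SORTKEY of (timestamp, sensor_id) and a hypothetical UNION ALL view named events_all whose timestamp column is called event_timestamp:
-- Only blocks whose zone maps overlap the requested day are read from each
-- underlying daily table referenced by the view
SELECT sensor_id, COUNT(*)
FROM events_all
WHERE event_timestamp >= '2018-02-01'
  AND event_timestamp <  '2018-02-02'
GROUP BY sensor_id;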

Redshift select * vs select single column

I'm having the following Redshift performance issue:
I have a table with ~ 2 billion rows, which has ~100 varchar columns and one int8 column (intCol). The table is relatively sparse, although there are columns which have values in each row.
The following query:
select colA from tableA where intCol = '111111';
returns approximately 30 rows and runs relatively quickly (~2 mins)
However, the query:
select * from tableA where intCol = '111111';
takes an undetermined amount of time (gave up after 60 mins).
I know pruning the columns in the projection is usually better but this application needs the full row.
Questions:
Is this just a fundamentally bad thing to do in Redshift?
If not, why is this particular query taking so long? Is it related to the structure of the table somehow? Is there some Redshift knob to tweak to make it faster? I haven't yet messed with the distkey and sortkey on the table, but it's not clear that those should matter in this case.
The main reason the first query is faster is that Redshift is a columnar database. A columnar database stores table data per column, writing the values of one column into the same blocks on storage. This behavior is different from a row-based database like MySQL or PostgreSQL. Because of this, since the first query selects only the colA column, Redshift does not need to access the other columns at all, while the second query accesses all ~100 columns, causing a huge amount of disk access.
To improve the performance of the second query, you may need to set the sort key to the intCol column (the column used in the WHERE clause). By setting the sort key on a column, that column's data is stored in sorted order on disk, which reduces the cost of disk access when fetching records with a condition on that column.
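A minimal sketch of one way to apply that, using the names from the question: a CTAS deep copy into a new table with intCol as the sort key (newer Redshift releases also support ALTER TABLE ... ALTER SORTKEY on an existing table):
-- Deep copy into a new table sorted on the filter column, so equality
-- predicates on intCol can use zone maps to skip blocks
CREATE TABLE tableA_sorted
SORTKEY (intCol)
AS SELECT * FROM tableA;

-- After validating the copy:
-- ALTER TABLE tableA RENAME TO tableA_old;
-- ALTER TABLE tableA_sorted RENAME TO tableA;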