Union with Redshift native tables and external tables (Spectrum) - amazon-redshift

If I have a view that contains a union between a native table and an external table, like so (pseudocode):
create view vwPageViews as
select * from PageViews
union all
select * from PageViewsHistory
with no schema binding; -- late-binding view, required when referencing Spectrum external tables
PageViews holds the last 2 years of data; the external table holds data older than 2 years.
If a user selects from the view with a filter for the last 6 months, how does RS Spectrum handle it? Does it read the entire external table even though none of its rows will be returned (and accordingly cost us money for all of it)? (Assume the S3 files are Parquet-based.)
For example:
select * from vwPageViews where MyDate >= '2021-01-01';
What's the best approach for querying both current and historical (cold) data using RS and Spectrum? Thanks!

How this will happen on Spectrum will depend on whether or not you have provided partitions for the data in S3. Without partitions (and a WHERE clause on the partition) the Spectrum engine will have to read every file to determine if the needed data is in any of them. The cost of this will depend on the number and size of the files AND what format they are in. (CSV is more expensive than Parquet, for example.)
The way around this is to partition the data in S3 and to have a WHERE clause on the partition value. This will exclude files from needing to be read when they don't match on the partition value.
The rub is in providing the WHERE clause for the partition, as this will likely be less granular than the date or timestamp you use in your base data. For example, if you partition on YearMonth (YYYYMM) and want a day-level WHERE clause, you will need two parts to the WHERE clause - WHERE date_col >= '2015-07-12' AND part_col >= 201507. How to produce both WHERE conditions will depend on your solution around Redshift.
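For illustration, a minimal sketch of a partitioned external table and the two-part predicate described above (table, column, and S3 location names are hypothetical):
-- Hypothetical external table partitioned by year-month (YYYYMM).
create external table spectrum.pageviews_history (
    mydate  date,
    page    varchar(500),
    user_id bigint
)
partitioned by (yearmonth int)
stored as parquet
location 's3://my-bucket/pageviews_history/';
-- (each year-month prefix is then registered with ALTER TABLE ... ADD PARTITION)

-- Querying with both the base-data predicate and the partition predicate
-- lets Spectrum skip non-matching partitions (and their files) entirely.
select *
from spectrum.pageviews_history
where mydate >= '2021-01-01'
  and yearmonth >= 202101;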

Related

PostgreSQL Table size and partition consideration

I am working on a use case where the initial data load for a name table in a PostgreSQL DB will be around 650 million rows with an average row size of 0.6 KB, bringing the table size up to roughly 400 GB. After that there could be up to 20,000 inserts or updates on a daily basis.
I am new to PostgreSQL and want to check whether I should consider partitioning given the table size.
Updating with some information from the comments section:
It is an OLTP application for identity resolution of business names. This is one specific table where all the business names are stored along with metadata such as Start Date and End Date, and any incoming name is matched against existing names to identify whether it is related to another business. This table is updated throughout the day using batch files from different data sources.
Also, we are not planning to expire or remove any data from this table.
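For illustration only, declarative range partitioning on a date column would look roughly like this in PostgreSQL (hypothetical table and column names; whether it helps depends on whether queries actually filter on that column):
-- Hypothetical parent table, range partitioned on start_date.
CREATE TABLE business_name (
    id         bigint,
    name       text,
    start_date date NOT NULL,
    end_date   date
) PARTITION BY RANGE (start_date);

-- One partition per year; create more as data arrives.
CREATE TABLE business_name_2023 PARTITION OF business_name
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
CREATE TABLE business_name_2024 PARTITION OF business_name
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');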

(Databricks) last_value function usage on huge dataset

For a table which contains billions of rows (about 10 million added each day) we are trying to "fix/clean" the source data. For missing rows with measures x, y, z we copy them from the past. To do it right we are using the last_value function at a specified aggregation level, combined via a CROSS JOIN with a simple table which contains the missing days.
It worked in a simple POC where we had fewer than 1 million rows. On the production dataset described above, time and performance are pretty bad. We cannot simply extend the cluster; we need to try a different approach.
Any thoughts on what we can use instead of SQL (a simple last_value)?
Everything is calculated in an Azure Databricks environment.
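For reference, the pattern being described looks roughly like this in Spark SQL (all table and column names are hypothetical; last_value(expr, true) skips nulls so each gap is filled from the most recent earlier value):
-- missing_days: one row per day to fill; dim_keys: the aggregation level;
-- measures: the sparse source data with measure x (y and z are analogous).
SELECT d.day,
       k.key_id,
       last_value(m.x, true) OVER (PARTITION BY k.key_id
                                   ORDER BY d.day
                                   ROWS BETWEEN UNBOUNDED PRECEDING
                                            AND CURRENT ROW) AS x_filled
FROM missing_days d
CROSS JOIN dim_keys k
LEFT JOIN measures m
       ON m.key_id = k.key_id
      AND m.day = d.day;
How well this scales depends largely on how the window partitions line up with the data distribution; the CROSS JOIN itself multiplies the row count before the window is even applied.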

Is there a technique with timescaledb to delete rows to reduce the frequency of older timescale data?

I'm storing a number of rows in a hypertable. The table size is growing quite large now even in its current test configuration.
I'd like to reduce the frequency of the data from, say, once every 5 seconds to once every 60 seconds for data older than a week, by deleting a number of these older records.
Can anyone recommend an approach for doing so, or perhaps a better approach that better fits with timescaledb design?
So one of the next releases will have a built-in feature for data retention policies around continuous aggregations, so that you can define a continuous aggregation policy that rolls up secondly data into minutely data, then drops the secondly data that's older than some time period.
(That capability doesn't exist today with continuous aggs, but will very shortly. Right now the best approach is either to have some cron job that deletes old data, or one that copies from one table to a second while aggregating, then calls drop_chunks on the first table.)
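For reference, a continuous aggregate that rolls the 5-second readings up to one-minute buckets looks roughly like this (TimescaleDB 2.x syntax; the table and column names are hypothetical):
CREATE MATERIALIZED VIEW minute_readings
WITH (timescaledb.continuous) AS
SELECT time_bucket('60 seconds', time) AS bucket,
       avg(my_reading) AS avg_reading
FROM five_second_table
GROUP BY time_bucket('60 seconds', time);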
Ok, I've read 2 minutes of timescaledb documentation, so I'm an expert, right? Here's what I propose:
You already have a table (I'll call it the business table) and a hypertable with raw 5-second data in it.
Create a second hypertable with the same columns as the first hypertable.
Insert into the 2nd hypertable using a 60-second windowing function and the average, minimum, or maximum values for your readings data (you have to decide which aggregation function is meaningful for your case). The insert SQL looks something like:
INSERT INTO minute_table (timestamp, my_reading)
(
    -- Roll raw readings older than a week up into 60-second buckets.
    SELECT time_bucket('60 seconds', time) AS the_minute, avg(my_raw_reading)
    FROM five_second_table
    WHERE time < (now() - interval '1 week')
    GROUP BY the_minute
);
Next, delete from the 5-second hypertable any rows whose timestamps fall within the time ranges already covered by the 60-second hypertable.
Finally, schedule something like this to run every week.
Sorry I'm not fluent in all the timescaledb functions but this should get you started on the 'heavy lift' of manually aggregating up from 5-second to 60-second samples.
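A hedged sketch of that delete step, reusing the tables and columns from the INSERT above:
-- Remove raw rows whose 60-second bucket has already been written to minute_table.
DELETE FROM five_second_table f
WHERE EXISTS (
    SELECT 1
    FROM minute_table m
    WHERE m."timestamp" = time_bucket('60 seconds', f.time)
);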
Take a look at Data Retention.
For example:
SELECT drop_chunks(interval '24 hours', 'conditions');
This will drop all chunks from the hypertable 'conditions' that only include data older than this duration, and will not delete any individual rows of data in chunks.

Oracle Gather Statistics after Partition Exchange

Database: Oracle 12c
I have a process that selects data from a Fact table, summarizes it and pushes it to a Summary table.
The Summary table is Range partitioned (Trade Date) and List subpartitioned (File Id).
The process picks up data from the Fact table (where file_id = <> for all Trade Dates), summarizes it in a temp table and uses Partition Exchange to move the data from the temp table to one of the subpartitions in the Summary table (as the process works at a File Id level).
The Summary table is completely refreshed every day (100% of the data is exchanged).
Before the data is exchanged at the subpartition level, statistics are gathered and exchanged along with the data.
After the process is completed, we run dbms_stats.gather_table_stats at the partition level (in a for loop, once per partition) with granularity set to "APPROX_GLOBAL AND PARTITION".
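For reference, that per-partition gather call looks roughly like this (schema, table, and partition names are hypothetical):
BEGIN
  -- Gather stats for one partition and derive approximate global stats
  -- from the partition-level statistics.
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname     => 'MY_SCHEMA',
    tabname     => 'SUMMARY_TABLE',
    partname    => 'P_20210101',
    granularity => 'APPROX_GLOBAL AND PARTITION');
END;
/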
Even though we collect stats at the global level, user_tab_statistics shows STALE_STATS = 'YES' for the summary table; however, partition and subpartition stats are available.
When we run a query against the summary table (for a date range of 3 years), the query spins for a long time, spiking the CPU to 90%, but never returns any data.
I checked the explain plan for the query; the cardinality shows as 1.
I read about incremental stats, but it seems incremental gathering works best when only a few partitions change; it may not be the best option in my case, where the data across all partitions changes completely.
I'm looking for a strategy to gather statistics on the summary table; I don't want to run a full stats gather.
Thanks.

Executing query in chunks on Greenplum

I am trying to create a way to convert bulk date queries into incremental queries. For example, a query might have a where condition specified as
WHERE date > now()::date - interval '365 days' and date < now()::date
This will fetch a year's data if executed today. Now if the same query is executed tomorrow, 365 days of data will again be fetched. However, I already have the last 364 days of data from the previous run. I just want a single day's data to be fetched and a single day's data to be deleted from the system, so that I end up with 365 days of data with better performance. This data is to be stored in a separate temp table.
To achieve this, I create an incremental query, which will be executed in the next run. However, deleting the single day's data is proving tricky when the "date" column does not feature in the SELECT clause but only in the WHERE condition, as the temp table schema will not have the "date" column.
So I thought of executing the bulk query in chunks and assigning an ID to each chunk. This way, I can delete a chunk and add a chunk while the other data remains unaffected.
Is there a way to achieve this in Postgres or Greenplum, like some built-in functionality? I went through the whole documentation but could not find any.
Also, if not, is there any better solution to this problem?
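A rough sketch of the chunk-ID idea above (hypothetical table and column names): tag each load with the day it covers even though "date" itself isn't selected, then roll the window forward by dropping the oldest chunk.
-- Load one new chunk per run, keyed by the day it covers.
INSERT INTO temp_results (chunk_id, col_a, col_b)
SELECT now()::date - 1 AS chunk_id, col_a, col_b
FROM source_table
WHERE date >= now()::date - 1 AND date < now()::date;

-- Drop the chunk that has fallen out of the 365-day window.
DELETE FROM temp_results
WHERE chunk_id < now()::date - 365;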
I think this is best handled with something like an aggregates table (I assume the issue is that you have heavy aggregates to compute over a lot of data). This doesn't necessarily cause normalization problems (and data warehouses often denormalize anyway). In this regard the aggregates you need can be stored per day, so you are able to cut down to one record per day for the closed data, plus the non-closed data. Keeping the aggregates to data which cannot change is what is required to avoid the usual insert/update anomalies that normalization prevents.
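A minimal sketch of that daily-aggregates idea, assuming hypothetical table and column names and that a count and a sum are the aggregates needed:
-- One summarized row per closed day.
CREATE TABLE daily_agg (
    agg_date date,
    row_cnt  bigint,
    amt_sum  numeric
);

-- Each run appends only the most recently closed day...
INSERT INTO daily_agg (agg_date, row_cnt, amt_sum)
SELECT date, count(*), sum(amount)
FROM source_table
WHERE date = now()::date - 1
GROUP BY date;

-- ...and prunes days that have left the rolling 365-day window.
DELETE FROM daily_agg
WHERE agg_date < now()::date - 365;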