How to update numRows on Amazon Spectrum based on a count from a query - amazon-redshift

Hi, I'm trying to automate a workflow in Airflow that appends rows to an external Spectrum table daily. After each append I need to alter numRows on the Spectrum table, setting it to the count of the existing table plus the count of the new rows I am appending.
CREATE EXTERNAL TABLE spectrum.my_external_table
(
id INTEGER,
barkdata_timestamp timestamp,
created_at timestamp,
updated_at timestamp
)
PARTITIONED BY (asofdate timestamp)
STORED AS PARQUET
LOCATION 's3://<SOME BUCKET>/manifest'
TABLE PROPERTIES ('numRows'='<some number>');
ALTER TABLE spectrum.my_external_table
ADD PARTITION (asofdate='2021-03-03 00:00:00') LOCATION 's3://<SOME BUCKET>/asofdate=2021-03-03 00:00:00/';
ALTER TABLE spectrum.my_external_table
SET TABLE PROPERTIES ('numRows'='<HELP HERE should be count(*) from my_external_table + count(*) from table_I_unloaded_to_s3 where asofdate='2021-03-03 00:00:00'>');
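As far as I know, SET TABLE PROPERTIES only takes a literal value, so the sum has to be computed outside the statement, for example in the Airflow task itself. A minimal Python sketch, assuming the two counts were already fetched with separate SELECT count(*) queries (not shown); the helper name build_numrows_statement is hypothetical:

```python
def build_numrows_statement(table: str, existing_rows: int, appended_rows: int) -> str:
    """Build the ALTER TABLE statement that sets numRows to the combined count.

    `table` is the schema-qualified external table name. The two counts are
    assumed to come from earlier count(*) queries run by the calling task.
    """
    if existing_rows < 0 or appended_rows < 0:
        raise ValueError("row counts must be non-negative")
    total = existing_rows + appended_rows
    # The property value is a quoted literal, so the total is rendered as text.
    return (
        f"ALTER TABLE {table} "
        f"SET TABLE PROPERTIES ('numRows'='{total}');"
    )


statement = build_numrows_statement("spectrum.my_external_table", 1000, 250)
```

The resulting string can then be executed against Redshift by whatever operator or hook the DAG already uses.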

Postgres partitioned table query scans all partitions instead of one

I have a table
CREATE TABLE IF NOT EXISTS prices
(shop_id integer not null,
good_id varchar(24) not null,
eff_date timestamp with time zone not null,
price_wholesale numeric(20,2) not null default 0 constraint chk_price_ws check (price_wholesale >= 0),
price_retail numeric(20,2) not null default 0 constraint chk_price_rtl check (price_retail >= 0),
constraint pk_prices primary key (shop_id, good_id, eff_date)
)partition by list (shop_id);
CREATE TABLE IF NOT EXISTS prices_1 partition of prices for values in (1);
CREATE TABLE IF NOT EXISTS prices_2 partition of prices for values in (2);
CREATE TABLE IF NOT EXISTS prices_3 partition of prices for values in (3);
CREATE TABLE IF NOT EXISTS prices_4 partition of prices for values in (4);
...
CREATE TABLE IF NOT EXISTS prices_100 partition of prices for values in (100);
I'd like to delete outdated prices. The table is huge, so I try to delete records in small batches.
If I use a loop with the variable v_shop_id, then after 6 iterations Postgres starts scanning all partitions. I simplified the code; the real code has an inner loop over shop_id.
If I use the loop without the variable (specifying the value explicitly), Postgres doesn't scan all partitions.
Here is the code with the variable:
do $$
declare
v_shop_id integer;
v_date_time timestamp with time zone := now();
begin
v_shop_id := 8;
for step in 1..10 loop
delete from prices p
using (select pd.good_id, max(pd.eff_date) as mxef_dt
from prices pd
where pd.eff_date < v_date_time - interval '30 days'
and pd.shop_id = v_shop_id
group by pd.good_id
having count(1)>1
limit 40000) pfd
where p.eff_date <= pfd.mxef_dt
and p.shop_id = v_shop_id
and p.good_id = pfd.good_id;
end loop;
end; $$ language plpgsql;
How can I force Postgres to scan only the one desired partition?

Create Partitioned Table from Partitioned Table - Postgresql

Let's say I have a partitioned table A.
create table A (
col1 timestamp,
col2 int
)
partition by range (col2);
create table partition1 partition of A for values from (minvalue) to (y);
create table partition2 partition of A for values from (y) to (maxvalue);
copy A from '/some/csv/file';
The above code gives me a partitioned table A with the data populated. I want to create another table using:
create table B as (
select *,
col2 * 3 as col3 -- Add a new column
from A
);
Can I save A as a partitioned CSV/'insert_format' file?
Is it possible for B to be partitioned the same way A is?

Sum and average total columns in PostgreSQL

I'm using this query to find duplicate dates, but I'm not sure how to sum the values for each duplicate date, average them, and remove the duplicate dates.
DB Schema
date_time
datapoint_1
datapoint_2
SQL Query
SELECT date_time, COUNT(date_time)
FROM MYTABLE
GROUP BY date_time
HAVING COUNT(date_time) > 1
ORDER BY COUNT(date_time)
I would create a new table to replace the old one. That is easier and might even perform better:
CREATE TABLE mytable2 (LIKE mytable);
INSERT INTO mytable2 (date_time, datapoint_1, datapoint_2)
SELECT m.date_time, avg(m.datapoint_1), avg(m.datapoint_2)
FROM mytable AS m
GROUP BY m.date_time;
Then you can drop mytable and rename mytable2 to replace it.
To prevent new rows from creating duplicates, you could change the way you insert data:
-- to keep track of counts
ALTER TABLE mytable ADD numval integer DEFAULT 1;
-- to prevent duplicates
ALTER TABLE mytable ADD UNIQUE (date_time);
-- to insert new rows
INSERT INTO mytable (date_time, datapoint_1, datapoint_2)
VALUES ('2021-06-30', 42.0, -34.9)
ON CONFLICT (date_time)
DO UPDATE SET numval = mytable.numval + 1,
datapoint_1 = mytable.datapoint_1 + excluded.datapoint_1,
datapoint_2 = mytable.datapoint_2 + excluded.datapoint_2;
-- to select the averages
SELECT date_time,
datapoint_1 / numval AS datapoint_1,
datapoint_2 / numval AS datapoint_2
FROM mytable;
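The bookkeeping behind the numval approach above (keep running sums plus a count, divide only when reading) can also be sketched outside the database. This is a minimal Python illustration, with a plain dict standing in for mytable; the function names upsert and averages are hypothetical:

```python
def upsert(table, date_time, datapoint_1, datapoint_2):
    """Mimic INSERT ... ON CONFLICT (date_time) DO UPDATE on a dict keyed by date_time."""
    row = table.get(date_time)
    if row is None:
        # First row for this date_time: numval starts at 1, like the column default.
        table[date_time] = {
            "numval": 1,
            "datapoint_1": datapoint_1,
            "datapoint_2": datapoint_2,
        }
    else:
        # Conflict: bump the count and accumulate the sums.
        row["numval"] += 1
        row["datapoint_1"] += datapoint_1
        row["datapoint_2"] += datapoint_2


def averages(table):
    """Mimic the final SELECT: divide each stored sum by numval."""
    return {
        date_time: (row["datapoint_1"] / row["numval"], row["datapoint_2"] / row["numval"])
        for date_time, row in table.items()
    }


mytable = {}
upsert(mytable, "2021-06-30", 42.0, -34.5)
upsert(mytable, "2021-06-30", 40.0, -35.5)
result = averages(mytable)
```

Note that with this scheme the stored datapoint columns hold sums, not averages, so every reader has to divide by numval.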
When you use GROUP BY, you can also use aggregate functions to reduce multiple rows to a single one (COUNT, which you used, is one such function). In your case the query would be:
SELECT date_time, avg(datapoint_1), avg(datapoint_2)
FROM MYTABLE
GROUP BY date_time
For every distinct date_time you will get a single row with the average of datapoint_1 and datapoint_2.

PostgreSQL insert when no row exists

I have a table in postgresql named 'views', containing information about users viewing a classified ad.
CREATE TABLE views (
view_id uuid DEFAULT gen_random_uuid() NOT NULL,
user_id uuid NOT NULL,
ad_id uuid NOT NULL,
timestamp timestamp with time zone DEFAULT now() NOT NULL
);
I want to be able to insert a row for a specific user/ad ONLY when there is no other row 'younger' than 5 minutes. So I want to check if there already is a row with the user ID and the ad ID and where the timestamp is less than 5 minutes old. If so, I want to do something like INSERT... ON CONFLICT DO NOTHING.
Is this possible to do with a UNIQUE constraint? Or do I need a CHECK constraint, or do I have to do a separate query first every time I insert this?
You have to do a lookup first, but you can do the lookup and the insert in one statement using something like this:
with invars (user_id, ad_id) as (
values (?, ?) -- Pass your two ids in
)
insert into views (user_id, ad_id)
select user_id, ad_id
from invars i
where not exists (select 1
from views
where (user_id, ad_id) = (i.user_id, i.ad_id)
and "timestamp" >= now() - interval '5 minutes');
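To see the insert-if-no-recent-row pattern behave end to end, here is a small Python sketch. It is only a stand-in: it uses an in-memory SQLite database instead of PostgreSQL, text ids instead of uuid, and integer epoch seconds passed in by the caller instead of now(). The names make_views_table, insert_view_if_no_recent and the ts column are hypothetical.

```python
import sqlite3


def make_views_table(conn):
    # Simplified stand-in for the views table: text ids, integer epoch seconds.
    conn.execute(
        "CREATE TABLE views ("
        "user_id TEXT NOT NULL, ad_id TEXT NOT NULL, ts INTEGER NOT NULL)"
    )


def insert_view_if_no_recent(conn, user_id, ad_id, now_ts, window_seconds=300):
    """Insert a row unless the same user/ad already has one inside the window.

    Lookup and insert happen in one statement, as in the SQL above.
    Returns True when a row was inserted.
    """
    cur = conn.execute(
        """
        INSERT INTO views (user_id, ad_id, ts)
        SELECT ?, ?, ?
        WHERE NOT EXISTS (
            SELECT 1
            FROM views
            WHERE user_id = ? AND ad_id = ? AND ts >= ? - ?
        )
        """,
        (user_id, ad_id, now_ts, user_id, ad_id, now_ts, window_seconds),
    )
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
make_views_table(conn)
first = insert_view_if_no_recent(conn, "u1", "ad1", 1000)   # no earlier row
repeat = insert_view_if_no_recent(conn, "u1", "ad1", 1100)  # 100 s later, inside the window
later = insert_view_if_no_recent(conn, "u1", "ad1", 1400)   # 400 s after the first row
```

A single statement like this does not by itself protect against two concurrent sessions both passing the NOT EXISTS check, so under heavy concurrency some extra locking may still be needed.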

Postgres add serial column in create table as

I need to add a serial field, or an incrementing id field, to each row returned by a query.
The following code is an attempt to do what I want:
create temp table tt_Final as
SELECT
'Transaccion' = Table1.Nombrem,
Table1.number as "number",
"Id"= Serial
from Table1