Use generate_series to create a table - PostgreSQL

In Amazon Redshift, generate_series() seems to be supported on the leader node, but not on the compute nodes. Is there a way to use generate_series to create a table on the leader node, and then push it to the compute nodes?
This query runs fine, running on the leader node:
with
date_table as (select now()::date - generate_series(0, 7 * 10) as date),
hour_table as (select generate_series(0, 24) as hour),
time_table as (
    select
        date_table.date::date as date,
        extract(year from date_table.date) as year,
        extract(month from date_table.date) as month,
        extract(day from date_table.date) as day,
        hour_table.hour
    from date_table CROSS JOIN hour_table
)
SELECT *
from time_table
However, this query fails:
create table test
diststyle all
as (
    with
    date_table as (select now()::date - generate_series(0, 7 * 10) as date),
    hour_table as (select generate_series(0, 24) as hour),
    time_table as (
        select
            date_table.date::date as date,
            extract(year from date_table.date) as year,
            extract(month from date_table.date) as month,
            extract(day from date_table.date) as day,
            hour_table.hour
        from date_table CROSS JOIN hour_table
    )
    SELECT *
    from time_table
);
The only solution I can think of right now is to pull the query results into another program (e.g. Python) and then insert them into the database, but that seems hackish.
For those of you who've never used Redshift, it's a heavily modified variant of PostgreSQL, and it has lots of its own idiosyncrasies. The below query is completely valid and runs fine:
create table test diststyle all as (select 1 as a, 2 as b);
select * from test
yields:
a b
1 2
The problem stems from the difference between leader-node-only functions and compute-node functions on Redshift. I'm pretty sure it's not due to a bug in my query.

I have not found a way to use leader-node-only functions to create tables. There is not (AFAICT) any magic syntax that you can use to make them load their output back into a table.
I ended up using a numbers table to achieve a similar outcome, as sketched below. Even a huge numbers table will take up very little space on your Redshift cluster with runlength compression.
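As a rough sketch of that numbers-table workaround (the numbers table and the aliases are illustrative, not from the original post; getdate() replaces now(), which is also documented as leader-node-only):
-- Build a small digits relation from literals and cross join it with itself;
-- unlike generate_series, this runs on the compute nodes.
create table numbers diststyle all as (
    with digits as (
        select 0 as d union all select 1 union all select 2 union all select 3 union all
        select 4 union all select 5 union all select 6 union all select 7 union all
        select 8 union all select 9
    )
    select tens.d * 10 + ones.d as n
    from digits tens
    cross join digits ones
);

-- Rebuild the date/hour table from the numbers table instead of generate_series.
create table test diststyle all as (
    select
        (trunc(getdate()) - d.n)::date as date,
        extract(year from trunc(getdate()) - d.n) as year,
        extract(month from trunc(getdate()) - d.n) as month,
        extract(day from trunc(getdate()) - d.n) as day,
        h.n as hour
    from (select n from numbers where n <= 70) d
    cross join (select n from numbers where n <= 24) h
);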

Related

How to delete all rows after 1000 rows in a PostgreSQL table

I have a huge database with 15 tables.
I need to make a light version of it and keep only the first 1000 rows of each table, ordered by date descending. I tried to find out on Google how to do that, but nothing really works.
It would be perfect if there were an automated way to go through each table and leave only 1000 rows.
But if I need to do that manually for each table, that will be fine as well.
Thank you,
This looks positively awful, but maybe it's a starting point from which you can build.
with cte as (
    select mod_date, row_number() over (order by mod_date desc) as rn
    from table1
),
min_date as (
    select mod_date
    from cte
    where rn = 1000
)
delete from table1 t1
where t1.mod_date < (select mod_date from min_date)
So the solution is:
DELETE FROM "table" WHERE "date" < now() - interval '1 year';
That way it will delete all rows from the table where the date is older than 1 year.
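If you do want the automated, per-table version of the original ask (keep only the 1000 newest rows in every table), one possible sketch building on the answer above is a DO block; the table names and the mod_date column are placeholders you would need to replace:
do $$
declare
    t text;
begin
    -- Replace with the real 15 table names; mod_date stands in for each table's date column.
    foreach t in array array['table1', 'table2', 'table3']
    loop
        -- Keep the 1000 newest rows per table; if a table has fewer than 1000 rows,
        -- the subquery returns NULL and nothing is deleted.
        execute format(
            'delete from %I
             where mod_date < (select mod_date from %I
                               order by mod_date desc
                               offset 999 limit 1)', t, t);
    end loop;
end $$;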

How to reference a date range from another CTE in a where clause without joining to it?

I'm trying to write a query for Hive that uses the system date to determine both yesterday's date and the date 30 days ago. This gives me a rolling 30 days without the need to manually feed the date range to the query every time I run it.
I have that code working fine in a CTE. The problem I'm having is referencing those dates in another CTE without joining the CTEs together, which I can't do since there's no common field to join on.
I've tried various approaches but I get a "ParseException" every time.
WITH
date_range AS (
    SELECT
        CAST(from_unixtime(unix_timestamp()-30*60*60*24,'yyyyMMdd') AS INT) AS start_date,
        CAST(from_unixtime(unix_timestamp()-1*60*60*24,'yyyyMMdd') AS INT) AS end_date
)
SELECT * FROM myTable
WHERE date_id BETWEEN (SELECT start_date FROM date_range) AND (SELECT end_date FROM date_range)
The intended result is the set of records from myTable that have a date_id between the start_date and end_date as found in the CTE date_range. Perhaps I'm going about this all wrong?
You can do a cross join; it does not require an ON condition. Your date_range dataset is only one row, so you can CROSS JOIN it with your table and it will be transformed into a map join (your small dataset will be broadcast to all the mappers, loaded into each mapper's memory, and will work very fast). Check the EXPLAIN command and make sure it is a map join:
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=250000000;
WITH
date_range AS (
    SELECT
        CAST(from_unixtime(unix_timestamp()-30*60*60*24,'yyyyMMdd') AS INT) AS start_date,
        CAST(from_unixtime(unix_timestamp()-1*60*60*24,'yyyyMMdd') AS INT) AS end_date
)
SELECT t.*
FROM myTable t
CROSS JOIN date_range d
WHERE t.date_id BETWEEN d.start_date AND d.end_date
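To verify the conversion, you can prefix the same query with EXPLAIN and check that the plan shows a map join (typically a "Map Join Operator") rather than a common join, for example:
EXPLAIN
WITH
date_range AS (
    SELECT
        CAST(from_unixtime(unix_timestamp()-30*60*60*24,'yyyyMMdd') AS INT) AS start_date,
        CAST(from_unixtime(unix_timestamp()-1*60*60*24,'yyyyMMdd') AS INT) AS end_date
)
SELECT t.*
FROM myTable t
CROSS JOIN date_range d
WHERE t.date_id BETWEEN d.start_date AND d.end_date;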
Also, instead of this, you can calculate the dates directly in the WHERE clause and skip the CTE and join entirely:
SELECT t.*
FROM myTable t
WHERE t.date_id
    BETWEEN CAST(from_unixtime(unix_timestamp()-30*60*60*24,'yyyyMMdd') AS INT)
        AND CAST(from_unixtime(unix_timestamp()-1*60*60*24,'yyyyMMdd') AS INT)

Count distinct users over n-days

My table consists of two fields: CalDay, a timestamp field with the time set to 00:00:00, and UserID.
Together they form a compound key, but it is important to keep in mind that there are many rows for each given calendar day and there is no fixed number of rows for a given day.
Based on this dataset I need to calculate how many distinct users there are over a set window of time, say 30 days.
Using Postgres 9.3 I cannot use COUNT(DISTINCT UserID) OVER ..., nor can I work around the issue using DENSE_RANK() OVER (... RANGE BETWEEN), because RANGE only accepts UNBOUNDED.
So I went the old-fashioned way and tried a scalar subquery:
SELECT
    xx.*,
    (
        SELECT COUNT(DISTINCT UserID)
        FROM data_table AS yy
        WHERE yy.CalDay BETWEEN xx.CalDay - interval '30 days' AND xx.CalDay
    ) as rolling_count
FROM data_table AS xx
ORDER BY xx.CalDay
In theory, this should work, right? I am not sure yet, because I started the query about 20 minutes ago and it is still running. Herein lies the problem: the dataset is still relatively small (25,000 rows) but will grow over time. I need something that scales and performs better.
I was thinking that maybe - just maybe - using the unix epoch instead of the timestamp could help but it is only a wild guess. Any suggestion would be welcome.
This should work. I can't comment on speed, but it should take a lot less time than your current query. Hopefully you have indexes on both of these fields.
SELECT t1.calday, COUNT(DISTINCT t1.userid) AS daily, COUNT(DISTINCT t2.userid) AS last_30_days
FROM data_table t1
JOIN data_table t2
ON t2.calday BETWEEN t1.calday - '30 days'::INTERVAL AND t1.calday
GROUP BY t1.calday
UPDATE
Tested it with a lot of data. The above works but is slow. Much faster to do it like this:
SELECT t1.*, COUNT(DISTINCT t2.userid) AS last_30_days
FROM (
    SELECT calday, COUNT(DISTINCT userid) AS daily
    FROM data_table
    GROUP BY calday
) t1
JOIN data_table t2
    ON t2.calday BETWEEN t1.calday - '30 days'::INTERVAL AND t1.calday
GROUP BY 1, 2
So instead of building up a massive table of all the JOIN combinations and then grouping/aggregating, it first gets the "daily" data, then joins the 30-day window onto that. That keeps the join much smaller and it returns quickly (just under 1 second for 45,000 rows in the source table on my system).
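For reference, the indexes mentioned at the start of this answer might look something like this (the index names are just placeholders):
-- calday supports the BETWEEN range condition in the join;
-- userid supports the DISTINCT count. A composite index on (calday, userid)
-- could also serve both at once.
CREATE INDEX idx_data_table_calday ON data_table (calday);
CREATE INDEX idx_data_table_userid ON data_table (userid);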

How to get the count of timestamps where the interval to the next row is bigger than xx seconds in PostgreSQL

I have a table with 3 columns (Postgres 9.6): serial, timestamp, clock_name.
Usually there is a 1 second difference between rows, but sometimes the interval is bigger.
I'm trying to get the number of occasions where the timestamp interval between 2 rows was bigger than 10 seconds (let's say I limit this to 1000 rows).
I would like to do this in one query (probably a select from a select), but I have no idea how to write such a query; my SQL knowledge is very basic.
Any help will be appreciated.
You can use window functions to retrieve the next record given the current record.
Using ORDER BY on the function to ensure things are in timestamp order, and PARTITION BY to keep the clocks separate, you can find for each row the row that follows it.
WITH links AS
(
    SELECT
        id, ts, clock, LEAD(ts) OVER (PARTITION BY clock ORDER BY ts) AS next_ts
    FROM myTable
)
SELECT * FROM links
WHERE
    EXTRACT(EPOCH FROM (next_ts - ts)) > 10
You can then just compare the timestamps.
Window functions: https://www.postgresql.org/docs/current/static/functions-window.html
Or, if you prefer, you can use a derived table instead of the WITH clause:
SELECT * FROM (
    SELECT
        id, ts, clock, LEAD(ts) OVER (PARTITION BY clock ORDER BY ts) AS next_ts
    FROM myTable
) links
WHERE
    EXTRACT(EPOCH FROM (next_ts - ts)) > 10
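Since the question asks for the number of such gaps rather than the rows themselves, you can wrap either form in an outer COUNT(*), for example:
SELECT COUNT(*) AS gaps_over_10_seconds
FROM (
    SELECT
        id, ts, clock, LEAD(ts) OVER (PARTITION BY clock ORDER BY ts) AS next_ts
    FROM myTable
) links
WHERE
    EXTRACT(EPOCH FROM (next_ts - ts)) > 10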

PostgreSQL subquery not working

What's wrong with this query?
select extract(week from created_at) as week,
count(*) as received,
(select count(*) from bugs where extract(week from updated_at) = a.week) as done
from bugs as a
group by week
The error message is:
column a.week does not exist
UPDATE:
Following the suggestion in the first comment, I tried this:
select a.extract(week from created_at) as week,
       count(*) as received,
       (select count(*)
        from bugs
        where extract(week from updated_at) = a.week) as done
from bugs as a
group by week
But it doesn't seem to work:
ERROR: syntax error at or near "from"
LINE 1: select a.extract(week from a.created_at) as week, count(*) a...
As far as I can tell you don't need the sub-select at all:
select extract(week from created_at) as week,
count(*) as received,
sum( case when extract(week from updated_at) = extract(week from created_at) then 1 end) as done
from bugs
group by week
This counts all bugs per week and counts those that are updated in the same week as "done".
Note that your query will only report correct values if you never have more than one year of data in your table.
If you have more than one year of data in the table you need to include the year in the comparison as well:
select to_char(created_at, 'iyyy-iw') as week,
count(*) as received,
sum( case when to_char(created_at, 'iyyy-iw') = to_char(updated_at, 'iyyy-iw') then 1 end) as done
from bugs
group by week
Note that I used IYYY and IW to cater for the ISO definition of the year and the week around the year end/start.
Maybe a little explanation of why your original query did not work would be helpful:
The "outer" query uses two aliases
a table alias for bugs named a
a column alias for the expression extract(week from created_at) named week
The only place where the column alias week can be used is in the group by clause.
To the sub-select (select count(*) from bugs where extract(week from updated_at) = a.week) the alias a is visible, but not the alias week (that's how the SQL standard is defined).
To get your subselect working (in terms of column visibility) you would need to reference the full expression of the "outer" column:
(select count(*) from bugs b where extract(week from b.updated_at) = extract(week from a.created_at))
Note that I introduced another table alias b in order to make it clear which column stems from which alias.
But even then you'd have a problem with the grouping as you can't reference an ungrouped column like that.
This could work as well:
with origin as (
    select extract(week from created_at) as week, count(*) as received
    from bugs
    group by week
)
select week, received,
       (select count(*) from bugs where week = extract(week from updated_at))
from origin
It should have good performance.