We use Amazon Redshift in our project.
We have assigned different schemas to different teams: for example, marketing gets a separate schema to store its tables for analysis, while the sales team gets another.
What happens is that analysts from one group use up the majority of the database's space with tables that are essentially temporary, yet never drop or purge them. The discipline to maintain each schema is left to its owners, so every now and then we end up doing a housekeeping exercise.
I wanted to know if we can cap the size per schema/database. Let's say we allot 100 GB to the sales schema, 50 GB to marketing, and so on...
At the time this was originally written, Redshift did not provide a feature to restrict the size per schema/database, but there is a workaround (see the update below for the QUOTA clause that was added later).
Since you can get the data size per table with the following queries, you can write a script that monitors usage and sends an alert whenever a limit is exceeded, then run the script periodically through cron; a sketch of such a check follows the three queries.
Query to get data size and number of rows per table
select
trim(pgdb.datname) as database, trim(pgn.nspname) as schema,
trim(a.name) as "table", b.mbytes, a.rows
from
(select db_id, id, name, sum(rows) as rows from stv_tbl_perm a group by db_id, id, name) as a
join pg_class as pgc on pgc.oid = a.id
join pg_namespace as pgn on pgn.oid = pgc.relnamespace
join pg_database as pgdb on pgdb.oid = a.db_id
join (select tbl, count(*) as mbytes from stv_blocklist group by tbl) b on a.id=b.tbl
order by 1, 2, 3;
Example result:
 database |    schema     |    table    | mbytes |  rows
----------+---------------+-------------+--------+---------
 test_db  | dev_schema_1  | click_log   |     23 |    4653
 prod_db  | prod_schema_1 | click_log   |  16217 | 2112354
 prod_db  | prod_schema_1 | install_log |   5544 |  433538
Query to get data size and number of rows per schema
select
trim(pgdb.datname) as database, trim(pgn.nspname) as schema,
sum(b.mbytes) as mbytes, sum(a.rows) as rows
from
(select db_id, id, name, sum(rows) as rows from stv_tbl_perm a group by db_id, id, name) as a
join pg_class as pgc on pgc.oid = a.id
join pg_namespace as pgn on pgn.oid = pgc.relnamespace
join pg_database as pgdb on pgdb.oid = a.db_id
join (select tbl, count(*) as mbytes from stv_blocklist group by tbl) b on a.id=b.tbl
group by pgdb.datname, pgn.nspname
order by 1, 2;
Query to get data size and number of rows per database
select
trim(pgdb.datname) as database, sum(b.mbytes) as mbytes, sum(a.rows) as rows
from
(select db_id, id, name, sum(rows) as rows from stv_tbl_perm a group by db_id, id, name) as a
join pg_class as pgc on pgc.oid = a.id
join pg_namespace as pgn on pgn.oid = pgc.relnamespace
join pg_database as pgdb on pgdb.oid = a.db_id
join (select tbl, count(*) as mbytes from stv_blocklist group by tbl) b on a.id=b.tbl
group by pgdb.datname
order by 1;
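For the cron job, the per-schema query can be turned directly into the alert check itself. Below is a minimal sketch; the 'sales' schema name and the 100 GB limit (102400 MB) are assumptions for illustration. It returns a row only when the schema exceeds its allotment, so the surrounding script merely has to alert on non-empty output:
select
trim(pgn.nspname) as schema, sum(b.mbytes) as mbytes
from
(select db_id, id, name from stv_tbl_perm group by db_id, id, name) as a
join pg_class as pgc on pgc.oid = a.id
join pg_namespace as pgn on pgn.oid = pgc.relnamespace
join (select tbl, count(*) as mbytes from stv_blocklist group by tbl) b on a.id=b.tbl
where trim(pgn.nspname) = 'sales' -- assumed schema name
group by pgn.nspname
having sum(b.mbytes) > 102400; -- assumed limit: 100 GB in MB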
Update: this feature now exists in Redshift; see the Redshift CREATE SCHEMA docs.
Relevant example from the docs:
create schema us_sales authorization dwuser QUOTA 50 GB;
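The quota can also be adjusted later with ALTER SCHEMA. A sketch, with the schema name carried over from the example above and the new size assumed:
alter schema us_sales quota 100 GB;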
Related
Let's say I want the ids that are counted as a result of my query, and then want to check whether the same ids appear in a different month.
Here I join two tables via distinct ids and count the returned rows to know how many matching ids I have; below, that is done for the month of June.
I'd like:
e.g. in June, 100 distinct ids
e.g. in July, 90 of the same ids left
Please help! I am stuck, as my SQL is not very advanced. Here is my attempt:
with total as (
select distinct(transactions.u_id), count(*)
from transactions
join contacts using (u_id)
join table using (contact_id)
where transactions.when_created between '2020-06-01' AND '2020-06-30'
group by transactions.u_id
HAVING COUNT(*) > 1
)
SELECT
COUNT(*)
FROM
total
Let's say that you are interested in a query like this:
select transactions.u_id, count(*) as total
from transactions
join contacts using (u_id)
join table using (contact_id)
where transactions.when_created between '2020-06-01' and '2020-06-30'
group by transactions.u_id;
You are also interested in
select transactions.u_id, count(*) as total
from transactions
join contacts using (u_id)
join table using (contact_id)
where transactions.when_created between '2020-08-01' and '2020-08-31'
group by transactions.u_id;
And you want to get:
the ids which can be found by both queries
the minimum of the two totals
Then you can do something like this:
select t1.u_id,
case
when t1.total > t2.total then t2.total
else t1.total
end as total
from (
select transactions.u_id, count(*) as total
from transactions
join contacts using (u_id)
join table using (contact_id)
where transactions.when_created between '2020-06-01' and '2020-06-30'
group by transactions.u_id) t1
join (
select transactions.u_id, count(*) as total
from transactions
join contacts using (u_id)
join table using (contact_id)
where transactions.when_created between '2020-08-01' and '2020-08-31'
group by transactions.u_id) t2
on t1.u_id = t2.u_id
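As a side note, if your database supports the LEAST function (PostgreSQL, Redshift, and MySQL all do), the CASE expression can be written more compactly. The same query then becomes (table names are the placeholders from above):
select t1.u_id, least(t1.total, t2.total) as total
from (
select transactions.u_id, count(*) as total
from transactions
join contacts using (u_id)
join table using (contact_id)
where transactions.when_created between '2020-06-01' and '2020-06-30'
group by transactions.u_id) t1
join (
select transactions.u_id, count(*) as total
from transactions
join contacts using (u_id)
join table using (contact_id)
where transactions.when_created between '2020-08-01' and '2020-08-31'
group by transactions.u_id) t2
on t1.u_id = t2.u_id;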
I have a many-to-many relation with three columns (owner_id, property_id, ownership_perc); many owners own many properties.
I would like to find all the owner_ids that have more than one property (property_id) and join them with three other tables (Tables 1, 3, 4) in order to get further information for the requested result.
The tables I'm using are:
Table 1: owner (id_owner,name)
Table 2: owner_property (owner_id,property_id,ownership_perc)
Table 3: property(id_property,building_id)
Table 4: building(id_building,address,region)
When I try it like this, the query runs but returns nothing:
SELECT address,region,name
FROM owner_property
JOIN property ON owner_property.property_id = property.id_property
JOIN owner ON owner.id_owner = owner_property.owner_id
JOIN building ON property.building_id=building.id_building
GROUP BY owner_id,address,region,name
HAVING count(owner_id) > 1
ORDER BY owner_id;
Only when I try the code below does it return the owner_ids with more than one property, but without joining to the other three tables:
SELECT a.*
FROM owner_property a
JOIN (SELECT owner_id, COUNT(owner_id)
FROM owner_property
GROUP BY owner_id
HAVING COUNT(owner_id)>1) b
ON a.owner_id = b.owner_id
ORDER BY a.owner_id,property_id ASC;
So, is there any suggestion on what I'm doing wrong when I'm joining the tables? Thank you!
This query:
SELECT owner_id
FROM owner_property
GROUP BY owner_id
HAVING COUNT(property_id) > 1
returns all the owner_ids with more than 1 property_ids.
If there is a case of duplicates in the combination of owner_id and property_id then instead of COUNT(property_id) use COUNT(DISTINCT property_id) in the HAVING clause.
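For clarity, that variant of the inner query would read:
SELECT owner_id
FROM owner_property
GROUP BY owner_id
HAVING COUNT(DISTINCT property_id) > 1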
So join it to the other tables:
SELECT b.address, b.region, o.name
FROM (
SELECT owner_id
FROM owner_property
GROUP BY owner_id
HAVING COUNT(property_id) > 1
) t
INNER JOIN owner_property op ON op.owner_id = t.owner_id
INNER JOIN property p ON op.property_id = p.id_property
INNER JOIN owner o ON o.id_owner = op.owner_id
INNER JOIN building b ON p.building_id = b.id_building
ORDER BY op.owner_id, op.property_id ASC;
Always qualify the column names with the table name/alias.
You can try to use a correlated subquery that counts the ownerships with EXISTS in the WHERE clause.
SELECT b1.address,
b1.region,
o1.name
FROM owner_property op1
INNER JOIN owner o1
ON o1.id_owner = op1.owner_id
INNER JOIN property p1
ON p1.id_property = op1.property_id
INNER JOIN building b1
ON b1.id_building = p1.building_id
WHERE EXISTS (SELECT ''
FROM owner_property op2
WHERE op2.owner_id = op1.owner_id
HAVING count(*) > 1);
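Another option, supported by most modern databases, is a window function: count each owner's properties once, then filter on that count. A sketch against the same tables:
SELECT address, region, name
FROM (SELECT b1.address,
             b1.region,
             o1.name,
             COUNT(*) OVER (PARTITION BY op1.owner_id) AS cnt_props
      FROM owner_property op1
      INNER JOIN owner o1 ON o1.id_owner = op1.owner_id
      INNER JOIN property p1 ON p1.id_property = op1.property_id
      INNER JOIN building b1 ON b1.id_building = p1.building_id) t
WHERE cnt_props > 1;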
The query:
SELECT COUNT(*) as count_all,
posts.id as post_id
FROM posts
INNER JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id;
Returns n records in PostgreSQL:
count_all | post_id
-----------+---------
1 | 6
3 | 4
3 | 5
3 | 1
1 | 9
1 | 10
(6 rows)
I just want to retrieve the number of records returned: 6.
I used a subquery to achieve what I want, but this doesn't seem optimal:
SELECT COUNT(*) FROM (
SELECT COUNT(*) as count_all, posts.id as post_id
FROM posts
INNER JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id
) as x;
How would I get the number of records in this context in PostgreSQL?
I think you just need COUNT(DISTINCT post_id) FROM votes.
See "4.2.7. Aggregate Expressions" section in http://www.postgresql.org/docs/current/static/sql-expressions.html.
EDIT: Corrected my careless mistake per Erwin's comment.
There is also EXISTS:
SELECT count(*) AS post_ct
FROM posts p
WHERE EXISTS (SELECT FROM votes v WHERE v.post_id = p.id);
In Postgres and with multiple entries on the n-side like you probably have, it's generally faster than count(DISTINCT post_id):
SELECT count(DISTINCT p.id) AS post_ct
FROM posts p
JOIN votes v ON v.post_id = p.id;
The more rows per post there are in votes, the bigger the difference in performance. Test with EXPLAIN ANALYZE.
count(DISTINCT post_id) has to read all rows, sort or hash them, and then only consider the first per identical set. EXISTS will only scan votes (or, preferably, an index on post_id) until the first match is found.
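For example, to compare the two on your own data, prefix each candidate with EXPLAIN ANALYZE and look at the plan and total runtime (shown here for the EXISTS variant from above):
EXPLAIN ANALYZE
SELECT count(*) AS post_ct
FROM posts p
WHERE EXISTS (SELECT FROM votes v WHERE v.post_id = p.id);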
If every post_id in votes is guaranteed to be present in the table posts (referential integrity enforced with a foreign key constraint), this short form is equivalent to the longer form:
SELECT count(DISTINCT post_id) AS post_ct
FROM votes;
This may actually be faster than the EXISTS query when there are no or few entries per post.
The query you had works in simpler form, too:
SELECT count(*) AS post_ct
FROM (
SELECT FROM posts
JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id
) sub;
Benchmark
To verify my claims I ran a benchmark on my test server with limited resources. All in a separate schema:
Test setup
Fake a typical post / vote situation:
CREATE SCHEMA y;
SET search_path = y;
CREATE TABLE posts (
id int PRIMARY KEY
, post text
);
INSERT INTO posts
SELECT g, repeat(chr(g%100 + 32), (random()* 500)::int) -- random text
FROM generate_series(1,10000) g;
DELETE FROM posts WHERE random() > 0.9; -- create ~ 10 % dead tuples
CREATE TABLE votes (
vote_id serial PRIMARY KEY
, post_id int REFERENCES posts(id)
, up_down bool
);
INSERT INTO votes (post_id, up_down)
SELECT g.*
FROM (
SELECT ((random()* 21)^3)::int + 1111 AS post_id -- uneven distribution
, random()::int::bool AS up_down
FROM generate_series(1,70000)
) g
JOIN posts p ON p.id = g.post_id;
All of the following queries returned the same result (8093 of 9107 posts had votes).
I ran 4 tests with EXPLAIN ANALYZE and took the best of five on Postgres 9.1.4 with each of the three queries, and appended the resulting total runtimes.
1. As is.
2. After:
ANALYZE posts;
ANALYZE votes;
3. After:
CREATE INDEX foo ON votes(post_id);
4. After:
VACUUM FULL ANALYZE posts;
CLUSTER votes USING foo;
count(*) ... WHERE EXISTS
253 ms
220 ms
85 ms -- winner (seq scan on posts, index scan on votes, nested loop)
85 ms
count(DISTINCT x) - long form with join
354 ms
358 ms
373 ms -- (index scan on posts, index scan on votes, merge join)
330 ms
count(DISTINCT x) - short form without join
164 ms
164 ms
164 ms -- (always seq scan)
142 ms
Best time for original query in question:
353 ms
For simplified version:
348 ms
#wildplasser's query with a CTE uses the same plan as the long form (index scan on posts, index scan on votes, merge join) plus a little overhead for the CTE. Best time:
366 ms
Index-only scans in the upcoming PostgreSQL 9.2 can improve the result for each of these queries, most of all for EXISTS.
Related, more detailed benchmark for Postgres 9.5 (actually retrieving distinct rows, not just counting):
Select first row in each GROUP BY group?
Using OVER() and LIMIT 1:
SELECT COUNT(1) OVER()
FROM posts
INNER JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id
LIMIT 1;
WITH uniq AS (
SELECT DISTINCT posts.id as post_id
FROM posts
JOIN votes ON votes.post_id = posts.id
-- GROUP BY not needed anymore
-- GROUP BY posts.id
)
SELECT COUNT(*)
FROM uniq;
For followers, I like the OP's inner query method:
SELECT COUNT(*) FROM (
SELECT COUNT(*) as count_all, posts.id as post_id
FROM posts
INNER JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id
) as x;
You can then use HAVING in there as well:
SELECT COUNT(*) FROM (
SELECT COUNT(*) as count_all, posts.id as post_id
FROM posts
INNER JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id HAVING count(*) > 1
) as x;
Or the equivalent CTE:
with posts_coalesced as (
SELECT COUNT(*) as count_all, posts.id as post_id
FROM posts
INNER JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id )
select count(*) from posts_coalesced;
I have 3 tables with the following schema
Table1 - plan
id | count
Table2 - subscription
id | plan_id
Table3 - users
id | subscription_id
I have to select count from plan where plan.id matches subscription.plan_id, users.subscription_id matches subscription.id, and users.id = '5'.
How can I use a JOIN here?
I tried it with no JOIN, as below:
SELECT count
FROM users, subscription, plan
WHERE users.id = '5' and users.subscription_id = subscription.id and subscription.plan_id = plan.id;
Try to avoid implicit SQL joins; use explicit joins (e.g. INNER JOIN, LEFT JOIN) instead in order to retain readability.
A cleaner way would be the following:
SELECT
plan.count
FROM
users
INNER JOIN subscription ON users.subscription_id = subscription.id
INNER JOIN plan ON subscription.plan_id = plan.id
WHERE users.id = '5'
Explicit vs implicit SQL joins
See the query below:
Select count(*) FROM
(Select distinct Student_ID, Name, Student_Age, CourseID from student) a1
JOIN
(Select distinct CourseID, CourseName, TeacherID from courses) a2
ON a1.CourseID=a2.CourseID
JOIN
(Select distinct TeacherID, TeacherName, Teacher_Age from teachers) a3
ON a2.TeacherID=a3.TeacherID
The subqueries must be used, for deduplication purposes.
This query runs fine in PostgreSQL. However, if I add a condition between the student and teacher tables, the execution plan shows that Postgres picks a nested-loop join between the student and teacher tables, which have no direct relationship. For example:
Select count(*) FROM
(Select distinct Student_ID, Name, Student_Age, CourseID from student) a1
JOIN
(Select distinct CourseID, CourseName, TeacherID from courses) a2
ON a1.CourseID=a2.CourseID
JOIN
(Select distinct TeacherID, TeacherName, Teacher_Age from teachers) a3 ON
a2.TeacherID=a3.TeacherID
WHERE Teacher_Age>=Student_Age
This query will take forever to run. However, if I replace the subqueries with the plain tables, it runs very fast. Without using temp tables to store the deduplicated results, is there a way to avoid the nested loop in this situation?
Thank you for your help.
You're making the database perform a lot of unnecessary work to accomplish your goal. Instead of doing three different SELECT DISTINCT sub-queries all joined together, try joining the base tables directly to each other and let the database handle the DISTINCT part only once. If your tables have proper indexes on the ID fields, this should run rather quickly.
SELECT COUNT(1)
FROM (
SELECT DISTINCT s.Student_ID, c.CourseID, t.TeacherID
FROM student s
JOIN courses c ON s.CourseID = c.CourseID
JOIN teachers t ON c.TeacherID = t.TeacherID
WHERE t.Teacher_Age >= s.Student_Age
) a
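If those indexes don't exist yet, sketches like the following would add them (the index names here are made up for illustration; the exact columns worth indexing depend on your schema and data):
CREATE INDEX idx_student_course ON student (CourseID);
CREATE INDEX idx_courses_teacher ON courses (TeacherID);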