I have a table of users, and each user has a country. I want to list every country with its number of users and that number as a percentage of the total. What I have so far is:
SELECT
country_id,
COUNT(*) AS total,
((COUNT(*) * 100) / (SELECT COUNT(*) FROM users WHERE cond1 = true AND cond2 = true AND cond3 = true)::decimal) AS percent
FROM users
WHERE cond1 = true AND cond2 = true AND cond3 = true
GROUP BY country_id
The conditions in both queries are the same. I tried to do this without a subquery, but then I only get the total per country, not the overall number of users. Is there a way to do this without a subquery? I'm using PostgreSQL. Any help is highly appreciated.
Thanks in advance
I guess the reason you want to eliminate the subquery is to avoid scanning the users table twice. Remember the total is the sum of the counts for each country.
WITH c AS (
SELECT
country_id,
count(*) AS cnt
FROM users
WHERE cond1=...
GROUP BY country_id
)
SELECT
*,
100.0 * cnt / (SELECT sum(cnt) FROM c) AS percent
FROM c;
This query builds a small CTE with the per-country statistics. It will only scan the users table once, and generate a small result set (only one row per country).
The total (SELECT sum(cnt) FROM c) is calculated only once on this small result set, so it uses negligible time.
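The arithmetic behind this can be sketched outside the database. The Python snippet below is only an illustration with made-up sample data (the `rows` list and the `country_percentages` function are hypothetical names, not part of the answer): it counts users per country in one pass, then derives the total from those per-country counts instead of counting the rows a second time.

```python
from collections import Counter

def country_percentages(country_ids):
    """Return {country_id: (count, percent)} from a single pass over the rows."""
    counts = Counter(country_ids)      # the one "scan": per-country counts
    total = sum(counts.values())       # total derived from the small result set
    return {c: (n, 100.0 * n / total) for c, n in counts.items()}

rows = [1, 1, 1, 2, 2, 3]              # hypothetical country_id values
result = country_percentages(rows)
```

With this sample, country 1 holds 3 of the 6 users, so `result[1]` is `(3, 50.0)`.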
You could also use a window function:
SELECT
country_id,
cnt,
100.0 * cnt / (sum(cnt) OVER ()) AS percent
FROM (
SELECT country_id, count(*) as cnt from users group by country_id
) foo;
(which is the same as nightwolf's query with the errors removed lol )
Both queries take about the same time.
This is really old, but both of the select examples above either don't work, or are overly complex.
SELECT
country_id,
COUNT(*),
(COUNT(*) / (SUM(COUNT(*)) OVER() )) * 100
FROM
users
WHERE
cond1 = true AND cond2 = true AND cond3 = true
GROUP BY
country_id
The second column (the plain COUNT(*)) is not necessary; it's just there for debugging, to check that you're getting the right results. The trick is applying SUM on top of the COUNT, over the whole result set.
Hope this helps someone.
Also, if anyone wants to do this in Django, just hack up an aggregate:
from django.db.models import Aggregate, DecimalField

class PercentageOverRecordCount(Aggregate):
    function = 'OVER'
    template = '(COUNT(*) / (SUM(COUNT(*)) OVER() )) * 100'

    def __init__(self, expression, **extra):
        super().__init__(
            expression,
            output_field=DecimalField(),
            **extra
        )
Now it can be used in annotate.
I am not a PostgreSQL user, but the general solution would be to use window functions.
Read up on how to use this at http://developer.postgresql.org/pgdocs/postgres/tutorial-window.html
The best explanation I can give is that it basically lets you do a GROUP BY on one field without the GROUP BY clause.
I believe this might do the trick:
SELECT DISTINCT
country_id,
COUNT(*) OVER (PARTITION BY country_id) AS total,
(COUNT(*) OVER (PARTITION BY country_id) * 100)::decimal / COUNT(*) OVER () AS percent
FROM
users
WHERE
cond1 = true AND cond2 = true AND cond3 = true
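To illustrate how a window aggregate differs from GROUP BY, here is a small Python sketch (the function name and sample values are hypothetical). It mimics `COUNT(*) OVER (PARTITION BY country_id)`: every input row is kept and annotated with its group's count, which is why a query of this shape returns one row per user unless duplicates are removed afterwards.

```python
from collections import Counter

def count_over_partition(country_ids):
    """Mimic COUNT(*) OVER (PARTITION BY country_id): one output row per input row."""
    counts = Counter(country_ids)
    return [(c, counts[c]) for c in country_ids]

windowed = count_over_partition([1, 1, 2])
# Rows are annotated, not collapsed: country 1 appears twice, each time with count 2.
distinct_rows = sorted(set(windowed))
```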
With a recent PostgreSQL version, the query can be written as follows:
CREATE TABLE users (
id serial,
country_id int
);
INSERT INTO users (country_id) VALUES (1),(1),(1),(2),(2),(3);
select distinct
country_id,
round(
((COUNT(*) OVER (partition by country_id )) * 100)::numeric
/ COUNT(*) OVER ()
, 2) as percent
from users
order by country_id
;
Result on SQLize.online
+============+=========+
| country_id | percent |
+============+=========+
| 1 | 50.00 |
+------------+---------+
| 2 | 33.33 |
+------------+---------+
| 3 | 16.67 |
+------------+---------+
I have a CTE that I want to grab data from, but I want different subsets of the data, each with the same limit, selected from the same CTE according to different rules.
Example: fruit_cte -> (id::integer, name::text, q1::boolean, q2::boolean)
I could do something like:
SELECT * FROM (SELECT 1 as query_num, * FROM fruit_cte WHERE q1 ORDER BY name LIMIT 100) as ABC
UNION ALL
SELECT * FROM (SELECT 2 as query_num, * FROM fruit_cte WHERE q2 ORDER BY name LIMIT 100) as ABC
UNION ALL
SELECT * FROM (SELECT -1 as query_num, * FROM fruit_cte WHERE NOT q1 AND NOT q2 ORDER BY name LIMIT 100) as ABC
But this is very costly, and it would be nice to combine it into one SELECT. Is this even possible?
The last SELECT is a nice-to-have for fetching rows that meet neither condition; I could go without it.
PG version 11+
You could get it all without the CTE, by using window functions instead.
SELECT type, id, name, q1, q2
FROM (
SELECT
CASE WHEN q1 THEN 1 WHEN q2 THEN 2 ELSE -1 END AS type,
ROW_NUMBER() OVER (
PARTITION BY CASE WHEN q1 THEN 1 WHEN q2 THEN 2 ELSE -1 END
ORDER BY NAME
) AS row_number,
id,
name,
q1,
q2
FROM ...
) AS ranked
WHERE row_number <= 100
ROW_NUMBER() counts rows in name order and keeps a separate tally for every type.
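The same idea, a per-category row number followed by a cutoff, can be sketched in Python. This is only an illustration under assumptions: the `fruit` rows and the function names are invented, and `classify` simply copies the precedence of the CASE expression above.

```python
from collections import defaultdict

def classify(row):
    """Same precedence as the CASE expression: q1 first, then q2, else -1."""
    if row["q1"]:
        return 1
    if row["q2"]:
        return 2
    return -1

def top_n_per_type(rows, n=100):
    """Keep the first n rows (ordered by name) of every type."""
    groups = defaultdict(list)
    for row in rows:
        groups[classify(row)].append(row)
    result = []
    for kind in sorted(groups):
        ranked = sorted(groups[kind], key=lambda r: r["name"])
        # enumerate(..., start=1) plays the role of ROW_NUMBER()
        result.extend((kind, r["id"], r["name"])
                      for num, r in enumerate(ranked, start=1) if num <= n)
    return result

# Hypothetical sample rows shaped like fruit_cte
fruit = [
    {"id": 1, "name": "apple", "q1": True, "q2": False},
    {"id": 2, "name": "banana", "q1": True, "q2": True},
    {"id": 3, "name": "cherry", "q1": False, "q2": True},
    {"id": 4, "name": "date", "q1": False, "q2": False},
    {"id": 5, "name": "elder", "q1": False, "q2": True},
]
top_one = top_n_per_type(fruit, n=1)
```

One difference from the UNION ALL version is worth noting: a row where both q1 and q2 are true is assigned to type 1 only, whereas the three separate queries could return it under both 1 and 2.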
I have a table with people and another with visits. I want to count all visits but if the person signed up with 'emp' or 'oth' on ref_signup then remove the first visit. Example:
These are my tables:
PEOPLE:
id | ref_signup
---------------------
20 | emp
30 | oth
23 | fri
VISITS
id | date
-------------------------
20 | 10-01-2019
20 | 10-05-2019
23 | 10-09-2019
23 | 10-10-2019
30 | 09-10-2019
30 | 10-07-2019
In this example the visit count should be 4: the people with ids 20 and 30 have ref_signup emp or oth, so their first visit is excluded and only their second and later visits are counted.
This is what I have as a query:
SELECT COUNT(*) as visit_count FROM visits
LEFT JOIN people ON people.id = visits.people_id
WHERE visits.group_id = 1
Would using a CASE inside the COUNT help here? I only want to remove one visit per person, not all of that person's visits.
Subtract from COUNT(*) the number of distinct people.id values whose ref_signup is 'emp' or 'oth':
SELECT
COUNT(*) -
COUNT(DISTINCT CASE WHEN p.ref_signup IN ('emp', 'oth') THEN p.id END) as visit_count
FROM visits v LEFT JOIN people p
ON p.id = v.id
See the demo.
Result:
| visit_count |
| ----------- |
| 4 |
Note: this code and demo fiddle use the column names of your sample data.
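As a quick sanity check of the subtraction, here is a Python sketch using the sample data from the question (the variable and function names are invented for illustration). It counts all visits, then subtracts one for each distinct visiting person whose ref_signup is emp or oth.

```python
people = {20: "emp", 30: "oth", 23: "fri"}   # id -> ref_signup
visits = [20, 20, 23, 23, 30, 30]            # person id of each visit row

def adjusted_visit_count(visits, people, skipped=("emp", "oth")):
    # COUNT(*) over all visits
    total = len(visits)
    # COUNT(DISTINCT ...) of visiting people whose first visit is ignored;
    # people.get() returns None for unknown ids, like the LEFT JOIN's NULLs
    ignored = {pid for pid in visits if people.get(pid) in skipped}
    return total - len(ignored)

visit_count = adjusted_visit_count(visits, people)
```

This works because each affected person loses exactly one visit, so the number of visits to drop equals the number of such people who have at least one visit.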
The premise: select the count of visits for each person, along with a synthetic column that contains 1 if the referral was from emp or oth and 0 otherwise. Then select the sum of the counts minus the sum of that column.
SELECT SUM(count) - SUM(ignore_first) AS visit_count FROM (SELECT COUNT(*) AS count, CASE WHEN people.ref_signup IN ('emp', 'oth') THEN 1 ELSE 0 END AS ignore_first FROM visits
LEFT JOIN people ON people.id = visits.people_id
WHERE visits.group_id = 1 GROUP BY people.id, people.ref_signup) a
Where is "people_id" in your example?
SELECT COUNT(*) as visit_count
FROM visits v
JOIN people p ON p.id = v.people_id
WHERE p.ref_signup IN ('emp','oth');
then remove the first visit.
You cannot select the count and delete the first visit at the same time.
DELETE FROM visits
WHERE id IN (
SELECT v.id
FROM visits v
JOIN people p ON p.id = v.people_id
WHERE p.ref_signup IN ('emp','oth')
ORDER BY v.id
LIMIT 1
);
First, I create the tables:
create table people (id int primary key, ref_signup varchar(3));
insert into people (id, ref_signup) values (20, 'emp'), (30, 'oth'), (23, 'fri');
create table visits (people_id int not null, visit_date date not null);
insert into visits (people_id, visit_date) values (20, '10-01-2019'), (20, '10-05-2019'), (23, '10-09-2019'), (23, '10-10-2019'), (30, '09-10-2019'), (30, '10-07-2019');
You can use the row_number() window function to mark which visit is "visit number one":
select
*,
row_number() over (partition by people_id order by visit_date) as visit_num
from people
join visits
on people.id = visits.people_id
Once you have that, you can do another query on those results, and use the filter clause to count up the correct rows that match the condition where visit_num > 1 or ref_signup = 'fri':
-- wrap the first query in a WITH clause
with joined_visits as (
select
*,
row_number() over (partition by people_id order by visit_date) as visit_num
from people
join visits
on people.id = visits.people_id
)
select count(1) filter (where visit_num > 1 or ref_signup = 'fri')
from joined_visits;
-- First get the corrected counts for all users
WITH grouped_visits AS (
SELECT
COUNT(visits.*) -
CASE WHEN people.ref_signup IN ('emp', 'oth') THEN 1 ELSE 0 END
AS visit_count
FROM visits
INNER JOIN people ON (people.id = visits.id)
GROUP BY people.id, people.ref_signup
)
-- Then sum them
SELECT SUM(visit_count)
FROM grouped_visits;
This should give you the result you're looking for.
On a side note, I can't help but think that a clever use of a window function could do this in a single shot without the CTE.
EDIT: No, it can't, since window functions run after the WHERE, GROUP BY, and HAVING clauses.
I have a table that lists individual items and the amount we billed for them. We receive a payment that may be less than the total amount billed. I want to allocate that payment to each item in proportion to the original billed amount.
Here's the tricky part.
Each individual paid amount can not have fractional cents
The sum of the individual paid amounts must still add up to the TotalPaid Amount.
Setting up the data:
declare @t table
(
id varchar(4) primary key,
Billed money not null
)
insert into @t
(
id,
billed
)
values
( 'A', 5),
( 'B', 3),
( 'C', 2)
declare @TotalPaid money
Set @TotalPaid = 3.33
This way doesn't work
SELECT
ID,
Round(@TotalPaid * Billed / (Select Sum(Billed) from @t), 2)
From
@t
it will return:
A 1.67
B 1.00
C 0.67
-----
3.34 <--- Note the sum doesn't equal the Total Paid
I know I can accomplish this via a cursor or a loop, keeping track of the unallocated amount at each step and ensuring that after the last item the entire TotalPaid amount is allocated.
However I was hoping there was a way to do this without a loop or cursors.
This is a greatly simplified version of the problem I'm trying to address. The actual data has over 100K rows and the cursor approach is really slow.
I think this is a viable approach...
(Pass 1 as the third parameter to ROUND so that it always truncates instead of rounding, then distribute the leftover 0.01s that make up the balance to the rows where the difference between the ideal amount and the truncated amount is greatest.)
WITH t1
AS (SELECT *,
billed_adj = @TotalPaid * Billed / Sum(Billed) OVER(),
billed_adj_trunc = ROUND(@TotalPaid * Billed / Sum(Billed) OVER(), 2, 1)
FROM @t)
SELECT id,
billed,
billed_adj_trunc + CASE
WHEN ROW_NUMBER() OVER (ORDER BY billed_adj - billed_adj_trunc DESC)
<= 100 * ( @TotalPaid - SUM(billed_adj_trunc) OVER() )
THEN 0.01
ELSE 0
END
FROM t1
ORDER BY id
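The truncate-then-distribute idea can be sketched in Python as follows. This is a minimal illustration, not the answer's code: the function name is invented, amounts are handled as integer cents, and `Fraction` is used so the ideal shares are exact.

```python
from fractions import Fraction
from math import floor

def allocate_largest_remainder(total_cents, billed):
    """Split total_cents in proportion to billed, in whole cents."""
    total_billed = sum(billed.values())
    # Exact proportional share of each item, in cents
    ideal = {k: Fraction(total_cents * b, total_billed) for k, b in billed.items()}
    # Truncate every share down to a whole cent
    paid = {k: floor(v) for k, v in ideal.items()}
    # Cents still unallocated after truncation
    leftover = total_cents - sum(paid.values())
    # Give one extra cent to the items that lost the most to truncation
    by_remainder = sorted(ideal, key=lambda k: ideal[k] - paid[k], reverse=True)
    for k in by_remainder[:leftover]:
        paid[k] += 1
    return paid

paid = allocate_largest_remainder(333, {"A": 5, "B": 3, "C": 2})
```

For the sample data the truncated shares are 166, 99 and 66 cents, leaving 2 cents over; they go to B and C, whose truncated fractions (0.9 and 0.6 of a cent) are the largest, so the result is 166, 100 and 67 cents.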
Here is a (somewhat complicated) solution using a recursive common table expression:
;with cte as (
select
id
, Paid = round(@TotalPaid * Billed / (Select Sum(Billed) from @t), 2,1)
, Remainder = @TotalPaid * Billed / (Select Sum(Billed) from @t)
- round(@TotalPaid * Billed / (Select Sum(Billed) from @t), 2,1)
, x.next_id
from @t t
outer apply (
select top 1 next_id = i.id
from @t as i
where i.id > t.id
order by i.id asc
) x
)
, r_cte as (
--anchor row(s) / starting row(s)
select
id
, Paid
, Remainder
, next_id
from cte t
where not exists (
select 1
from cte as i
where i.id < t.id
)
union all
--recursion starts here
select
c.id
, c.Paid + round(c.Remainder + p.Remainder,2,1)
, Remainder = c.Remainder + p.Remainder - round(c.Remainder + p.Remainder,2,1)
, c.next_id
from cte c
inner join r_cte p
on c.id = p.next_id
)
select id, paid
from r_cte
rextester demo: http://rextester.com/MKLDX88496
returns:
+----+------+
| id | paid |
+----+------+
| A | 1.66 |
| B | 1.00 |
| C | 0.67 |
+----+------+
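What the recursion does, row by row, is carry the truncated-off fraction forward to the next item. A rough Python equivalent is sketched below; the function name is hypothetical, amounts are integer cents, and `Fraction` keeps the carried remainder exact.

```python
from fractions import Fraction
from math import floor

def allocate_with_carry(total_cents, billed):
    """Walk the items in id order, carrying each truncated remainder forward."""
    total_billed = sum(billed.values())
    carry = Fraction(0)
    paid = {}
    for key in sorted(billed):
        # This item's exact share plus whatever earlier items could not use
        ideal = Fraction(total_cents * billed[key], total_billed) + carry
        paid[key] = floor(ideal)          # whole cents only
        carry = ideal - paid[key]         # fraction of a cent passed on
    return paid

paid = allocate_with_carry(333, {"A": 5, "B": 3, "C": 2})
```

For the sample data this gives 166, 100 and 67 cents, matching the table above. Unlike the largest-remainder approach, the result depends on the order in which the items are visited.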
For something like this you are not going to be able to apply an exact distribution; as you have already shown, the rounding results in the total exceeding the payment received.
You will therefore need to distribute "whatever is left" to the final [Billed], so you'll need to do 2 things...
Determine if the current row is the final row in that group.
Determine how much of the payment has already been distributed.
You don't give much data to work with here, so the following is not ideal, however this is along the lines of what you want...
SELECT
ID,
CASE WHEN lead(billed,1) OVER(ORDER BY (SELECT 1)) IS NULL THEN @TotalPaid - (sum(round(@TotalPaid * Billed / (Select Sum(Billed) from @t),2)) OVER(ORDER BY (SELECT 1) ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING))
ELSE round(@TotalPaid * Billed / (Select Sum(Billed) from @t),2)
END AS solution
FROM
@t;
Note that if A, B, C belong to a higher-level key, that key would make up the "group", so you would adjust the window functions accordingly. If you could supply some more sample data with additional columns, I could maybe come up with a more elegant solution.
I would like to select the top 1% of rows; however, I cannot use subqueries to do it. I.e., this won't work:
SELECT * FROM mytbl
WHERE var='value'
ORDER BY id,random()
LIMIT(SELECT (COUNT(*) * 0.01)::integer FROM mytbl)
How would I accomplish the same output without using a subquery with limit?
You can utilize PERCENT_RANK:
WITH cte(ID, var, pc) AS
(
SELECT ID, var, PERCENT_RANK() OVER (ORDER BY random()) AS pc
FROM mytbl
WHERE var = 'value'
)
SELECT *
FROM cte
WHERE pc <= 0.01
ORDER BY id;
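For intuition about the cutoff, here is a small Python sketch of the same selection (the function name and inputs are hypothetical). PERCENT_RANK is (rank - 1) / (rows - 1), so after a random shuffle the kept rows are those whose position falls within the first 1% of that scale.

```python
import random

def sample_by_percent_rank(rows, fraction=0.01, rng=random):
    """Shuffle the rows, then keep those whose percent rank is <= fraction."""
    shuffled = list(rows)
    rng.shuffle(shuffled)                 # stands in for ORDER BY random()
    n = len(shuffled)
    kept = []
    for rank, row in enumerate(shuffled, start=1):
        # PERCENT_RANK is defined as 0 when the partition has a single row
        pc = 0.0 if n == 1 else (rank - 1) / (n - 1)
        if pc <= fraction:
            kept.append(row)
    return kept

sample = sample_by_percent_rank(range(1000))
```

Because the first row always has a percent rank of 0, at least one row is returned for any non-empty input, even when 1% of the row count would round down to zero.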
SqlFiddleDemo
I solved it with Python using the psycopg2 package:
cur.execute("""SELECT ROUND(COUNT(id)*0.01,0)
FROM mytbl""")
# fetch the single count value and convert it to an int
nrows = int(cur.fetchone()[0])
cur.execute("""SELECT *
FROM mytbl
WHERE var='value'
ORDER BY id, random() LIMIT %s""", (nrows,))
Perhaps there is a more elegant solution using just SQL, or a more efficient one, but this does exactly what I'm looking for.
If I got it right, you need:
Random 1% sample of all rows,
If some id is within the sample, all rows with the same id must be there too.
The following SQL should do the trick:
with ids as (
select id,
total,
sum(cnt) over (order by max(rnd)) running_total
from (
select id,
count(*) over (partition by id) cnt,
count(*) over () total,
row_number() over(order by random()) rnd
from mytbl
) q
group by id,
cnt,
total
)
select mytbl.*
from mytbl,
ids
where mytbl.id = ids.id
and ids.running_total <= ids.total * 0.01
order by mytbl.id;
I don't have your data, of course, but I have no trouble using a subquery in the LIMIT clause.
However, the subquery contains only the count(*) part, and I then multiply the result by 0.01:
SELECT * FROM mytbl
WHERE var='value'
ORDER BY id,random()
LIMIT(SELECT count(*) FROM mytbl)*0.01;
Given the following data:
sequence | amount
1 100000
1 20000
2 10000
2 10000
I'd like to write a sql query that gives me the sum of the current sequence, plus the sum of the previous sequence. Like so:
sequence | current | previous
1 120000 0
2 20000 120000
I know the solution likely involves windowing functions but I'm not too sure how to implement it without subqueries.
SQL Fiddle
select
seq,
amount,
lag(amount::int, 1, 0) over(order by seq) as previous
from (
select seq, sum(amount) as amount
from sa
group by seq
) s
order by seq
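The two steps in that query, aggregate per sequence and then look one row back, can be sketched in Python like this (the function name is invented; the rows are the sample data from the question):

```python
from collections import defaultdict

def current_and_previous(rows):
    """Sum amounts per sequence, then pair each sum with the previous one."""
    totals = defaultdict(int)
    for seq, amount in rows:              # the inner GROUP BY / SUM
        totals[seq] += amount
    result, previous = [], 0              # 0 is the default, as in lag(..., 1, 0)
    for seq in sorted(totals):            # ORDER BY seq
        result.append((seq, totals[seq], previous))
        previous = totals[seq]
    return result

rows = [(1, 100000), (1, 20000), (2, 10000), (2, 10000)]
summary = current_and_previous(rows)
```

Here "previous" means the preceding sequence value that actually exists, so gaps in the numbering do not matter, which is also how lag() behaves.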
If your sequence is sequential, without gaps, you can simply do:
SELECT t1.sequence,
SUM(t1.amount),
(SELECT SUM(t2.amount) from mytable t2 WHERE t2.sequence = t1.sequence - 1)
FROM mytable t1
GROUP BY t1.sequence
ORDER BY t1.sequence
Otherwise, instead of t2.sequence = t1.sequence - 1 you could do:
SELECT t1.sequence,
SUM(t1.amount),
(SELECT SUM(t2.amount)
from mytable t2
WHERE t2.sequence = (SELECT MAX(t3.sequence)
FROM mytable t3
WHERE t3.sequence < t1.sequence))
FROM mytable t1
GROUP BY t1.sequence
ORDER BY t1.sequence;
You can see both approaches in this fiddle