SQL count distinct id too slow (~7 seconds) - postgresql

I have a query as such:
SELECT disease_name, COUNT(DISTINCT id)
FROM disease_table
GROUP BY disease_name
where each disease_name has an associated identifier, and a disease may occur multiple times for the same identifier.
This works, BUT it takes roughly 7s to run.
If I run this query:
SELECT disease_name, COUNT(disease_name)
FROM disease_table
GROUP BY disease_name
it takes 321ms, BUT duplicate rows (same disease with same id) are counted more than once.
Is there a more efficient way to achieve the results of the first query in about the same time as the second using only SQL?
Table:
disease_name | id
------------ | ---
dis_1        | 123
dis_1        | 104
dis_1        | 104
dis_32       | 123
dis_12       | 123
dis_12       | 115
Expected:
disease_name | count
------------ | -----
dis_1        | 2
dis_32       | 1
dis_12       | 2
where dis_1 has 3 entries but is only counted twice, because two of those 3 entries have the same id.

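Try adding a composite index on disease_table that covers both columns, like this:
CREATE INDEX ON disease_table(disease_name, id);
With both columns in the index, PostgreSQL may be able to answer the query from the index alone. See if that solves your issue.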
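If the index alone is not enough, a rewrite that is often suggested for slow COUNT(DISTINCT ...) queries in PostgreSQL is to deduplicate the (disease_name, id) pairs in a subquery and then count plain rows. This is only a sketch: the alias names distinct_pairs and id_count are made up for illustration, and whether it is actually faster depends on the data and the PostgreSQL version, so compare both forms with EXPLAIN ANALYZE.

```sql
-- Deduplicate (disease_name, id) pairs first, then count the remaining rows per disease.
-- COUNT(id) rather than COUNT(*) keeps the behaviour of COUNT(DISTINCT id) for NULL ids.
SELECT disease_name, COUNT(id) AS id_count
FROM (
    SELECT DISTINCT disease_name, id
    FROM disease_table
) AS distinct_pairs
GROUP BY disease_name;
```

On the sample table from the question this yields 2 for dis_1, 1 for dis_32 and 2 for dis_12, the same as the COUNT(DISTINCT id) query.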

Related

How can 'brand new, never before seen' IDs be counted per month in redshift?

A fair amount of material is available detailing methods that use dense_rank() and the like to count distinct values per month. However, I've been unable to find anything that counts distinct ids per month while also excluding any ids that have already been seen in prior months.
The data can be imagined like so:
id (int8 type) | observed time (timestamp utc)
------------------
1 | 2017-01-01
2 | 2017-01-02
1 | 2017-01-02
1 | 2017-02-02
2 | 2017-02-03
3 | 2017-02-04
1 | 2017-03-01
3 | 2017-03-01
4 | 2017-03-01
5 | 2017-03-02
The process of the count can be seen as:
1: in 2017-01 we saw devices 1 and 2 so the count is 2
2: in 2017-02 we saw devices 1, 2 and 3. We know already about devices 1 and 2, but not 3, so the count is 1
3: in 2017-03 we saw devices 1, 3, 4 and 5. We already know about 1 and 3, but not 4 or 5, so the count is 2.
with the desired output being something like:
observed time | count of new id
--------------------------
2017-01 | 2
2017-02 | 1
2017-03 | 2
Explicitly, I am looking to have a new table, with an aggregated month per row, with a count of how many new ids occur within that month that have not been seen at all before.
The IRL case allows devices to be seen more than once in a month, but this shouldn't impact the count. It also uses integer for storage (both positive and negative) of the id, and time periods will be to the second in true timestamps. The size of the data set is also significant.
My initial attempt is along the lines of:
WITH records_months AS (
    SELECT *,
           date_trunc('month', observed_time) AS month_group
    FROM my_table
    WHERE observed_time > '2017-01-01'),
id_months AS (
    SELECT DISTINCT
           month_group,
           id
    FROM records_months)
SELECT *
FROM id_months
However, I'm stuck on the next part, i.e. counting the number of new ids that were not seen in prior months. I believe the solution might be a window function, but I'm having trouble working out which one or how.

First thing I thought of. The idea is to
(innermost query) calculate the earliest month that each id was seen,
(next level up) join that back to the main my_table dataset, and then
(outer query) count distinct ids by month after nulling out the already-seen ids.
I tested it out and got the desired result set. Joining the earliest month back to the original table seemed like the most natural thing to do (vs. a window function). Hopefully this is performant enough for your Redshift!
select observed_month,
       -- Null out the id if the observed_month that we're grouping by
       -- is NOT the earliest month that the id was seen.
       -- Then count distinct id
       count(distinct(case when observed_month != earliest_month then null else id end)) as num_new_ids
from (
    select t.id,
           date_trunc('month', t.observed_time) as observed_month,
           earliest.earliest_month
    from my_table t
    join (
        -- What's the earliest month an id was seen?
        select id,
               date_trunc('month', min(observed_time)) as earliest_month
        from my_table
        group by 1
    ) earliest
      on t.id = earliest.id
) joined  -- alias added so the derived table is also accepted by engines that require one
group by 1
order by 1;
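A shorter variant of the same idea, offered as a sketch rather than a tested Redshift solution: since every id is "new" in exactly one month (the month it was first seen), it may be enough to compute that month per id and count ids per first-seen month, without joining back to my_table. The alias first_seen and the column name first_month are invented for this example.

```sql
-- Count each id once, in the month of its earliest observation.
SELECT first_month AS observed_month,
       COUNT(*)    AS num_new_ids
FROM (
    -- One row per id with the month it was first seen
    SELECT id,
           date_trunc('month', MIN(observed_time)) AS first_month
    FROM my_table
    GROUP BY id
) AS first_seen
GROUP BY first_month
ORDER BY first_month;
```

For the sample data this gives 2, 1 and 2 for 2017-01, 2017-02 and 2017-03. One difference from the join-based query: a month in which no new id appears produces no row here, rather than a row with a count of 0.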

Filter rows based on two fields, where one of them contains a selection criterion

Given the following table
group | weight | category_id | category_name_plus
1     | 10     | 100         | Ab
1     | 20     | 101         | Bcd
1     | 30     | 100         | Efghij
2     | 10     | 101         | Bcd
2     | 20     | 101         | Cdef
2     | 30     | 100         | Defgh
2     | 40     | 100         | Ab
3     | 10     | 102         | Fghijkl
3     | 20     | 101         | Ab
The "weight" is unique for each group and is also an indicator for the order of records inside the group.
What I want is to retrieve one record per group filtered by category_id, but only the record having the highest "weight" inside its "group".
Example for filtering by category_id = 100:
group | weight | category_id | category_name_plus
1     | 30     | 100         | Efghij
2     | 40     | 100         | Ab
Example for filtering by category_id = 101:
group | weight | category_id | category_name_plus
1     | 20     | 101         | Bcd
2     | 20     | 101         | Cdef
3     | 20     | 101         | Ab
How can I select just these rows?
I tried fiddling with UNIQUE, MAX(category_id) etc. but I'm still unable to get the correct results. The main problem for me is to get the category_name_plus value here.
I am working with PostgreSQL 9.4(beta 3), because I also need various other niceties like "WITH ORDINALITY" etc.

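The rank window function should do the trick, as long as the category filter is applied before ranking:
SELECT "group", weight, category_id, category_name_plus
FROM   (SELECT "group", weight, category_id, category_name_plus,
               RANK() OVER (PARTITION BY "group"
                            ORDER BY weight DESC) AS rk
        FROM   my_table
        -- Filter inside the subquery, so that rk = 1 is the heaviest
        -- row of this category within each group
        WHERE  category_id = 101) t
WHERE  rk = 1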
Note:
"group" is a reserved word in SQL, so it has to be surrounded by quotes in order to be used as a column name. It would probably be better, though, to replace it with a non-reserved word, such as "group_id".

Try something like:
SELECT DISTINCT ON ("group") *
FROM your_table
WHERE category_id = 101
ORDER BY "group", weight DESC

Subselect on array_agg in postgresql

Is there a way to use a value from an aggregate function in a having clause in Postgresql 9.2+?
For example, I would like to get each monkey_id with a 2nd highest number > 123, as well as the second highest number. In the example below, I'd like to get (monkey_id 1, number 222).
monkey_id | number
------------------
1 | 222
1 | 333
2 | 0
2 | 444
SELECT
monkey_id,
(array_agg(number ORDER BY number desc))[2] as second
FROM monkey_numbers
GROUP BY monkey_id
HAVING second > 123
I get column "second" does not exist.

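You will have to repeat the aggregate expression in the HAVING clause:
SELECT
    monkey_id,
    (array_agg(number ORDER BY number DESC))[2] AS second
FROM monkey_numbers
GROUP BY monkey_id
HAVING (array_agg(number ORDER BY number DESC))[2] > 123
The explanation is that HAVING is evaluated before the SELECT list, so the alias second does not exist yet at that point. Note the parentheses around array_agg(...): PostgreSQL requires them before a subscript can be applied to a function result.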
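To avoid writing the aggregate twice, one option is to compute it in a subquery and filter on the alias in the outer query, where it is an ordinary column. This is a sketch only; the names ranked and second_highest are chosen for this example and are not from the question.

```sql
-- Compute the second-highest number per monkey once, then filter on it by name.
SELECT monkey_id, second_highest
FROM (
    SELECT monkey_id,
           (array_agg(number ORDER BY number DESC))[2] AS second_highest
    FROM monkey_numbers
    GROUP BY monkey_id
) AS ranked
WHERE second_highest > 123;
```

With the sample rows from the question, this returns monkey_id 1 with 222; monkey_id 2 is dropped because its second-highest number is 0.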

TSQL Join to get all records from table A for each record in table B?

I have two tables:
Periods table:
PeriodId  Period
--------  -------
1         Week 1
2         Week 2
3         Week 3
Worked table:
EmpId  PeriodId  ApprovedDate
-----  --------  ------------
1      1         Null
1      2         2/28/2013
2      2         2/28/2013
I am trying to write a query that results in this:
EmpId  Period  Worked  ApprovedDate
-----  ------  ------  ------------
1      Week 1  Yes     Null
1      Week 2  Yes     2/28/2013
1      Week 3  No      Null
2      Week 1  No      Null
2      Week 2  Yes     2/28/2013
2      Week 3  No      Null
The idea is that I need each Period from the Periods table for each Emp. If there is no matching record in the Worked table, then 'No' is placed in the Worked field.
What does the TSQL look like to get this result?
(Note: if it helps I also have access to an Employee table that has EmpId and LastName for each employee. For performance reasons I'm hoping not to need this but if I do then so be it.)

You should be able to use the following:
select p.empid,
       p.period,
       case
         when w.PeriodId is not null
           then 'Yes'
         else 'No'
       end Worked,
       w.ApprovedDate
from
(
  select p.periodid, p.period, e.empid
  from periods p
  cross join (select distinct EmpId from worked) e
) p
left join worked w
  on p.periodid = w.periodid
  and p.empid = w.empid
order by p.empid
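The question mentions an Employee table with EmpId and LastName. As a sketch of that alternative (assuming the table is literally named Employee), the employee list can come from it instead of from the distinct EmpId values in Worked. The practical difference is that employees with no Worked rows at all would then also appear, with 'No' for every period.

```sql
-- Every employee paired with every period, then matched against Worked.
SELECT e.EmpId,
       p.Period,
       CASE WHEN w.PeriodId IS NOT NULL THEN 'Yes' ELSE 'No' END AS Worked,
       w.ApprovedDate
FROM Employee AS e
CROSS JOIN Periods AS p
LEFT JOIN Worked AS w
       ON w.EmpId = e.EmpId
      AND w.PeriodId = p.PeriodId
ORDER BY e.EmpId, p.PeriodId;
```

Which source is cheaper depends on table sizes and indexes, so it is worth checking the execution plan for both versions.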

need help writing a date sensitive T-SQL query

I need help writing a T-SQL query that will generate 52 rows of data per franchise from a table that will often contain gaps in the 52 week sequence per franchise (i.e., the franchise may have reported data bi-weekly or has not been in business for a full year).
The table I'm querying against looks something like this:
FranchiseId | Date | ContractHours | PrivateHours
and I need to join it to a table similar to this:
FranchiseId | Name
The output of the query needs to look like this:
Name | Date | ContractHours | PrivateHours
---- ---------- ------------- ------------
AZ1 08-02-2011 292 897
AZ1 07-26-2011 0 0 -- default to 0's for gaps in sequence
...
AZ1 08-03-2010 45 125 -- row 52 for AZ1
AZ2 08-02-2011 382 239
...
AZ2 07-26-2011 0 0 -- row 52 for AZ2
I need this style of output for every franchise, i.e., 52 rows of data with default rows for any gaps in the 52 week sequence, in a single result set. Thus, if there are 100 franchises, the result set should be 5200 rows.
What I've Tried
I've tried the typical suggestions:
Create a table with all possible dates
LEFT OUTER JOIN this to the table of data needed
The problems I'm running into are
ensuring that for every franchise there are 52 rows, and
filling in gaps with the franchise name and 0 for hours; I can't have the following in the result set:
Name | Date | ContractHours | PrivateHours
---- ---------- ------------- ------------
NULL 08-02-2011 NULL NULL
I don't know where to go from here. Is there an efficient way to write a T-SQL query that will produce the required output?

The bare bones is this
Generate 52 week ranges
Cross join with Franchise
LEFT JOIN the actual date
ISNULL to substitute zeroes
So, like this, untested
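;WITH cDATE AS
(
    -- Anchor row: the first week of the range
    SELECT
        CAST('20100101' AS date) AS StartOfWeek,
        DATEADD(day, 6, CAST('20100101' AS date)) AS EndOfWeek
    UNION ALL
    -- Step forward one week at a time; the last start date is 20101224,
    -- which gives exactly 52 weeks
    SELECT DATEADD(day, 7, StartOfWeek), DATEADD(day, 7, EndOfWeek)
    FROM cDATE WHERE DATEADD(day, 7, StartOfWeek) < '20101231'
), Possibles AS
(
    -- Every week paired with every franchise
    SELECT
        C.StartOfWeek, C.EndOfWeek, F.FranchiseID, F.Name
    FROM
        cDATE C CROSS JOIN Franchise F
)
SELECT
    P.Name,
    P.StartOfWeek,
    ISNULL(SUM(O.ContractHours), 0) AS ContractHours,
    ISNULL(SUM(O.PrivateHours), 0) AS PrivateHours
FROM
    Possibles P
    LEFT JOIN
    TheOtherTable O ON P.FranchiseID = O.FranchiseID AND
                       O.Date BETWEEN P.StartOfWeek AND P.EndOfWeek
GROUP BY
    P.FranchiseID, P.Name, P.StartOfWeek
ORDER BY
    P.Name, P.StartOfWeek DESC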