PostgreSQL: Flagging and Identifying Duplicates

I'm trying to find a way to mark duplicated cases similar to this question.
However, instead of counting occurrences of duplicated values, I'd like to mark them as 0 and 1, for duplicated and unique cases respectively. This is very similar to SPSS's "identify duplicate cases" function. For example, if I have a dataset like:
Name State Gender
John TX M
Katniss DC F
Noah CA M
Katniss CA F
John SD M
Ariel FL F
If I wanted to flag rows with a duplicated name, the output would look something like this:
Name State Gender Dup
John TX M 1
Katniss DC F 1
Noah CA M 1
Katniss CA F 0
John SD M 0
Ariel FL F 1
A bonus would be a query statement that will handle which case to pick when determining the unique case.

SELECT name, state, gender
, NOT EXISTS (SELECT 1 FROM names nx
WHERE nx.name = na.name
AND nx.gender = na.gender
AND nx.ctid < na.ctid) AS Is_not_a_dup
FROM names na
;
Explanation: [NOT] EXISTS(...) yields a boolean value, which can be converted to an integer. Casting to integer requires an extra pair of parentheses, though:
SELECT name, state, gender
, (NOT EXISTS (SELECT 1 FROM names nx
WHERE nx.name = na.name
AND nx.gender = na.gender
AND nx.ctid < na.ctid))::integer AS is_not_a_dup
FROM names na
;
Results:
DROP SCHEMA
CREATE SCHEMA
SET
CREATE TABLE
INSERT 0 6
name | state | gender | nodup
---------+-------+--------+-------
John | TX | M | t
Katniss | DC | F | t
Noah | CA | M | t
Katniss | CA | F | f
John | SD | M | f
Ariel | FL | F | t
(6 rows)
name | state | gender | nodup
---------+-------+--------+-------
John | TX | M | 1
Katniss | DC | F | 1
Noah | CA | M | 1
Katniss | CA | F | 0
John | SD | M | 0
Ariel | FL | F | 1
(6 rows)
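The same NOT EXISTS idea can be tried outside PostgreSQL. The following is only a sketch under stated assumptions, not part of the answer above: it uses Python's built-in sqlite3 module, an explicit id column (hypothetical, standing in for ctid), and it matches on name only, as the question asks. Flipping the comparison on id (or comparing another column) changes which row in each group is treated as the unique case.

```python
import sqlite3

# Sample rows from the question; the id column is an assumption that
# plays the role ctid plays in the PostgreSQL query.
rows = [
    (1, "John", "TX", "M"),
    (2, "Katniss", "DC", "F"),
    (3, "Noah", "CA", "M"),
    (4, "Katniss", "CA", "F"),
    (5, "John", "SD", "M"),
    (6, "Ariel", "FL", "F"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE names (id INTEGER PRIMARY KEY, name TEXT, state TEXT, gender TEXT)"
)
conn.executemany("INSERT INTO names VALUES (?, ?, ?, ?)", rows)

# A row is flagged 1 when no earlier row (smaller id) has the same name.
query = """
SELECT name, state, gender,
       NOT EXISTS (SELECT 1 FROM names nx
                   WHERE nx.name = na.name
                     AND nx.id < na.id) AS dup
FROM names na
ORDER BY na.id
"""
flags = conn.execute(query).fetchall()
for row in flags:
    print(row)
```

Using nx.id > na.id instead would keep the last occurrence of each name as the unique case.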

Related

How to compute frequency/count of concurrent events by combination in postgresql?

I am looking for a way to identify event names that co-occur, i.e., correlate event names with the same start (startts) and end (endts) times. The events are exactly concurrent (partial overlap is not a feature of this database, which makes this criterion a bit simpler to satisfy).
toy dataframe
+------------------+
|name startts endts|
| A 02:20 02:23 |
| A 02:23 02:25 |
| A 02:27 02:28 |
| B 02:20 02:23 |
| B 02:23 02:25 |
| B 02:25 02:27 |
| C 02:27 02:28 |
| D 02:27 02:28 |
| D 02:28 02:31 |
| E 02:27 02:28 |
| E 02:29 02:31 |
+------------------+
Ideal output:
+---------------------------+
|combination| count |
+---------------------------+
| AB | 2 |
| AC | 1 |
| AE | 1 |
| AD | 1 |
| BC | 0 |
| BD | 0 |
| BE | 0 |
| CE | 0 |
+-----------+---------------+
Naturally, I would have tried a loop but I recognize PostgreSQL is not optimal for this.
What I've tried is generating a temporary table by selecting for distinct name and startts and endts combinations and then doing a left join on the table itself (selecting name).
User @GMB provided the following (modified) solution; however, the performance is not satisfactory given the size of the database (even running the query on a time window of 10 minutes never completes). For context, there are about 300-400 unique names; so about 80200 combinations (if my math checks out). Order is not important for the combinations.
@GMB's attempt:
I understand this as a self-join, aggregation, and a conditional count of matching intervals:
select t1.name name1, t2.name name2,
sum(case when t1.startts = t2.startts and t1.endts = t2.endts then 1 else 0 end) cnt
from mytable t1
inner join mytable t2 on t2.name > t1.name
group by t1.name, t2.name
order by t1.name, t2.name
Demo on DB Fiddle:
name1 | name2 | cnt
:---- | :---- | --:
A | B | 2
A | C | 1
A | D | 1
A | E | 1
B | C | 0
B | D | 0
B | E | 0
C | D | 1
C | E | 1
D | E | 1
@GMB notes that, if you are looking for a count of overlapping intervals, all you have to do is change the sum() to:
sum(t1.startts <= t2.endts and t1.endts >= t2.startts) cnt
Version = PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.19097
Thank you.
Consider the following in MySQL (where your DBFiddle points to):
SELECT name, COUNT(*)
FROM (
SELECT group_concat(name ORDER BY name) name
FROM mytable
GROUP BY startts, endts
ORDER BY name
) as names
GROUP BY name
ORDER BY name
Equivalent in PostgreSQL:
SELECT name, COUNT(*)
FROM (
SELECT string_agg(name, ',' ORDER BY name) name
FROM mytable
GROUP BY startts, endts
ORDER BY name
) as names
GROUP BY name
ORDER BY name
First, you create a list of concurrent events (in the subquery), and then you count them.
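To make the grouping step concrete, here is a rough sketch in plain Python using the toy data from the question. It is an illustration under assumptions, not the query above: it groups names by their exact (startts, endts) interval, as the subquery does, but then expands each group into name pairs (closer to the output the question asks for) instead of counting whole concatenated groups. The variable names are made up for this example.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Toy data from the question: (name, startts, endts).
events = [
    ("A", "02:20", "02:23"), ("A", "02:23", "02:25"), ("A", "02:27", "02:28"),
    ("B", "02:20", "02:23"), ("B", "02:23", "02:25"), ("B", "02:25", "02:27"),
    ("C", "02:27", "02:28"),
    ("D", "02:27", "02:28"), ("D", "02:28", "02:31"),
    ("E", "02:27", "02:28"), ("E", "02:29", "02:31"),
]

# Step 1: collect the names that share exactly the same interval.
names_by_interval = defaultdict(set)
for name, startts, endts in events:
    names_by_interval[(startts, endts)].add(name)

# Step 2: count every unordered pair of names inside each group.
pair_counts = Counter()
for names in names_by_interval.values():
    for pair in combinations(sorted(names), 2):
        pair_counts[pair] += 1

for (name1, name2), cnt in sorted(pair_counts.items()):
    print(name1, name2, cnt)
```

Only pairs that co-occur at least once appear in the result; pairs with a count of 0 would have to be added separately if they are needed.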

Create timeline chart based on annual data

How does one automatically create a continuous chart over time based on data that is only available as each year?
For example, most data comes in the form of the following:
j | f | m | a | m | j | j | a | s | o | n | d |
year 1 x y
year 2 z
year 3
However, in order to create a chart that spans multiple years, I need the data transposed and combined as below:
year 1 | j x
| f y
| m
| a
| m
| j
| j
| a
| s
| o
| n
| d
year 2 | j
| f
| m z
| a
| m
| j
| j
| a
| s
| o
| n
| d
year 3 | j
| f
| m
| a
| m
| j
| j
| a
| s
| o
| n
| d
Is there a simple way to do this with pivot tables or something else?
Assuming your data is in Sheet1, put this in Sheet2 A2:
=TRANSPOSE(SPLIT(ArrayFormula(JOIN(" , , , , , , , , , , , ,",FILTER(Sheet1!A2:A,Sheet1!A2:A <> ""))),","))
In Sheet2 B2 put:
=ArrayFormula(transpose(split(rept(join(",",transpose(Sheet1!B1:N1)),countA(Sheet1!A2:A)),",")))
These will expand if years are added.
To get the data, this will work, however, a query will need to be added for each additional year:
=transpose({query(Sheet1!B2:M2,"select *"),query(Sheet1!B3:M3,"select *"),query(Sheet1!B4:M4,"select *")})
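For readers who would rather reshape the data outside the spreadsheet, the same wide-to-long transformation can be sketched in a few lines of Python. This is only an illustration under assumptions: the dictionary below is a hypothetical stand-in for the sheet in the question, with x, y and z placed where the desired layout shows them.

```python
# Month labels in the same order as the column headers in the question.
MONTHS = ["j", "f", "m", "a", "m", "j", "j", "a", "s", "o", "n", "d"]

# Hypothetical wide data: one list of 12 monthly values per year,
# with empty strings for months that have no value.
wide = {
    "year 1": ["x", "y"] + [""] * 10,
    "year 2": ["", "", "z"] + [""] * 9,
    "year 3": [""] * 12,
}

# Unpivot: one (year, month, value) row per cell, year by year,
# which yields a single continuous series suitable for charting.
long_rows = [
    (year, month, value)
    for year, values in wide.items()
    for month, value in zip(MONTHS, values)
]

for row in long_rows:
    print(row)
```

Adding another year to the dictionary extends the output without any further changes.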

SUM of two level group by in postgresql

I have three tables, as given below:
student
id name stand_id sub_id gender
---------------------------------------
1 | Joe | 1 | 1 | M
2 | Saun | 2 | 1 | F
3 | Paul | 1 | 2 | F
4 | Sena | 2 | 2 | M
Subject
id name
1 Math
2 English
Standard
id name
1 First
2 Second
How can I achieve this kind of multi-level grouping: by standard, then by subject, then the total number of boys and girls?
Should I use WITH, UNION or UNION ALL?
First
Math
boys total
girls total
second
math
boys total
girls total
It's not completely clear what you are attempting. My interpretation is that you are looking for the total of students by standard, subject and gender.
If that is correct, you need to join together the tables and count the students at the appropriate grain, like so:
SELECT
sta.name AS standard_name,
sub.name AS subject_name,
CASE stu.gender WHEN 'M' THEN 'Boys' ELSE 'Girls' END AS student_gender,
COUNT(stu.id) AS total
FROM
student stu
JOIN
subject sub
ON (stu.sub_id = sub.id)
JOIN
standard sta
ON (stu.stand_id = sta.id)
GROUP BY
standard_name,
subject_name,
student_gender;
Based on your sample data, it would return this:
standard_name | subject_name | student_gender | total
-----------------------------------------------------
First | Math | Boys | 1
First | English | Girls | 1
Second | Math | Girls | 1
Second | English | Boys | 1
Is this what you are looking for?
SELECT sd.name,
sj.name,
count(st.gender) filter (
WHERE st.gender='M') AS MALE,
count(st.gender) filter (
WHERE st.gender='F') AS FEMALE
FROM Standard sd
INNER JOIN Student st ON (st.stand_id=sd.id)
INNER JOIN Subject sj ON (sj.id=st.sub_id)
GROUP BY sd.name,
sj.name;
name | name | male | female
--------+---------+------+--------
First | Math | 1 | 0
First | English | 0 | 1
Second | English | 2 | 1
Second | Math | 0 | 1
(4 rows)
I have added some more rows to second English.

Add New Line Character with Multiple Columns in T-SQL

I have a table that has ID, AddrID, and Addr columns.
An ID can be attached to multiple Addr values, and each address has its own AddrID.
When an ID has multiple addresses, I want each address on its own line without repeating the ID, so the ID is not shown on every row.
I hope that makes sense.
This will eventually become an SSRS report.
The desired output would be something as so:
+----+--------+------------+
| ID | AddrID | Addr |
+----+--------+------------+
| 1 | S1 | 123 N St |
| 2 | S2 | 456 S ST |
| | S3 | 789 W ST |
| | S4 | 987 E ST |
| 3 | S1 | 123 N St |
| | S5 | 147 Elm ST |
| | S6 | 258 SQL St |
+----+--------+------------+
I tried to use:
declare @nl as char(2) = char(13) + char(10)
but it's just not working.
Presentation should be done in the presentation layer (Reporting Services in this instance) not in the database or query.
You can do this two ways:
Grouping
Add a Row Group on ID and this will happen automatically.
Expression
You can hide the ID field by putting an expression on the Visibility-Hidden property:
=Fields!ID.Value = Previous(Fields!ID.Value)
This hides the ID field if it is the same as the one on the previous row.

Ranking rows according to some number

I have a table like this:
id | name
----------
1 | A
2 | B
5 | C
100 | D
200 | E
201 | F
202 | G
I need to rank the rows from 1 to 3, repeating, ordered by id. That is, I need this result:
id | name | ranking
---------------------------
1 | A | 1
2 | B | 2
5 | C | 3
100 | D | 1
200 | E | 2
201 | F | 3
202 | G | 1
How can I do this?
P.S.
I am trying:
SELECT id, name, row_number() OVER( order by id RANGE BETWEEN 1 AND 3 ) AS ranking FROM t
This gives a syntax error.
RANGE is actually used for something else:
http://www.postgresql.org/docs/current/static/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS
http://www.postgresql.org/docs/current/static/sql-select.html
Try using a modulus instead:
SELECT id, name, 1 + (row_number() OVER( order by id ) - 1) % 3 AS ranking
FROM t
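The arithmetic behind the modulus can be checked with a small Python sketch using the table from the question. It is an illustration only: enumerate() is 0-based, so its index plays the role of row_number() - 1 in the query.

```python
# Rows from the question: (id, name).
rows = [(1, "A"), (2, "B"), (5, "C"), (100, "D"), (200, "E"), (201, "F"), (202, "G")]

# Sort by id, then cycle the ranking through 1, 2, 3.
# index is 0-based, so 1 + index % 3 mirrors
# 1 + (row_number() - 1) % 3 in the SQL above.
ranked = [
    (id_, name, 1 + index % 3)
    for index, (id_, name) in enumerate(sorted(rows))
]

for row in ranked:
    print(row)
```

Subtracting 1 before taking the modulus and adding 1 afterwards is what keeps the ranking in the range 1 to 3 rather than 0 to 2.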