Create dummy variables in PostgreSQL - postgresql

Is it possible to create a dummy variable when querying
For instance the query below will give me only the observations that satisfy the var1 conditions. I also want the remaining observations but with some kind of tag on it (0/1, indicator values would be sufficient)
SELECT distinct ON (id) id,var1,var2,var3
FROM table
where var2 = ANY('{blue,yellow}');
Have
+-----+------+--------+------+
| id | Var1 | Var2 | Var3 |
+-----+------+--------+------+
| 345 | 12 | Blue | 3456 |
| 345 | 12 | Red | 2134 |
| 346 | 45 | Blue | 3451 |
| 347 | 25 | yellow | 1526 |
+-----+------+--------+------+
Want
+-----+------+--------+------+--------------------+
| id | Var1 | Var2 | Var3 | Indicator variable |
+-----+------+--------+------+--------------------+
| 345 | 12 | Blue | 3456 | 1 |
| 345 | 12 | Red | 2134 | 0 |
| 346 | 45 | Blue | 3451 | 1 |
| 347 | 25 | yellow | 1526 | 1 |
+-----+------+--------+------+--------------------+

Instead of a expression in where you can use an expression in select output expressions:
=> select a, a = any('{1,2,3,5,7}') as asmallprime
from generate_series(1,10) as a;
a | asmallprime
----+-------------
1 | t
2 | t
3 | t
4 | f
5 | t
6 | f
7 | t
8 | f
9 | f
10 | f
(10 rows)

Tometzky's answer is sufficient, but if you want something more complex you can also use CASE statements.
Tometzky's example using CASE with an extra indicator
SELECT a, CASE WHEN a = any('{1,2,3,5,7}') THEN 'YES'
WHEN a = any('{4,9}') THEN 'SQUARE' ELSE 'NO' END as asmallprime
FROM generate_series(1,10) as a;

Related

postgres sum different table columns from many to one joined data

Suppose I have the following two tables:
foo:
| id | goober | value |
|----|--------|-------|
| 1 | a1 | 25 |
| 2 | a1 | 125 |
| 3 | b2 | 500 |
bar:
| id | foo_id | value |
|----|--------|-------|
| 1 | 1 | 4 |
| 2 | 3 | 19 |
| 3 | 3 | 42 |
| 4 | 3 | 22 |
| 5 | 3 | 56 |
Note the n:1 relationship of bar.foo_id : foo.id.
My goal is to sum the value columns for tables foo and bar, joining on bar.foo_id=foo.id, and finally grouping by goober from foo. Then performing a calculation if possible, though not critical.
Resulting in a final output looking something like:
| goober | foo_value_sum | bar_value_sum | foo_bar_diff |
|--------|---------------|---------------|--------------|
| a1 | 150 | 4 | 146 |
| b2 | 500 | 139 | 361 |
This should be rather simple by the following query that creates two CTEs and then joins them afterwards:
with bar_agg as
(
select foo.goober
,sum(bar.value) bar_value_sum
from foo
join bar
on bar.foo_id = foo.id
group by foo.goober
)
,foo_agg as
(
select foo.goober
,sum(foo.value) foo_value_sum
from foo
group by foo.goober
)
select foo.goober
,foo_value_sum
,bar_value_sum
,foo_value_sum - bar_value_sum foo_bar_diff
from foo_agg foo
left join bar_agg bar
on bar.goober = foo.goober
order by foo.goober

Serial Number in logical order without gaps

I'm trying to generate a serial number based on a few conditions.
My dataset:
+--------+------------+------------+---------+--------+
| Client | Start_Date | End_date | Product | Ser_No |
+--------+------------+------------+---------+--------+
| 44 | 22-01-2018 | 31-12-2018 | A | |
+--------+------------+------------+---------+--------+
| 44 | 24-02-2018 | 01-01-2019 | B | |
+--------+------------+------------+---------+--------+
| 44 | 12-03-2018 | 01-01-2019 | C | |
+--------+------------+------------+---------+--------+
| 100 | 24-01-2018 | 30-11-2018 | A | |
+--------+------------+------------+---------+--------+
| 100 | 26-01-2018 | 15-12-2018 | D | |
+--------+------------+------------+---------+--------+
| 100 | 26-01-2018 | 01-02-2019 | E | |
+--------+------------+------------+---------+--------+
| 100 | 01-03-2018 | 31-01-2019 | F | |
+--------+------------+------------+---------+--------+
What I did to configure my serial number:
RANK() OVER(PARTITION BY Client ORDER BY Client, Start_date ASC)
So now it generates a serial number for my which looks like this:
+--------+------------+------------+---------+--------+
| Client | Start_Date | End_date | Product | Ser_No |
+--------+------------+------------+---------+--------+
| 44 | 22-01-2018 | 31-12-2018 | A | 1 |
+--------+------------+------------+---------+--------+
| 44 | 24-02-2018 | 01-01-2019 | B | 2 |
+--------+------------+------------+---------+--------+
| 44 | 12-03-2018 | 01-01-2019 | C | 3 |
+--------+------------+------------+---------+--------+
| 100 | 24-01-2018 | 30-11-2018 | A | 1 |
+--------+------------+------------+---------+--------+
| 100 | 26-01-2018 | 15-12-2018 | D | 2 |
+--------+------------+------------+---------+--------+
| 100 | 26-01-2018 | 01-02-2019 | E | 2 |
+--------+------------+------------+---------+--------+
| 100 | 01-03-2018 | 31-01-2019 | F | 4 |
+--------+------------+------------+---------+--------+
What goes wrong for my analysis is the last line, it generates the serial number. What it has to be is 3.
Can anayone help me to generate it in this order?
Thanks in advance!
Extra
In addition to my question from yesterday, there is something extra that I need to do. Because the Ser_No has to be the same when my Start_Date is the same, but the Ser_No has also be the same when my folowing records is the same product (also when it has a different Start_Date)
So what I I expect and what I get right now:
+--------+------------+------------+---------+--------+------------+
| Client | Start_Date | End_date | Product | Ser_No | Ser_No New |
+--------+------------+------------+---------+--------+------------+
| 44 | 22-01-2018 | 31-12-2018 | A | 1 | 1 |
+--------+------------+------------+---------+--------+------------+
| 44 | 24-02-2018 | 01-01-2019 | B | 2 | 2 |
+--------+------------+------------+---------+--------+------------+
| 44 | 12-03-2018 | 01-01-2019 | C | 2 | 2 |
+--------+------------+------------+---------+--------+------------+
| 100 | 24-01-2018 | 30-11-2018 | A | 1 | 1 |
+--------+------------+------------+---------+--------+------------+
| 100 | 26-01-2018 | 15-12-2018 | D | 2 | 2 |
+--------+------------+------------+---------+--------+------------+
| 100 | 26-01-2018 | 01-02-2019 | E | 2 | 2 |
+--------+------------+------------+---------+--------+------------+
| 100 | 01-03-2018 | 31-01-2019 | F | 3 | 3 |
+--------+------------+------------+---------+--------+------------+
| 100 | 11-04-2018 | 31-03-2019 | F | 4 | 3 |
+--------+------------+------------+---------+--------+------------+
| 100 | 20-04-2018 | 31-01-2019 | G | 5 | 4 |
+--------+------------+------------+---------+--------+------------+
| 100 | 21-04-2018 | 31-01-2019 | A | 6 | 5 |
+--------+------------+------------+---------+--------+------------+
| 100 | 21-04-2018 | 31-01-2019 | B | 6 | 5 |
+--------+------------+------------+---------+--------+------------+
| 100 | 01-05-2018 | 31-01-2019 | B | 7 | 5 |
+--------+------------+------------+---------+--------+------------+
Any idea on how to achieve this, because I won't get it
You need to use DENSE_RANK instead:
This function returns the rank of each row within a result set partition, with no gaps in the ranking values.
DENSE_RANK() OVER(PARTITION BY Client ORDER BY Start_date) AS Ser_no
Additionaly the Client in ORDER BY has no effect because it has the same value per partition.

Sort using auxiliary fields, start and end

In PostgreSQL, what is the best way to sort records using start and end fields in a generic way, without the need to include in the query the first record (where start_id=3)?
Example table:
+-------+----------+--------+--------+
| FK_ID | START_ID | END_ID | STRING |
+-------+----------+--------+--------+
| 77 | 1 | 9 | E |
| 82 | 5 | 2 | A |
| 77 | 7 | 1 | I |
| 77 | 3 | 7 | W |
| 82 | 9 | 5 | Q |
| 77 | 9 | 5 | X |
| 82 | 2 | 7 | G |
+-------+----------+--------+--------+
Sorted where FK_ID = 77:
+----+---+---+---+
| 77 | 3 | 7 | W |
| 77 | 7 | 1 | I |
| 77 | 1 | 9 | E |
| 77 | 9 | 5 | X |
+----+---+---+---+
Sorted where FK_ID = 82:
+----+---+---+---+
| 82 | 9 | 5 | Q |
| 82 | 5 | 2 | A |
| 82 | 2 | 7 | G |
+----+---+---+---+
Result query sequence:
+-------+----------+
| FK_ID | SEQUENCE |
+-------+----------+
| 82 | QAG |
| 77 | WIEX |
+-------+----------+
I do not think this is the most efficient way but you can try with a recursive CTE
WITH RECURSIVE path AS (
SELECT * FROM myTable AS t1 WHERE NOT EXISTS(
SELECT 1 FROM myTable AS t2 WHERE t1.fk_id = t2.fk_id AND t2.end_id = t1.start_id
) ORDER BY start_id LIMIT 1
UNION ALL
SELECT myTable.* FROM myTable JOIN path ON path.end_id = myTable.start_id
)
SELECT fk_id,array_to_string(array_agg(string)) FROM path GROUP BY fk_id

Comparing Subqueries

I have two subqueries. Here is the output of subquery A....
id | date_lat_lng | stat_total | rnum
-------+--------------------+------------+------
16820 | 2016_10_05_10_3802 | 9 | 2
15701 | 2016_10_05_10_3802 | 9 | 3
16821 | 2016_10_05_11_3802 | 16 | 2
17861 | 2016_10_05_11_3802 | 16 | 3
16840 | 2016_10_05_12_3683 | 42 | 2
17831 | 2016_10_05_12_3767 | 0 | 2
17862 | 2016_10_05_12_3802 | 11 | 2
17888 | 2016_10_05_13_3683 | 35 | 2
17833 | 2016_10_05_13_3767 | 24 | 2
16823 | 2016_10_05_13_3802 | 24 | 2
and subquery B, in which date_lat_lng and stat_total has commonality with subquery A, but id does not.
id | date_lat_lng | stat_total | rnum
-------+--------------------+------------+------
17860 | 2016_10_05_10_3802 | 9 | 1
15702 | 2016_10_05_11_3802 | 16 | 1
17887 | 2016_10_05_12_3683 | 42 | 1
15630 | 2016_10_05_12_3767 | 20 | 1
16822 | 2016_10_05_12_3802 | 20 | 1
16841 | 2016_10_05_13_3683 | 35 | 1
15632 | 2016_10_05_13_3767 | 23 | 1
17863 | 2016_10_05_13_3802 | 3 | 1
16842 | 2016_10_05_14_3683 | 32 | 1
15633 | 2016_10_05_14_3767 | 12 | 1
Both subquery A and B pull data from the same table. I want to delete the rows in that table that share the same ID as subquery A but only where date_lat_lng and stat_total have a shared match in subquery B.
Effectively I need:
DELETE FROM table WHERE
id IN
(SELECT id FROM (subqueryA) WHERE
subqueryA.date_lat_lng=subqueryB.date_lat_lng
AND subqueryA.stat_total=subqueryB.stat_total)
Except I'm not sure where to place subquery B, or if I need an entirely different structure.
Something like this,
DELETE FROM table WHERE
id IN (
SELECT DISTINCT id
FROM subqueryA
JOIN subqueryB
USING (id,date_lat_lng,stat_total)
)

how to return number of records as a part of a select statement?

I'd like to know if there is a way to include row numbers (basically telling me how many records I'm getting back from a database query).
I have the following SQL query
SELECT w.widget_id, w.class_id, wg.name classname, wg.label AS classgroup, c.label, c.seq,
g.name AS group, p.name, p.type, CASE WHEN v.value IS NOT NULL THEN v.value WHEN g2p.value IS NOT NULL THEN g2p.value ELSE p.value END AS value
FROM widgets_to_categories w
INNER JOIN widget_classes c ON w.class_id = c.class_id
JOIN classes_to_param_groups t2g ON c.class_id = t2g.class_id
JOIN widget_groups g ON t2g.group_id = g.group_id
JOIN param_groups_to_params g2p ON t2g.group_id = g2p.group_id
JOIN provisioning_params p ON g2p.param_id = p.param_id
INNER JOIN widget_cat_groups wg ON c.class_group_id = wg.class_group_id
LEFT JOIN widget_values v ON(w.widget_id=v.device_id AND p.param_id=v.param_id AND g.name=v.group_name )
WHERE w.widget_id=8 ORDER BY c.class_id ASC
And it returns data like:
widget_id | class_id | classname | classgroup | label | seq | group | name | type | value
8 | 1 | toy | group A | test label | 1 | toy | reg | text | af
8 | 1 | toy | group A | test label | 1 | reg2 | fall | text | 25327
8 | 1 | toy | group A | test label | 1 | reg2 | pd | text | dvaa
8 | 1 | toy | group A | test label | 1 | reg2 | ext | text | 28235
8 | 1 | toy | group A | test label | 1 | reg1 | ext | text | 28230
8 | 1 | toy | group A | test label | 1 | toy | meec | text | 094F22DE501
8 | 1 | toy | group A | test label | 1 | toy | mmap | text | 0|
8 | 1 | toy | group A | test label | 1 | reg1 | fna | text | 26014
8 | 1 | toy | group A | test label | 1 | reg1 | fall | text | t-123
8 | 1 | toy | group A | test label | 1 | toy | uen | boolean | false
8 | 1 | toy | group A | test label | 1 | toy | adminpd |
I'd like to know if there's a way to have the database auto generate and return another column that is just an identifier for the row, like so:
id |widget_id | class_id | classname | classgroup | label | seq | group | name | type | value
1 | 8 | 1 | toy | group A | test label | 1 | toy | reg | text | af
2 | 8 | 1 | toy | group A | test label | 1 | reg2 | fall | text | 25327
3 | 8 | 1 | toy | group A | test label | 1 | reg2 | pd | text | dvaa
4 | 8 | 1 | toy | group A | test label | 1 | reg2 | ext | text | 28235
5 | 8 | 1 | toy | group A | test label | 1 | reg1 | ext | text | 28230
6 | 8 | 1 | toy | group A | test label | 1 | toy | meec | text | 094F22DE501
7 | 8 | 1 | toy | group A | test label | 1 | toy | mmap | text | 0|
8 | 8 | 1 | toy | group A | test label | 1 | reg1 | fna | text | 26014
9 | 8 | 1 | toy | group A | test label | 1 | reg1 | fall | text | t-123
10 | 8 | 1 | toy | group A | test label | 1 | toy | uen | boolean | false
11 | 8 | 1 | toy | group A | test label | 1 | toy | adminpd | boolean | false
I think I can do this by selecting into a temporary table.. I haven't figured out the syntax on how to do it yet... But I'm also wondering if there's another simpler way.
Once I get the data back from the database, having this ID field makes it eaiser to manipulate.
Thanks.
You can use the row_number window function to keep track of each row number.
Like so:
create table foo
(
id serial,
val text
);
INSERT INTO foo (val)
VALUES ('One'), ('Two'), ('Three');
SELECT f.*, row_number() OVER(ORDER BY val)
FROM foo AS f
ORDER BY val;
Here's an SQL Fiddle which shows this:
http://sqlfiddle.com/#!15/0c434/2
Additional options:
You could count the result with a query of the form:
SELECT count(*)
FROM
(
SELECT *
FROM foo
);
Or you may be able to get the row count back as part of the Postgres library you're using. For example, psycopg2 (Python) and DBI (Perl) allow for this (with some caveats). The library you're using may offer something similar.