PostgreSQL group by bug on Unicode strings? - unicode

I have a very weird thing happening, where I noticed that a group by (word) wasn't always grouping by word if that word is a UTF-8 string. In the same query, I get cases where it's been grouped correctly, and cases where it hasn't. I wonder if anybody knows what's up with that?
select *,count(*) over (partition by md5(word)) as k
from (
select word,count(*) as n
from :tmpwl
group by 1
) a order by 1,2 limit 12;
/* gives:
word | n | k
------+---+---
いい | 1 | 1
くず | 1 | 1
ごみ | 1 | 1
さま | 1 | 1
さん | 1 | 1
へま | 1 | 1
まめ | 1 | 1
よく | 1 | 1
ろく | 1 | 1
ネガ | 1 | 2 -- what the heck?
ネガ | 1 | 2
パス | 1 | 1
*/
Note that the following workaround works fine:
select word,n,count(*) over (partition by md5(word)) as k
from (
select md5(word),max(word) as word,count(*) as n
from :tmpwl
group by 1
) a order by 1,2 limit 12;
/* gives:
word | n | k
------+---+---
いい | 1 | 1
くず | 1 | 1
ごみ | 1 | 1
さま | 1 | 1
さん | 1 | 1
へま | 1 | 1
まめ | 1 | 1
よく | 1 | 1
ろく | 1 | 1
ネガ | 2 | 1
パス | 1 | 1
プア | 1 | 1
*/
The version is PostgreSQL 8.2.14 (Greenplum Database 4.0.4.0 build 3 Single-Node Edition) on x86_64-unknown-linux-gnu, compiled by GCC gcc.exe (GCC) 4.1.1 compiled on Nov 30 2010 17:20:26.
The source table :tmpwl:
\d :tmpwl
Table "pg_temp_25149.pdtmp_foo706453357357532"
Column | Type | Modifiers
----------+---------+-----------
baseword | text |
word | text |
value | integer |
lexicon | text |
nalts | bigint |
Distributed by: (word)

Related

postgresql: query two tables with same column names and show the result side by side ordered their column names, which occur in both tables

I have two tables (table1, table2) with the same column names (generation, parent). The desired output is all columns of both tables side by side: the rows of table2 should be attached to the rows of table1 that have the same generation. Within each generation, the parent values should be in ascending order for table1 as well as for table2. The result should have exactly as many rows as table1.
Given the following tables
table1:
| generation | parent |
|:----------:|:------:|
| 0 | 1 |
| 0 | 2 |
| 0 | 3 |
| 1 | 3 |
| 1 | 2 |
| 1 | 1 |
| 2 | 2 |
| 2 | 1 |
| 2 | 3 |
table2:
| generation | parent |
|:----------:|:------:|
| 1 | 3 |
| 1 | 1 |
| 1 | 3 |
| 2 | 1 |
| 2 | 2 |
| 2 | 3 |
The following statements create and populate the two sample tables shown above:
create table table1(generation integer, parent integer);
insert into table1 (generation, parent) values(0,1),(0,2),(0,3),(1,3),(1,2),(1,1),(2,2),(2,1),(2,3);
create table table2(generation integer, parent integer);
insert into table2 (generation, parent) values(1,3),(1,1),(1,3),(2,1),(2,2),(2,3);
The query I am looking for should produce the following result:
| table1_generation | table1_parent | table2_generation | table2_parent |
|:-----------------:|:-------------:|:-----------------:|:-------------:|
| 0 | 1 | | |
| 0 | 2 | | |
| 0 | 3 | | |
| 1 | 1 | 1 | 1 |
| 1 | 2 | 1 | 3 |
| 1 | 3 | 1 | 3 |
| 2 | 1 | 2 | 1 |
| 2 | 2 | 2 | 2 |
| 2 | 3 | 2 | 3 |
Current query looks as follows:
with
p as (
select
generation,
parent
from
table1
order by
generation,
parent
), o as(
select
generation,
parent
from
table2
order by
generation,
parent
)
select
p.generation as table1_generation,
p.parent as table1_parent,
o.generation as table2_generation,
o.parent as table2_parent
from
p
left join o on
o.generation=p.generation;
Which leads to the following result:
| table1_generation | table1_parent | table2_generation | table2_parent |
|:-----------------:|:-------------:|:-----------------:|:-------------:|
| 0 | 1 | | |
| 0 | 2 | | |
| 0 | 3 | | |
| 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 3 |
| 1 | 1 | 1 | 3 |
| 1 | 2 | 1 | 1 |
| 1 | 2 | 1 | 3 |
| 1 | 2 | 1 | 3 |
| 1 | 3 | 1 | 1 |
| 1 | 3 | 1 | 3 |
| 1 | 3 | 1 | 3 |
| 2 | 1 | 2 | 1 |
| 2 | 1 | 2 | 2 |
| 2 | 1 | 2 | 3 |
| 2 | 2 | 2 | 1 |
| 2 | 2 | 2 | 2 |
| 2 | 2 | 2 | 3 |
| 2 | 3 | 2 | 1 |
| 2 | 3 | 2 | 2 |
| 2 | 3 | 2 | 3 |
This link led me to conclude that a join might not be what is needed here. But a union only appends rows, so it is unclear to me how the desired result can be achieved.
Any help is highly appreciated. Thanks in advance!
The main misunderstanding in this question comes from the word join, which is a precisely defined concept based on the Cartesian product and can be applied to any two sets. That is why the current output looks the way it does.
But as you wrote in the title, you want to put the two tables side by side. This relies on each generation that is present in both tables having the same number of rows (three) in each.
This select returns the output you want.
I added an artificial join column, row_number() OVER (order by generation, parent) as rnum, to each table and shifted the second table by three rows (o.rnum+3), because table1 starts with three generation-0 rows that have no counterpart in table2. I hope this helps you:
with
p as (
select
row_number() OVER (order by generation, parent) as rnum,
generation,
parent
from
table1
order by
generation,
parent
), o as(
select
row_number() OVER (order by generation, parent) as rnum,
generation,
parent
from
table2
order by
generation,
parent
)
select
p.generation as table1_generation,
p.parent as table1_parent,
o.generation as table2_generation,
o.parent as table2_parent
from
p
left join o on
o.rnum+3=p.rnum
order by 1,2,3,4;
Output:
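| table1_generation | table1_parent | table2_generation | table2_parent |
|:-----------------:|:-------------:|:-----------------:|:-------------:|
| 0 | 1 | (null) | (null) |
| 0 | 2 | (null) | (null) |
| 0 | 3 | (null) | (null) |
| 1 | 1 | 1 | 1 |
| 1 | 2 | 1 | 3 |
| 1 | 3 | 1 | 3 |
| 2 | 1 | 2 | 1 |
| 2 | 2 | 2 | 2 |
| 2 | 3 | 2 | 3 |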
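The fixed offset of three only works for this exact data. A possible variation, offered as a sketch rather than a definitive answer, numbers the rows within each generation and joins on the generation plus that position. It uses only the table1 and table2 definitions from the question. The ordering by parent inside each generation is taken from the desired result, and the final order by p.rnum is an added assumption to keep the output stable.

```sql
-- Sketch: pair rows of table1 and table2 by their position inside each generation.
with
p as (
    select
        generation,
        parent,
        -- position of the row within its generation, ordered by parent
        row_number() over (partition by generation order by parent) as rnum
    from table1
), o as (
    select
        generation,
        parent,
        row_number() over (partition by generation order by parent) as rnum
    from table2
)
select
    p.generation as table1_generation,
    p.parent as table1_parent,
    o.generation as table2_generation,
    o.parent as table2_parent
from p
left join o
    on o.generation = p.generation
    and o.rnum = p.rnum   -- n-th row of table2 next to n-th row of table1
order by p.generation, p.rnum;
```

Because the join is a left join from table1, generations that are missing from table2 (generation 0 in the sample data) still appear, with nulls in the table2 columns. If a generation had more rows in table2 than in table1, the extra table2 rows would be dropped.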

PostgreSQL 9.3: Updating table (order column) from another table, getting same values in rows

I need help with updating a table from another table in a Postgres database.
Long story short, we ended up with corrupted data in the database, and now I need to update one table with values from another.
I have a table wfc with this data:
| step_id | command_id | commands_order |
|---------|------------|----------------|
| 1 | 1 | 0 |
| 1 | 2 | 1 |
| 1 | 3 | 2 |
| 1 | 4 | 3 |
| 1 | 1 | 0 |
| 2 | 2 | 0 |
| 2 | 3 | 1 |
| 2 | 3 | 1 |
| 2 | 4 | 3 |
and I want to update the values in the commands_order column from another table, so that the result looks like this:
| step_id | command_id | commands_order|
|---------|------------|---------------|
| 1 | 1 | 0 |
| 1 | 2 | 1 |
| 1 | 3 | 2 |
| 1 | 4 | 3 |
| 1 | 1 | 4 |
| 2 | 2 | 0 |
| 2 | 3 | 1 |
| 2 | 3 | 2 |
| 2 | 4 | 3 |
It looked like an easy task, but the problem is that rows with the same command_id all get the same value written to commands_order.
The SQL that I tried is:
UPDATE wfc
SET commands_order = CAST(sq.input_step_id as INTEGER)
FROM (
SELECT wfp.step_id, wfp.input_command_id, wfp.input_step_id
from wfp
order by wfp.step_id, wfp.input_step_id
) AS sq
WHERE (wfc.step_id=sq.step_id AND wfc.command_id=CAST(sq.input_command_id as INTEGER));
SQL Fiddle http://sqlfiddle.com/#!15/4efff4/4
I am pretty stuck with this, please help.
Thanks in advance.
Assuming you are trying to number the rows in the order in which they were created, and as long as you understand that ctid will change on update and with VACUUM FULL, you can do the following:
select step_id, command_id, rank - 1 as command_order
from (select step_id, command_id, ctid as wfc_ctid, rank() over
(partition by step_id order by ctid)
from wfc) as wfc_ordered;
This will give you the wfc table with the ordering that you want. If you do update the original table, the ctids will change, so it's probably safer to create a copy of the table with the above query.
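If an in-place update is preferred over a copy, one possible sketch is a single UPDATE that joins wfc to a numbered snapshot of itself by ctid. The names row_ctid and rn are made up for this illustration. It rests on the same assumption as the query above, namely that the physical row order (ctid) still reflects the order in which the rows were created, which PostgreSQL does not guarantee.

```sql
-- Sketch: renumber commands_order per step_id in one statement.
-- The subquery is evaluated against the table as it was when the statement
-- started, so the ctids it returns still identify the rows being updated.
UPDATE wfc
SET commands_order = o.rn - 1              -- the numbering in the question starts at 0
FROM (
    SELECT ctid AS row_ctid,               -- physical row identifier, used as a join key
           row_number() OVER (PARTITION BY step_id ORDER BY ctid) AS rn
    FROM wfc
) AS o
WHERE wfc.ctid = o.row_ctid;
```

Running it inside a transaction and checking the result with a select before committing is a cheap safeguard, since the ctids are different afterwards and the statement cannot simply be repeated to get the same ordering.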

Setting multiple rows in postgres based on the set values of previous postgres rows

I'm running postgres 9.4.
I'm essentially converting an existing unorganized structure to a folder-based organization. I'm auto-assigning an order number to each item so that users can reorder them, and I'm doing the initial setting of all of these values with a one-time update statement. However, it seems like SET is taking my subquery's FROM clause and not re-evaluating it for each successive row that it sets.
Here's my query example:
UPDATE folder_items
SET order_number =
(SELECT COALESCE(MAX(folder_items_2.order_number), 0) + 1
FROM folder_items AS folder_items_2
WHERE folder_items.parent_folder_id = folder_items_2.parent_folder_id
AND folder_items.folder_set_id = folder_items_2.folder_set_id
AND folder_items.id != folder_items_2.id);
With my initial table:
| folder_id | folder_set_id | order_number
row 1 | 1 | 1 | null
row 2 | 2 | 1 | null
row 3 | 3 | 2 | null
row 4 | 4 | 2 | null
row 5 | 5 | 2 | null
row 6 | 6 | 3 | null
when I run my query I get something like
| folder_id | folder_set_id | order_number
row 1 | 1 | 1 | 1
row 2 | 2 | 1 | 1
row 3 | 3 | 2 | 1
row 4 | 4 | 2 | 1
row 5 | 5 | 2 | 1
row 6 | 6 | 3 | 1
However, I want results that look like this:
| folder_id | folder_set_id | order_number
row 1 | 1 | 1 | 1
row 2 | 2 | 1 | 2
row 3 | 3 | 2 | 1
row 4 | 4 | 2 | 2
row 5 | 5 | 2 | 3
row 6 | 6 | 3 | 1
Is there a way to get these desired results? Is the best way to do some sort of window function that counts how many in the same folder_set_id are underneath each row?
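An UPDATE evaluates its subqueries against the table as it was when the statement started, so your correlated MAX() only ever sees the original NULL values and every row gets 1. Use ROW_NUMBER to calculate the order_number first, then update the table.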
with new_order as (
SELECT "folder_id",
row_number() over ( partition by "folder_set_id"
order by "folder_id") as rn
FROM Table1
)
UPDATE Table1 AS t
SET "order_number" = n.rn
FROM new_order AS n
WHERE t."folder_id" = n."folder_id";
SQL DEMO
OUTPUT
| row_id | folder_id | folder_set_id | order_number |
|--------|-----------|---------------|--------------|
| row 1 | 1 | 1 | 1 |
| row 2 | 2 | 1 | 2 |
| row 3 | 3 | 2 | 1 |
| row 4 | 4 | 2 | 2 |
| row 5 | 5 | 2 | 3 |
| row 6 | 6 | 3 | 1 |
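The demo above uses a table called Table1 with a folder_id column. Mapped back onto the names in the question's own UPDATE (folder_items, id, parent_folder_id, folder_set_id), the same idea might look like the sketch below. The CTE name numbered is hypothetical, and ordering by id is an assumption, since the question does not say which initial order is wanted.

```sql
-- Sketch: number the items inside each (parent_folder_id, folder_set_id) group,
-- then write those numbers back in a single UPDATE.
WITH numbered AS (
    SELECT id,
           row_number() OVER (PARTITION BY parent_folder_id, folder_set_id
                              ORDER BY id) AS rn   -- assumed initial order: by id
    FROM folder_items
)
UPDATE folder_items AS f
SET order_number = n.rn
FROM numbered AS n
WHERE f.id = n.id;
```

The partition columns mirror the two equality conditions in the original correlated subquery, so items are numbered separately for each combination of parent folder and folder set.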

PostgreSQL Join Two Tables by Nearest Date

I have a large single table of sent emails with dates and outcomes and I'd like to be able to match each row with the last time that email was sent and a specific outcome occurred (here that open=1). This needs to be done with PostgreSQL. For example:
Initial table:
id | sent_dt | bounced | open | clicked | unsubscribe
1 | 2015-01-01 | 1 | 0 | 0 | 0
1 | 2015-01-02 | 0 | 1 | 1 | 0
1 | 2015-01-03 | 0 | 1 | 1 | 0
2 | 2015-01-01 | 0 | 1 | 0 | 0
2 | 2015-01-02 | 1 | 0 | 0 | 0
2 | 2015-01-03 | 0 | 1 | 0 | 0
2 | 2015-01-04 | 0 | 1 | 0 | 1
Result table:
id | sent_dt | bounced| open | clicked | unsubscribe| previous_time
1 | 2015-01-01 | 1 | 0 | 0 | 0 | NULL
1 | 2015-01-02 | 0 | 1 | 1 | 0 | NULL
1 | 2015-01-03 | 0 | 1 | 1 | 0 | 2015-01-02
2 | 2015-01-01 | 0 | 1 | 0 | 0 | NULL
2 | 2015-01-02 | 1 | 0 | 0 | 0 | 2015-01-01
2 | 2015-01-03 | 0 | 1 | 0 | 0 | 2015-01-01
2 | 2015-01-04 | 0 | 1 | 0 | 1 | 2015-01-03
I have tried using Lag but I don't know how to go about that with the conditional that open needs to equal 1 while still returning all rows. I also tried doing a many to many Join on id then finding the minimum Datediff but that is going to essentially square the size of my table and takes entirely too long to compute (>7hrs). There are several answers which would work for SQL but none that I see work for PostgreSQL.
Thanks for any help guys!
You can use ROW_NUMBER() for this: connect each row to the immediately preceding send for the same id, and keep that send's date only if it has open = 1.
SELECT t.*, s.sent_dt AS previous_time
FROM
    (SELECT p.*,
            ROW_NUMBER() OVER(PARTITION BY ID ORDER BY sent_dt DESC) rnk
     FROM YourTable p) t
LEFT OUTER JOIN
    (SELECT p.*,
            ROW_NUMBER() OVER(PARTITION BY ID ORDER BY sent_dt DESC) rnk
     FROM YourTable p) s
-- rnk restarts for every ID, so the join must also match on ID
ON(t.id = s.id AND t.rnk = s.rnk-1 AND s.open = 1)
First I create a CTE, openFilter, with the dates on which the mail was opened.
Then I join the mail table to that filter to get the open dates earlier than each email. Finally I filter out every match except the latest open mail.
SQL Fiddle Demo
WITH openFilter as (
SELECT m."id", m."sent_dt"
FROM mail m
WHERE "open" = 1
)
SELECT m."id",
to_char(m."sent_dt", 'YYYY-MM-DD'),
"bounced", "open", "clicked", "unsubscribe",
to_char(o."sent_dt", 'YYYY-MM-DD') previous_time
FROM mail m
LEFT JOIN openFilter o
ON m."id" = o."id"
AND m."sent_dt" > o."sent_dt"
WHERE o."sent_dt" = (SELECT MAX(t."sent_dt")
FROM openFilter t
WHERE t."id" = m."id"
AND t."sent_dt" < m."sent_dt")
OR o."sent_dt" IS NULL
Output
| id | to_char | bounced | open | clicked | unsubscribe | previous_time |
|----|------------|---------|------|---------|-------------|---------------|
| 1 | 2015-01-01 | 1 | 0 | 0 | 0 | (null) |
| 1 | 2015-01-02 | 0 | 1 | 1 | 0 | (null) |
| 1 | 2015-01-03 | 0 | 1 | 1 | 0 | 2015-01-02 |
| 2 | 2015-01-01 | 0 | 1 | 0 | 0 | (null) |
| 2 | 2015-01-02 | 1 | 0 | 0 | 0 | 2015-01-01 |
| 2 | 2015-01-03 | 0 | 1 | 0 | 0 | 2015-01-01 |
| 2 | 2015-01-04 | 0 | 1 | 0 | 1 | 2015-01-03 |
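Since the question mentions that a self-join was too slow on a large table, a window function with an explicit frame may be worth trying. The sketch below reuses the mail table name from the query above and assumes at most one row per id and sent_dt. The frame stops one row before the current one, so a row never counts its own open.

```sql
-- Sketch: latest earlier sent_dt with open = 1, computed without a self-join.
SELECT m.id,
       m.sent_dt,
       m.bounced,
       m."open",
       m.clicked,
       m.unsubscribe,
       -- CASE yields NULL for rows that were not opened, and max() ignores NULLs
       max(CASE WHEN m."open" = 1 THEN m.sent_dt END)
           OVER (PARTITION BY m.id
                 ORDER BY m.sent_dt
                 ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS previous_time
FROM mail AS m
ORDER BY m.id, m.sent_dt;
```

For the first row of each id the frame is empty, so previous_time is NULL, which matches the result table in the question.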

Count occurrences of value in field for a particular ID using Redshift

I want to count the occurrences of particular values in a certain field for an ID. So what I have is this:
| Location ID | Group |
|:----------- |:---------|
| 1 | Group A |
| 2 | Group B |
| 3 | Group C |
| 4 | Group A |
| 4 | Group B |
| 4 | Group C |
| 3 | Group A |
| 2 | Group B |
| 1 | Group C |
| 2 | Group A |
And what I would hope to yield through some computer magic is this:
| Location ID | Group A Count | Group B Count | Group C count|
|:----------- |:--------------|:--------------|:-------------|
| 1 | 1 | 0 | 1 |
| 2 | 1 | 2 | 0 |
| 3 | 1 | 0 | 1 |
| 4 | 1 | 1 | 1 |
Is there some sort of pivoting function I can use in Redshift to achieve this?
This requires a CASE expression and a GROUP BY clause, as in this example:
SELECT l_id,
SUM(CASE WHEN l_group = 'Group A' THEN 1 ELSE 0 END) AS a,
SUM(CASE WHEN l_group = 'Group B' THEN 1 ELSE 0 END) AS b -- and so on
FROM location
GROUP BY l_id;
This should give you a result like this:
| l_id | a | b |
|------|---|---|
| 4 | 1 | 1 |
| 1 | 1 | 0 |
| 3 | 1 | 0 |
| 2 | 1 | 2 |
You can play with it on this SQL Fiddle.
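For completeness, here is a self-contained sketch of the same conditional aggregation with all three groups, using the sample rows from the question. The table and column names (location, l_id, l_group) follow the query above. The column types and the output column names are assumptions made for this illustration.

```sql
-- Sample data from the question; the column types are assumed.
CREATE TABLE location (l_id integer, l_group varchar(20));

INSERT INTO location (l_id, l_group) VALUES
    (1, 'Group A'), (2, 'Group B'), (3, 'Group C'), (4, 'Group A'), (4, 'Group B'),
    (4, 'Group C'), (3, 'Group A'), (2, 'Group B'), (1, 'Group C'), (2, 'Group A');

-- One output column per group value; each CASE contributes 1 only for its own group.
SELECT l_id,
       SUM(CASE WHEN l_group = 'Group A' THEN 1 ELSE 0 END) AS group_a_count,
       SUM(CASE WHEN l_group = 'Group B' THEN 1 ELSE 0 END) AS group_b_count,
       SUM(CASE WHEN l_group = 'Group C' THEN 1 ELSE 0 END) AS group_c_count
FROM location
GROUP BY l_id
ORDER BY l_id;
```

The list of groups has to be known when the query is written, because every output column is spelled out by hand. A new group value means adding another SUM(CASE ...) line.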