Parsing a database column into individual columns - postgresql

I'm trying to parse data stored in a single database column into individual columns. The data can vary in length. I want each output column to be named after the key being parsed, e.g. for number=12345 the column should be called number and the value in the column should be 12345.
An example of the data stored in a column:
id | text
---+-------------------------------------------------------------------------
 1 | Re: Fwd: number=12345:bottle=glass:water=sparkling:food=chocolate
 2 | number=223344:bottle=plastic:water=still:food=sweets:biscuit=digestive
What I would like is the following:
id | Re   | Fwd  | number | bottle  | water     | food      | biscuit
---+------+------+--------+---------+-----------+-----------+-----------
 1 | Re   | Fwd  | 12345  | glass   | sparkling | chocolate | null
 2 | null | null | 223344 | plastic | still     | sweets    | digestive
I've tried (select string_to_array(text, ':') from my_table), which splits the data, but not in the way I want.
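For reference, string_to_array on its own only yields an array of key=value strings; split_part is what separates a key from its value. A quick sketch (against the my_table name used above):
-- string_to_array keeps each element as a 'key=value' string:
select id, string_to_array(text, ':') from my_table;
-- e.g. {Re," Fwd"," number=12345",bottle=glass, ...}

-- split_part then separates one element into key and value:
select split_part(trim(' number=12345'), '=', 1) as key,   -- number
       split_part(trim(' number=12345'), '=', 2) as value; -- 12345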

table to insert to:
create table s201 (id int,Re text, Fwd text,number int, bottle text, water text, food text, biscuit text);
query:
with pre as (
  -- a: the raw rows (k = id, v = the text to parse)
  with a(k,v) as (
    values (1,'Re: Fwd: number=12345:bottle=glass:water=sparkling:food=chocolate'),
           (2,'number=223344:bottle=plastic:water=still:food=sweets:biscuit=digestive')
  )
  -- col: the target column names with their position n
  , col(s,n) as (
    select * from unnest(array['Re','Fwd','number','bottle','water','food','biscuit']) with ordinality c (s,n)
  )
  -- split each text on ':' and match every element to its target column;
  -- elements without '=' (Re, Fwd) keep their whole value, the rest keep the part after '='
  select s, n, k,
         case when o not like '%=%' then trim(o) else split_part(trim(o),'=',2) end c
  from col
  join a on true
  left outer join unnest(string_to_array(a.v,':')) with ordinality t (o,i)
       on split_part(trim(o),'=',1) = s
)
-- agg: collect the values per row, in column order
, agg as (
  select k, array_agg(c order by n) a
  from pre
  group by k
)
insert into s201
select k, a[1], a[2], a[3]::int, a[4], a[5], a[6], a[7]
from agg;
INSERT 0 2
checking:
select * from s201;
 id | re | fwd | number | bottle  |   water   |   food    |  biscuit
----+----+-----+--------+---------+-----------+-----------+-----------
  1 | Re | Fwd |  12345 | glass   | sparkling | chocolate |
  2 |    |     | 223344 | plastic | still     | sweets    | digestive
(2 rows)

Related

Postgres: Query for a list of ids in a mapping table and create them if they don't exist

Assume we have the following table whose purpose is to autogenerate a numeric id for distinct (name, location) tuples:
CREATE TABLE mapping
(
  id bigserial PRIMARY KEY,
  name text NOT NULL,
  location text NOT NULL
);
CREATE UNIQUE INDEX idx_name_loc ON mapping (name, location);
What is the most efficient way to query for a set of (name, location) tuples and autocreate any mappings that don't already exist, with all mappings (including the ones we just created) being returned to the user?
My naive implementation would be something like:
SELECT id, name, location
FROM mapping
WHERE (name, location) IN ((name_1, location_1)...(name_n, location_n))
Then do something with the results in a programming language of my choice to work out which tuples are missing, and:
INSERT INTO mapping (name, location)
VALUES (missing_name_1, missing_loc_1), ... (missing_name_2, missing_loc_2)
ON CONFLICT DO NOTHING
This gets the job done, but I get the feeling there's probably something that can a) be done in pure SQL and b) be more efficient.
You can use DISTINCT to get all possible values for the two columns, and CROSS JOIN to get their Cartesian product.
LEFT JOIN with the original table to get the actual records (if any):
CREATE TABLE mapping
( id bigserial PRIMARY KEY
, name text NOT NULL
, location text NOT NULL
, UNIQUE (name, location)
);
INSERT INTO mapping(name, location) VALUES ('Alice', 'kitchen'), ('Bob', 'bedroom' );
SELECT * FROM mapping;
SELECT n.name, l.location, m.id
FROM (SELECT DISTINCT name from mapping) n
CROSS JOIN (SELECT DISTINCT location from mapping) l
LEFT JOIN mapping m ON m.name = n.name AND m.location = l.location
;
Results:
CREATE TABLE
INSERT 0 2
 id | name  | location
----+-------+----------
  1 | Alice | kitchen
  2 | Bob   | bedroom
(2 rows)

 name  | location | id
-------+----------+----
 Alice | kitchen  |  1
 Alice | bedroom  |
 Bob   | kitchen  |
 Bob   | bedroom  |  2
(4 rows)
And if you want to physically INSERT the missing combinations:
INSERT INTO mapping(name, location)
SELECT n.name, l.location
FROM (SELECT DISTINCT name from mapping) n
CROSS JOIN (SELECT DISTINCT location from mapping) l
WHERE NOT EXISTS(
SELECT *
FROM mapping m
WHERE m.name = n.name AND m.location = l.location
)
;
SELECT * FROM mapping;
INSERT 0 2
 id | name  | location
----+-------+----------
  1 | Alice | kitchen
  2 | Bob   | bedroom
  3 | Alice | bedroom
  4 | Bob   | kitchen
(4 rows)
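If the input is an arbitrary list of (name, location) tuples rather than the Cartesian product of existing values, the lookup and the inserts can also be collapsed into a single statement with a data-modifying CTE and ON CONFLICT. A sketch (the VALUES rows stand in for the application's input, untested):
WITH input(name, location) AS (
    VALUES ('Alice', 'kitchen'), ('Carol', 'garage')   -- the tuples to resolve or create
), ins AS (
    INSERT INTO mapping (name, location)
    SELECT name, location FROM input
    ON CONFLICT (name, location) DO NOTHING
    RETURNING id, name, location                       -- only the newly inserted rows
)
SELECT id, name, location FROM ins
UNION ALL
SELECT m.id, m.name, m.location                        -- the rows that already existed
FROM   mapping m
JOIN   input i ON (m.name, m.location) = (i.name, i.location);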

Fetch records with a distinct value of one column, replacing another column's value when there are multiple records

I have 2 tables that I need to join, returning one row per distinct rid, while replacing the code value when a rid has different codes across multiple rows. It's better explained with the example set below.
CREATE TABLE usr (rid INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(12) NOT NULL,
email VARCHAR(20) NOT NULL);
CREATE TABLE usr_loc
(rid INT NOT NULL,
code CHAR NOT NULL,
loc_id INT NOT NULL,
PRIMARY KEY (rid, code, loc_id));
INSERT INTO usr VALUES
(1,'John','john#product'),
(2,'Linda','linda#product'),
(3,'Greg','greg#product'),
(4,'Kate','kate#product'),
(5,'Johny','johny#product'),
(6,'Mary','mary#test');
INSERT INTO usr_loc VALUES
(1,'A',4532),
(1,'I',4538),
(1,'I',4545),
(2,'I',3123),
(3,'A',4512),
(3,'A',4527),
(4,'I',4567),
(4,'A',4565),
(5,'I',4512),
(6,'I',4567),
(6,'I',4569);
Required Result Set
+-----+-------+------+-----------------+
| rid | name | Code | email |
+-----+-------+------+-----------------+
| 1 | John | B | 'john#product' |
| 2 | Linda | I | 'linda#product' |
| 3 | Greg | A | 'greg#product' |
| 4 | Kate | B | 'kate#product' |
| 5 | Johny | I | 'johny#product' |
| 6 | Mary | I | 'mary#test' |
+-----+-------+------+-----------------+
I have tried some queries to join and some to count, but I'm lost on one that satisfies the whole scenario exactly.
The query I came up with is
SELECT distinct(a.rid) as rid, a.name, a.email, 'B' as code
FROM usr a
JOIN usr_loc b ON a.rid=b.rid
WHERE a.rid IN (SELECT rid FROM usr_loc GROUP BY rid HAVING COUNT(*) > 1);
You need to group by the users and count how many distinct codes each one has in usr_loc. If there is more than one, replace the code with B. See below:
select
rid,
name,
case when cnt > 1 then 'B' else min_code end as code,
email
from (
select u.rid, u.name, u.email, min(l.code) as min_code, count(distinct l.code) as cnt
from usr u
join usr_loc l on l.rid = u.rid
group by u.rid, u.name, u.email
) x;
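An equivalent form without the derived table, counting the distinct codes directly in the grouped query (a sketch, untested against the data above):
SELECT u.rid,
       u.name,
       CASE WHEN COUNT(DISTINCT l.code) > 1 THEN 'B' ELSE MIN(l.code) END AS code,
       u.email
FROM usr u
JOIN usr_loc l ON l.rid = u.rid
GROUP BY u.rid, u.name, u.email;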
Seems to me that you are using MySQL, rather than IBM DB2. Is that so?

Postgres - updates with join gives wrong results

I'm having a hard time understanding what I'm doing wrong.
This query sets the same value on every row instead of updating each row with its own result.
My DATA
I'm trying to update a table of stats over a set of businesses.
CREATE TABLE business_stats (
  id SERIAL,
  pk integer not null,
  b_total integer,
  PRIMARY KEY(pk)
);
the details of each business are stored here
CREATE TABLE business_details (
  id SERIAL,
  category CHARACTER VARYING,
  feature_a CHARACTER VARYING,
  feature_b CHARACTER VARYING,
  feature_c CHARACTER VARYING
);
and here is a table that associates the pk with the category
CREATE TABLE datasets (
  id SERIAL,
  pk integer not null,
  category CHARACTER VARYING,
  PRIMARY KEY(pk)
);
WHAT I DID (wrong)
UPDATE business_stats
SET b_total = agg.total
FROM business_stats b,
( SELECT d.pk, count(bd.id) total
FROM business_details AS bd
INNER JOIN datasets AS d
ON bd.category = d.category
GROUP BY d.pk
) agg
WHERE b.pk = agg.pk;
The result of this query is
| id | pk | b_total |
+----+----+---------+
|  1 | 14 |  273611 |
|  2 | 15 |  273611 |
|  3 | 16 |  273611 |
|  4 | 17 |  273611 |
but if I run just the SELECT, the results for each pk are completely different:
| pk | agg.total |
+----+-----------+
| 14 |    273611 |
| 15 |    407802 |
| 16 |    179996 |
| 17 |    815580 |
THE QUESTION
Why is this happening?
Why is the WHERE clause not working?
Before writing this question I've used as reference these posts: a, b, c
Do the following (I always recommend against joins in Updates)
UPDATE business_stats bs
SET b_total =
( SELECT count(bd.id) total
FROM business_details AS bd
INNER JOIN datasets AS d
ON bd.category = d.category
where d.pk=bs.pk
)
/*optional*/
where exists (SELECT *
FROM business_details AS bd
INNER JOIN datasets AS d
ON bd.category = d.category
where d.pk=bs.pk)
The issue is your FROM clause. The repeated reference to business_stats means you aren't restricting the join like you expect to. You're joining agg against the second unrelated mention of business_stats rather than the row you want to update.
Something like this is what you are after (warning not tested):
UPDATE business_stats AS b
SET b_total = agg.total
FROM
(...) agg
WHERE b.pk = agg.pk;
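With the agg subquery from the question dropped in, the corrected statement reads (untested, as noted above):
UPDATE business_stats AS b
SET b_total = agg.total
FROM (
    SELECT d.pk, count(bd.id) AS total
    FROM business_details AS bd
    INNER JOIN datasets AS d ON bd.category = d.category
    GROUP BY d.pk
) AS agg
WHERE b.pk = agg.pk;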

PostgreSQL UNION doesn't merge lines properly

I have 3 tables in a PostgreSQL database:
localities (loc, 12561 rows)
plants (pl, 17052 rows)
specimens or samples (esp, 9211 rows)
pl and esp each have a field loc, to specify where that tagged plant lives, or where that sample (usually a branch with leaves and flowers) came from.
I need a report of the places that have plants or samples, and the number of plants and samples in each place. The best I've managed so far is the union of two subqueries, which runs very fast (33 ms to fetch 69 rows):
(select l.id,l.nome,count(pl.id) pls,null esps
from loc l
left join pl on pl.loc = l.id
where l.id in
(select distinct pl.loc
from pl
where pl.loc > 0)
group by l.id,l.nome
union
select l.id,l.nome,null pls,count(e.id) esps
from loc l
left join esp e on e.loc = l.id
where l.id in
(select distinct e.loc
from esp e
where e.loc > 0)
group by l.id,l.nome)
order by id
The point is, when the same place has both plants and samples, it becomes two distinct lines, like:
11950 | San Martin | | 5 |
11950 | San Martin | 61 | |
Of course what I want is:
11950 | San Martin | 61 | 5 |
Before that, I had tried doing it all in one query:
select l.id,l.nome,count(pl.id),count(e.id) esps
from loc l
left join pl on pl.loc = l.id
left join esp e on e.loc = l.id
where l.id in
(select distinct pl.loc
from pl
where pl.loc > 0)
or l.id in
(select distinct e.loc
from esp e
where e.loc > 0)
group by l.id,l.nome
but it returns a strange repetition: the two counts get multiplied together (61 plants × 5 samples = 305) and the same number shows up in both columns:
11950 | San Martin | 305 | 305 |
I have tried without subqueries, but it was taking about 13 seconds, which is too long.
I created a test layout with:
create table localities (id integer, loc_name text);
create table plants (plant_id integer, loc_id integer);
create table samples (sample_id integer, loc_id integer);
insert into localities select x, ('Loc ' || x::text) from generate_series(1, 12561) x ;
insert into plants select x, (random()*12561)::integer from generate_series(1, 17052) x;
insert into samples select x, (random()*12561)::integer from generate_series(1, 9211) x;
The trick is to create an intermediate result from plants and samples with the same structure. Where a column doesn't make sense (a plant has no sample_id), you add null:
select loc_id, plant_id, null as sample_id from plants
union all
select loc_id, null as plant_id, sample_id from samples
This result has a unified structure, and you can then aggregate over it (I'm using WITH to make it a bit more readable):
with localities_used as (
select loc_id, plant_id, null as sample_id from plants
union all
select loc_id, null as plant_id, sample_id from samples)
select
localities_used.loc_id,
count(localities_used.plant_id) plant_count,
count(localities_used.sample_id) sample_count
from
localities_used
group by
localities_used.loc_id;
If you need additional data from localities, you can join them on the aggregated table:
with localities_used as (
select loc_id, plant_id, null as sample_id from plants
union all
select loc_id, null as plant_id, sample_id from samples),
aggregated as (
select
localities_used.loc_id,
count(localities_used.plant_id) plant_count,
count(localities_used.sample_id) sample_count
from
localities_used
group by
localities_used.loc_id)
select * from aggregated left outer join localities on aggregated.loc_id = localities.id;
This takes 75 ms on my laptop altogether.
This should be as easy as
select * from (
select
location.*,
(select count(id) from plant where plant.location = location.id) as plants,
(select count(id) from sample where sample.location = location.id) as samples
from location
) subquery
where subquery.plants > 0 or subquery.samples > 0;
 id |    name    | plants | samples
----+------------+--------+---------
  1 | San Martin |      2 |       1
  2 | Rome       |      1 |       2
  3 | Dallas     |      3 |       1
(3 rows)
This is the database I quickly set up to experiment with:
create table location(id serial primary key, name text);
create table plant(id serial primary key, name text, location integer references location(id));
create table sample(id serial primary key, name text, location integer references location(id));
insert into location (name) values ('San Martin'), ('Rome'), ('Dallas'), ('Ghost Town');
insert into plant (name, location) values ('San Martin Dandelion', 1),('San Martin Camomile', 1), ('Rome Raspberry', 2), ('Dallas Locoweed', 3), ('Dallas Lemongrass', 3), ('Dallas Setaria', 3);
insert into sample (name, location) values ('San Martin Bramble', 1), ('Rome Iris', 2), ('Rome Eucalypt', 2), ('Dallas Dogbane', 3);
tests=# select * from location;
 id |    name
----+------------
  1 | San Martin
  2 | Rome
  3 | Dallas
  4 | Ghost Town
(4 rows)
tests=# select * from plant;
 id |         name         | location
----+----------------------+----------
  1 | San Martin Dandelion |        1
  2 | San Martin Camomile  |        1
  3 | Rome Raspberry       |        2
  4 | Dallas Locoweed      |        3
  5 | Dallas Lemongrass    |        3
  6 | Dallas Setaria       |        3
(6 rows)
tests=# select * from sample;
 id |        name        | location
----+--------------------+----------
  1 | San Martin Bramble |        1
  2 | Rome Iris          |        2
  3 | Rome Eucalypt      |        2
  4 | Dallas Dogbane     |        3
(4 rows)
I didn't test that but I think it could be something like this:
SELECT
    l.id,
    l.nome,
    COUNT(DISTINCT pl.id) as plants_count,
    COUNT(DISTINCT e.id) as esp_count
FROM loc l
LEFT JOIN pl ON pl.loc = l.id
LEFT JOIN esp e ON e.loc = l.id
GROUP BY l.id, l.nome
The point is to count the distinct non-null ids of each type; without DISTINCT the two LEFT JOINs multiply the rows per locality (61 × 5 = 305), which is exactly the repetition you saw.
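Another way to avoid the fan-out entirely is to aggregate each table on its own and join the pre-aggregated counts back to loc. A sketch using the names from the question (untested):
select l.id, l.nome, p.pls, e.esps
from loc l
left join (select loc, count(*) as pls  from pl  where loc > 0 group by loc) p on p.loc = l.id
left join (select loc, count(*) as esps from esp where loc > 0 group by loc) e on e.loc = l.id
where p.loc is not null or e.loc is not null
order by l.id;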

Need cleaner update method in PostgreSQL 9.1.3

Using PostgreSQL 9.1.3, I have a points table like so (what's the right way to show tables here??):
| Column | Type              | Table Modifiers                                      | Storage  |
|--------|-------------------|------------------------------------------------------|----------|
| id     | integer           | not null default nextval('points_id_seq'::regclass) | plain    |
| name   | character varying | not null                                             | extended |
| abbrev | character varying | not null                                             | extended |
| amount | real              | not null                                             | plain    |
In another table, orders, I have a bunch of columns whose names match values in the abbrev column of the points table, as well as a total_points column:
| Column       | Type | Table Modifiers    |
|--------------|------|--------------------|
| ud           | real | not null default 0 |
| sw           | real | not null default 0 |
| prrv         | real | not null default 0 |
| total_points | real | default 0          |
So in orders I have the sw column, and in points there is an amount that relates to that column via the row where abbrev = 'sw'.
I have about 15 columns like that in the points table, and now I want to set a trigger so that when I create/update an entry in the points table, I calculate a total score. Basically with just those three shown I could do it long-hand like this:
UPDATE orders
SET total_points =
ud * (SELECT amount FROM points WHERE abbrev = 'ud') +
sw * (SELECT amount FROM points WHERE abbrev = 'sw') +
prrv * (SELECT amount FROM points WHERE abbrev = 'prrv')
WHERE ....
But that's just plain ugly and repetitive, and like I said there are really 15 of them (right now...). I'm hoping there's a more sophisticated way to handle this.
In general, each of those silly names on the orders table represents a type of work associated with the order, and each of those types has a 'cost' to it, which is stored in the points table. I'm not married to this structure if there's a cleaner setup.
"Serialize" the costs for orders:
CREATE TABLE order_cost (
  order_cost_id serial PRIMARY KEY
, order_id      int NOT NULL REFERENCES orders
, cost_type_id  int NOT NULL REFERENCES points
, cost          int NOT NULL DEFAULT 0   -- in cents
);
For a single row:
UPDATE orders o
SET    total_points = COALESCE((
          SELECT sum(oc.cost * p.amount) AS order_cost
          FROM   order_cost oc
          JOIN   points p ON oc.cost_type_id = p.id
          WHERE  oc.order_id = o.order_id
          ), 0)
WHERE  o.order_id = $<order_id>;  -- your order_id here ...
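Since the question asks for a trigger, the single-row UPDATE above could be wrapped in a trigger function on order_cost. A sketch (the function and trigger names are made up, untested):
CREATE OR REPLACE FUNCTION refresh_order_total()   -- hypothetical name
  RETURNS trigger AS
$$
BEGIN
    -- recompute the total for the order touched by this order_cost row
    UPDATE orders o
    SET    total_points = COALESCE((
              SELECT sum(oc.cost * p.amount)
              FROM   order_cost oc
              JOIN   points p ON oc.cost_type_id = p.id
              WHERE  oc.order_id = NEW.order_id
              ), 0)
    WHERE  o.order_id = NEW.order_id;
    RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER order_cost_refresh_total   -- hypothetical name
AFTER INSERT OR UPDATE ON order_cost
FOR EACH ROW EXECUTE PROCEDURE refresh_order_total();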
Never use the lossy type real for currency data. Use exact types like money, numeric or just integer, where integer is understood to store the amount in cents.
More advice in this closely related example:
How to implement a many-to-many relationship in PostgreSQL?