Create rows from part of column names - postgresql

Source data
I am working on an ELT project to load data from CSV files into PostgreSQL where I will transform it. The CSV files have many columns that are consistent across files, but also contain activity columns that are inconsistent with names like Date (05/19/2020), Type (05/19/2020), etc.
In the loading script I am merging all of the columns with dates in the column name into one jsonb column so I don't have to constantly add new columns to the raw data table.
The resulting jsonb column in the raw data table looks like this:
id
activity
12345678
{"Date (05/19/2020)": null, "Type (05/19/2020)": null, "Date (06/03/2020)": "06/01/2020", "Type (06/03/2020)": "E"}
98765432
{"Date (05/19/2020)": "05/18/2020", "Type (05/19/2020)": "B", "Date (10/23/2020)": "10/26/2020", "Type (10/23/2020)": "T"}
JSON to columns
Using the amazing create_jsonb_flat_view function from this post I can convert the jsonb to columns like this:
id
Date (05/19/2020)
Type (05/19/2020)
Date (06/03/2020)
Type (06/03/2020)
Type (10/23/2020
Date (10/23/2020)
Type (10/23/2020)
10629465
null
null
06/01/2020
E
98765432
05/18/2020
B
10/26/2020
T
Need to move part of column name to row
Now, this is where I'm stuck. I need to remove the portion of the column name that is the Activity Date (e.g. (05/19/2020)) and create a row for each id and ActivityDate with additional columns for Date and Type like this:
id
ActivityDate
Date
Type
12345678
05/19/2020
null
null
12345678
06/03/2020
06/01/2020
E
98765432
05/19/2020
05/18/2020
B
98765432
10/23/2020
10/26/2020
T

I followed your link to the create_jsonb_flat_view article yesterday and then forgot this question. While I thank you for pointing me there, I think that mentioning it worked against you.
A more conventional approach using regexp_replace() works here. I left the date values as strings, but you can convert them with to_date() if needed:
with parse as (
select id, e.k, e.v,
regexp_replace(e.k, '\s+\([0-9/]{10}\)', '') as k_no_date,
regexp_replace(e.k, '^.+([0-9/]{10}).+', '\1') as k_date_only
from rawinput
cross join lateral jsonb_each_text(activity) as e(k, v)
)
select id,
k_date_only as activity_date,
min(v) filter (where k_no_date = 'Date') as date,
min(v) filter (where k_no_date = 'Type') as type
from parse
group by id, k_date_only;
db<>fiddle here

#Mike-Organek's Answer works beautifully!
However, I was curious if the regexp_replace() calls might be slowing the query down a bit and it seemed I could get the same results using a simpler function.
Since Mike gave me a great example to start with I modified it to split on the space between Date and (05/19/2020).
For 20,000 rows, it went from taking an avg of 7 sec on my local machine to an avg of .9 sec.
Here is the resulting query:
with parse as (
select id, e.k, e.v,
split_part(e.k, ' ', 1) as k_no_date,
trim(split_part(e.k, ' ', 2),'()') as k_date_only
from rawinput
cross join lateral jsonb_each_text(activity) as e(k, v)
)
select id,
k_date_only as activity_date,
min(v) filter (where k_no_date = 'Date') as date,
min(v) filter (where k_no_date = 'Type') as type
from parse
group by id, k_date_only;

Related

How to join 2 tables on fields which have different formats?

I have 2 tables with the following structure:
Table A:
id - number
a_d - text
where A.a_d has the text format: "yyyy-mm-dd 00:00:00" (examples: 2001-08-22 00:00:00, or 2002-03-23 00:00:00)
Table B:
id - number
a_d - text
where B.a_d has the text format: "dd-month-yyyy" (example: 01-jul-2002 or 09-feb-2005)
I want to run join query on the text fields of those table.
select a.a_d
from A a
join B b
on a.a_d =?= b.a_d
I can't change or update the tables, just get data from them
How can I compare this 2 fields, if there have different format ?
Use TO_DATE to convert the text dates into bona fide dates before comparing:
SELECT a.a_d
FROM A a
INNER JOIN B b
ON a.a_d::date = TO_DATE(b.a_d, 'DD-mon-YYYY');
Note that the a_d field in table A happens to be a text timestamp which can already be directly cast to date, so we only need TO_DATE for the B table.
Ideally you should store your dates and timestamps in proper columns rather than text. Then, the join would be possible without costly conversions.

Extract all the values in jsonb into a row

I'm using postgresql 11, I have a jsonb which represent a row of that table, it's look like
{"userid":"test","rolename":"Root","loginerror":0,"email":"superadmin#ae.com",...,"thirdpartyauthenticationkey":{}}
is there any method that I could gather all the "values" of the jsonb into a string which is separated by ',' and without the keys?
The string I want to obtain with the jsonb above is like
(test, Root, 0, superadmin#ae.com, ..., {})
I need to keep the ORDER of those values as what their keys were in the jsonb. Could I do that with postgresql?
You can use the jsonb_populate_record function (assuming your json data does match the users table). This will force the text value to match the order of your users table:
Schema (PostgreSQL v13)
CREATE TABLE users (
userid text,
rolename text,
loginerror int,
email text,
thirdpartyauthenticationkey json
)
Query #1
WITH d(js) AS (
VALUES
('{"userid":"test", "rolename":"Root", "loginerror":0, "email":"superadmin#ae.com", "thirdpartyauthenticationkey":{}}'::jsonb),
('{"userid":"other", "rolename":"User", "loginerror":324, "email":"nope#ae.com", "thirdpartyauthenticationkey":{}}'::jsonb)
)
SELECT jsonb_populate_record(null::users, js),
jsonb_populate_record(null::users, js)::text AS record_as_text,
pg_typeof(jsonb_populate_record(null::users, js)::text)
FROM d
;
jsonb_populate_record
record_as_text
pg_typeof
(test,Root,0,superadmin#ae.com,{})
(test,Root,0,superadmin#ae.com,{})
text
(other,User,324,nope#ae.com,{})
(other,User,324,nope#ae.com,{})
text
Note that if you're building this string to insert it back into postgresql then you don't need to do that, since the result of jsonb_populate_record will match your table:
Query #2
WITH d(js) AS (
VALUES
('{"userid":"test", "rolename":"Root", "loginerror":0, "email":"superadmin#ae.com", "thirdpartyauthenticationkey":{}}'::jsonb),
('{"userid":"other", "rolename":"User", "loginerror":324, "email":"nope#ae.com", "thirdpartyauthenticationkey":{}}'::jsonb)
)
INSERT INTO users
SELECT (jsonb_populate_record(null::users, js)).*
FROM d;
There are no results to be displayed.
Query #3
SELECT * FROM users;
userid
rolename
loginerror
email
thirdpartyauthenticationkey
test
Root
0
superadmin#ae.com
[object Object]
other
User
324
nope#ae.com
[object Object]
View on DB Fiddle
You can use jsonb_each_text() to get a set of a text representation of the elements, string_agg() to aggregate them in a comma separated string and concat() to put that in parenthesis.
SELECT concat('(', string_agg(value, ', '), ')')
FROM jsonb_each_text('{"userid":"test","rolename":"Root","loginerror":0,"email":"superadmin#ae.com","thirdpartyauthenticationkey":{}}'::jsonb) jet (key,
value);
db<>fiddle
You didn't provide DDL and DML of a (the) table the JSON may reside in (if it does, that isn't clear from your question). The demonstration above therefore only uses the JSON you showed as a scalar. If you have indeed a table you need to CROSS JOIN LATERAL and GROUP BY some key.
Edit:
If you need to be sure the order is retained and you don't have that defined in a table's structure as #Marth's answer assumes, then you can of course extract every value manually in the order you need them.
SELECT concat('(',
concat_ws(', ',
j->>'userid',
j->>'rolename',
j->>'loginerror',
j->>'email',
j->>'thirdpartyauthenticationkey'),
')')
FROM (VALUES ('{"userid":"test","rolename":"Root","loginerror":0,"email":"superadmin#ae.com","thirdpartyauthenticationkey":{}}'::jsonb)) v (j);
db<>fiddle

Fetch rows from postgres table which contains a specific id in jsonb[] column

I have a details table with adeet column defined as jsonb[]
a sample value stored in adeet column is as below image
Sample data stored in DB :
I want to return the rows which satisfies id=26088 i.e row 1 and 3
I have tried array operations and json operations but it does'nt work as required. Any pointers
Obviously the type of the column adeet is not of type JSON/JSONB, but maybe VARCHAR and we should fix the format so as to convert into a JSONB type. I used replace() and r/ltrim() funcitons for this conversion, and preferred to derive an array in order to use jsonb_array_elements() function :
WITH t(jobid,adeet) AS
(
SELECT jobid, replace(replace(replace(adeet,'\',''),'"{','{'),'}"','}')
FROM tab
), t2 AS
(
SELECT jobid, ('['||rtrim(ltrim(adeet,'{'), '}')||']')::jsonb as adeet
FROM t
)
SELECT t.*
FROM t2 t
CROSS JOIN jsonb_array_elements(adeet) j
WHERE (j.value ->> 'id')::int = 26088
Demo
You want to combine JSONB's <# operator with the generic-array ANY construct.
select * from foobar where '{"id":26088}' <# ANY (adeet);

Postgres - use a dictionary to convert column names?

I have a table with columns FIRSTNAME LASTNAME and I want to create a third column that combines those two columns into FIRSTNAME_LASTNAME but ALSO uses a special dictionary to convert some of the names. Say I just want to apply it to the FIRSTNAME, e.g.:
Albert -> Funnyguy, Kathleen -> Nerd, Megan -> Weirdo
So the new column for the "Albert Jones" row would be "Funnyguy_Jones".
Currently I do this in psycopg2 by reading in all the rows (in batches because the db is huge), using a python dictionary to convert and create the new column, then sending out the updates with UPDATE table SET newcol = tmp.newcol FROM (VALUES ...) etc. This is very slow because of reading it into python. Any tips?
EDIT: not all of the names have conversions (only like 10% of them do, for those I want to keep the original name)
If left join has a match COALESCE will choose t2.newName, other wise you will choose t1.firstName
SELECT t1.firstName,
t1.lastName,
COALESCE(t2.newName, t1.firstName) + '_' + t1.lastName as combinedName
FROM firstTable t1
LEFT JOIN newTable t2
ON t1.firstName = t2.firstName

Postgres: buckets always filled from left in crosstab query

My query looks like this:
SELECT mthreport.*
FROM crosstab
('SELECT
to_char(ipstimestamp, ''mon DD HH24h'') As row_name,
varid::text || log.varid || ''_'' || ips.objectname::text As bucket,
COUNT(*)::integer As bucketvalue
FROM loggingdb_ips_boolean As log
INNER JOIN IpsObjects As ips
ON log.Varid=ips.ObjectId
WHERE ((log.varid = 37551)
OR (log.varid = 27087)
OR (log.varid = 50876)
OR (log.varid = 45096)
OR (log.varid = 54708)
OR (log.varid = 47475)
OR (log.varid = 54606)
OR (log.varid = 25528)
OR (log.varid = 54729))
GROUP BY to_char(ipstimestamp, ''yyyy MM DD HH24h''), row_name, objectid, bucket
ORDER BY to_char(ipstimestamp, ''yyyy MM DD HH24h''), row_name, objectid, bucket' )
As mthreport(item_name text, varid_37551 integer,
varid_27087 integer ,
varid_50876 integer ,
varid_45096 integer ,
varid_54708 integer ,
varid_47475 integer ,
varid_54606 integer ,
varid_25528 integer ,
varid_54729 integer ,
varid_29469 integer)
the query can be tested against a test table with this connection string:
"host=bellariastrasse.com port=5432 dbname=IpsLogging user=guest password=guest"
The query is syntactically correct and runs fine. My problem is that it the COUNT(*) values are always filling the leftmost column. however, in many instances the left columns should have a zero, or a NULL, and only the 2nd (or n-th) column should be filled. My brain is melting and I cannot figure out what is wrong!
The solution for your problem is to use the crosstab() variant with two parameters.
The second parameter (another query string) produces the list of output columns, so that NULL values in the data query (the first parameter) are assigned correctly.
Check the manual for the tablefunc extension, and in particular crosstab(text, text):
The main limitation of the single-parameter form of crosstab is that
it treats all values in a group alike, inserting each value into the
first available column. If you want the value columns to correspond to
specific categories of data, and some groups might not have data for
some of the categories, that doesn't work well. The two-parameter form
of crosstab handles this case by providing an explicit list of the
categories corresponding to the output columns.
Emphasis mine. I posted a couple of related answers recently here or here or here.