How to work with data values formatted [{}, {}, {}] - postgresql

I apologize if this is a simple question - I had some trouble even formatting the question when I was trying to Google for help!
In one of the tables I am working with, there's a column whose values look like this:
Invoice ID | Status    | Product List
-----------+-----------+------------------------------------------------------------------------------------
1234       | Processed | [{"product_id":463153},{"product_id":463165},{"product_id":463177},{"pid":463218}]
I want to count how many products each order purchased. What is the proper way to count the values in the "Product List" column? I'm aware that count() alone is wrong, and that I probably need to extract the data from the string value first.
select invoice_id, count(Product_list)
from quote_table
where status = 'processed'
group by invoice_id

You can use the json_array_length() function after casting the column to the json type (as long as the values are valid JSON), for example:
select invoice_id, json_array_length(Product_list::json) as count
from quote_table
where status = 'processed'
group by invoice_id;
invoice_id | count
------------+-------
1234 | 4
(1 row)

If you need to count only a specific property inside the JSON column, you can use the query below.
It solves the problem with create type, json_populate_recordset and a subquery that counts the product_id keys inside the JSON data.
drop type if exists count_product;
create type count_product as (product_id int);

select
    t.invoice_id,
    t.status,
    (
        select count(*)
        from json_populate_recordset(
            null::count_product,
            t.Product_list
        )
        where product_id is not null
    ) as count_produto_id
from (
    -- symbolic data to use in query
    select
        1234 as invoice_id,
        'processed' as status,
        '[{"product_id":463153},{"product_id":463165},{"product_id":463177},{"pid":463218}]'::json as Product_list
) as t
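
If you'd rather not create a type at all, here is a lighter sketch (my own variation, not part of the answer above) that expands the array with jsonb_array_elements() and counts only the elements that actually carry a product_id key:

-- cast to jsonb so the ? (key exists) operator is available
select t.invoice_id,
       count(*) filter (where elem ? 'product_id') as count_product_id
from (
    select 1234 as invoice_id,
           '[{"product_id":463153},{"product_id":463165},{"product_id":463177},{"pid":463218}]'::jsonb as Product_list
) as t
cross join lateral jsonb_array_elements(t.Product_list) as elem
group by t.invoice_id;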

Create rows from part of column names

Source data
I am working on an ELT project to load data from CSV files into PostgreSQL where I will transform it. The CSV files have many columns that are consistent across files, but also contain activity columns that are inconsistent with names like Date (05/19/2020), Type (05/19/2020), etc.
In the loading script I am merging all of the columns with dates in the column name into one jsonb column so I don't have to constantly add new columns to the raw data table.
The resulting jsonb column in the raw data table looks like this:
id       | activity
---------+---------------------------------------------------------------------------------------------------------------------------
12345678 | {"Date (05/19/2020)": null, "Type (05/19/2020)": null, "Date (06/03/2020)": "06/01/2020", "Type (06/03/2020)": "E"}
98765432 | {"Date (05/19/2020)": "05/18/2020", "Type (05/19/2020)": "B", "Date (10/23/2020)": "10/26/2020", "Type (10/23/2020)": "T"}
JSON to columns
Using the amazing create_jsonb_flat_view function from this post I can convert the jsonb to columns like this:
id       | Date (05/19/2020) | Type (05/19/2020) | Date (06/03/2020) | Type (06/03/2020) | Date (10/23/2020) | Type (10/23/2020)
---------+-------------------+-------------------+-------------------+-------------------+-------------------+------------------
10629465 | null              | null              | 06/01/2020        | E                 | null              | null
98765432 | 05/18/2020        | B                 | null              | null              | 10/26/2020        | T
Need to move part of column name to row
Now, this is where I'm stuck. I need to remove the portion of the column name that is the Activity Date (e.g. (05/19/2020)) and create a row for each id and ActivityDate with additional columns for Date and Type like this:
id       | ActivityDate | Date       | Type
---------+--------------+------------+-----
12345678 | 05/19/2020   | null       | null
12345678 | 06/03/2020   | 06/01/2020 | E
98765432 | 05/19/2020   | 05/18/2020 | B
98765432 | 10/23/2020   | 10/26/2020 | T
I followed your link to the create_jsonb_flat_view article yesterday and then forgot this question. While I thank you for pointing me there, I think that mentioning it worked against you.
A more conventional approach using regexp_replace() works here. I left the date values as strings, but you can convert them with to_date() if needed:
with parse as (
    select id, e.k, e.v,
           regexp_replace(e.k, '\s+\([0-9/]{10}\)', '') as k_no_date,
           regexp_replace(e.k, '^.+([0-9/]{10}).+', '\1') as k_date_only
    from rawinput
    cross join lateral jsonb_each_text(activity) as e(k, v)
)
select id,
       k_date_only as activity_date,
       min(v) filter (where k_no_date = 'Date') as date,
       min(v) filter (where k_no_date = 'Type') as type
from parse
group by id, k_date_only;
db<>fiddle here
@Mike-Organek's answer works beautifully!
However, I was curious whether the regexp_replace() calls might be slowing the query down a bit, and it seemed I could get the same result with a simpler function.
Since Mike gave me a great example to start with, I modified it to split on the space between Date and (05/19/2020).
For 20,000 rows, the query went from an average of 7 seconds on my local machine to an average of 0.9 seconds.
Here is the resulting query:
with parse as (
    select id, e.k, e.v,
           split_part(e.k, ' ', 1) as k_no_date,
           trim(split_part(e.k, ' ', 2), '()') as k_date_only
    from rawinput
    cross join lateral jsonb_each_text(activity) as e(k, v)
)
select id,
       k_date_only as activity_date,
       min(v) filter (where k_no_date = 'Date') as date,
       min(v) filter (where k_no_date = 'Type') as type
from parse
group by id, k_date_only;
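If the string dates eventually need to be real dates, here is a small follow-up sketch (my addition, assuming the MM/DD/YYYY format shown above) that applies to_date() on top of the same query; null values simply stay null:

with parse as (
    select id, e.k, e.v,
           split_part(e.k, ' ', 1) as k_no_date,
           trim(split_part(e.k, ' ', 2), '()') as k_date_only
    from rawinput
    cross join lateral jsonb_each_text(activity) as e(k, v)
)
select id,
       to_date(k_date_only, 'MM/DD/YYYY') as activity_date,
       to_date(min(v) filter (where k_no_date = 'Date'), 'MM/DD/YYYY') as date,
       min(v) filter (where k_no_date = 'Type') as type
from parse
group by id, k_date_only;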

All records returned when expanding JSON with IN clause

I'm using JSON to pass in a list of IDs; however, the following query always returns all records.
select first_name
from app.users
where id in (
select id::varchar::bigint
from json_array_elements('[9497902]'::json) id
);
Swapping out the JSON for manually written IDs brings back the expected number of records:
select first_name
from app.users
where id in (
9497902
);
Is there something I am missing to be able to work with JSON ID's?
demo: db<>fiddle
select name
from users
where id in (
select id.value::varchar::bigint
from json_array_elements('[1,3]'::json) id
);
The name id used within the subquery is resolved against the outer query, not against your function alias. You can verify this by executing this query:
select name
from users
where id in (
select id
);
The alias of your JSON function is id, but that is the name of the result set, not of the column. The default column name of this function is value. So you have to use SELECT id.value ... or simply SELECT value to reference the correct value.
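
As an alternative sketch (not from the original answer), json_array_elements_text() returns the elements as text, so only one cast is needed:

select first_name
from app.users
where id in (
    select value::bigint
    from json_array_elements_text('[9497902]'::json)
);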

Postgres query for report

I'm trying to solve this problem:
I have a query/view that joins ~10 tables to extract some fields for a report (if any). The query doesn't use any grouping function, only joins, and it cuts off some data that isn't useful.
From this one big view I have to group by the first column, take the max of the date in the second column, and return all the other fields from the record that holds that max value.
I haven't been able to do this in Postgres.
As a pseudo code I can give this:
select 1
, max(2)
, 3 referred to the record from max(2)
, 4 referred to the record from max(2)
, ...
, 20 referred to the record from max(2)
from (ViewWithAllJoins) a
group by 1
For privacy and business reasons I had to obfuscate some information; 1/2/3/4... stand for the column names of the view "ViewWithAllJoins". I hope the problem is still understandable and solvable!
I've tried a window function as described in Convert keep dense_rank from Oracle query into postgres, but I couldn't combine it with the GROUP BY that I need. I also tried dense_rank as shown in Dense_rank first Oracle to Postgresql convert, but I can't make any assumption about the order of the data in any field other than 1 and 2, so I can't use any aggregate function on them.
Any ideas? Preferably without adding too many subqueries.
Thank you!
EDIT:
As suggested I'll add some synthetic data to better understand the problem and what I want.
Start:
ID DATE COLUMN1 COLUMN2 COLUMN3
=====================================================================
88888888;"2016-04-02 09:00:00";"aaaaaaaaaaa";"TEXT89" ; 999999999
88888888;"2018-08-21 09:00:00";"a" ;"TEXT1" ; 988888888
88888888;"2017-11-09 09:00:00";"zzzz" ;"TEXT80000" ; 850580582
75858585;"2017-01-31 09:00:00";"~~~~~~~~~~~";"TEXT10" ; 101010101
75858585;"2018-04-02 09:00:00";"eeeeeeeeeee";"TEXT1000" ; 111111111
99999999;"2016-04-02 09:00:00";"8d2ecafd866";"TEXT808911"; 777777777
What I want:
ID DATE COLUMN1 COLUMN2 COLUMN3
===================================================================
88888888;"2018-08-21 09:00:00";"a" ;"TEXT1" ; 988888888
75858585;"2018-04-02 09:00:00";"eeeeeeeeeee";"TEXT1000" ; 111111111
99999999;"2016-04-02 09:00:00";"8d2ecafd866";"TEXT808911"; 777777777
So the group by id, the max of the date and the other fields related to the row of the max date.
So you have duplicate records per ID, and for every ID you want to select the record with the most recent date?
Use NOT EXISTS:
SELECT id, zdate, column1, column2, column3 -- , ...
FROM queryview t
WHERE NOT EXISTS (
    SELECT *
    FROM queryview x
    WHERE x.id = t.id
      AND x.zdate > t.zdate
);
Or, use row_number() over a window, and pick only the row with the final date:
SELECT id, zdate, column1, column2, column3 -- , ...
FROM (
    SELECT *,
           row_number() OVER (PARTITION BY id ORDER BY zdate DESC) AS rn
    FROM queryview
) q
WHERE q.rn = 1;
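A third option, not shown above but idiomatic in Postgres, is DISTINCT ON, which keeps exactly one row per id, chosen by the ORDER BY; a sketch against the same queryview:

SELECT DISTINCT ON (id)
       id, zdate, column1, column2, column3 -- , ...
FROM queryview
ORDER BY id, zdate DESC;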

Cast a PostgreSQL column to stored type

I am creating a viewer for PostgreSQL. My SQL needs to sort on the type that is normal for that column. Take for example:
Table:
CREATE TABLE contacts (id serial primary key, name varchar)
SQL:
SELECT id::text FROM contacts ORDER BY id;
Gives:
1
10
100
2
Ok, so I change the SQL to:
SELECT id::text FROM contacts ORDER BY id::regtype;
Which results in:
1
2
10
100
Nice! But now I try:
SELECT name::text FROM contacts ORDER BY name::regtype;
Which results in:
invalid type name "my first string"
Google is no help. Any ideas? Thanks
To repeat: the error itself is not my problem. My problem is that I need to convert each column to text, but order by the natural type of that column.
regtype is an object identifier type, and there is no reason to use it when you are not referring to system objects (types, in this case).
You should cast the column to integer in the first query:
SELECT id::text
FROM contacts
ORDER BY id::integer;
You can use qualified column names in the order by clause. This will work with any sortable type of column.
SELECT id::text
FROM contacts
ORDER BY contacts.id;
So, I found two ways to accomplish this. The first is the approach @klin suggested: query the table once, then construct my own query based on the returned metadata. An untested psycopg2 example:
c = conn.cursor()

# Fetch one row just to get the column metadata from cursor.description.
c.execute("SELECT * FROM contacts LIMIT 1")

sort_by_sql = ""
for col in c.description:
    if col.name == "my_sort_column":
        # 23 is the OID of the int4 type.
        if col.type_code == 23:
            sort_by_sql = "ORDER BY " + col.name + "::integer"
        else:
            sort_by_sql = "ORDER BY " + col.name + "::text"

c.execute("SELECT * FROM contacts " + sort_by_sql)
A more elegant way would be like this:
SELECT id::text AS _id, name::text AS _name FROM contacts ORDER BY id
This uses aliases so that ORDER BY still refers to the original column rather than the text cast. The last option is more readable, if nothing else.
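If the viewer needs to discover column types from SQL rather than from driver type codes, one more option (my addition, not part of the answers above) is to read them from the catalog; contacts is the example table from the question:

SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'contacts'
ORDER BY ordinal_position;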

Can the categories in the postgres tablefunc crosstab() function be integers?

It's all in the title. The documentation has something like this:
SELECT *
FROM crosstab('...') AS ct(row_name text, category_1 text, category_2 text);
I have two tables, lab_tests and lab_tests_results. All of the lab_tests_results rows are tied to the primary key id integer in the lab_tests table. I'm trying to make a pivot table where the lab tests (identified by an integer) are row headers and the respective results are in the table. I can't get around a syntax error at or around the integer.
Is this possible with the current set up? Am I missing something in the documentation? Or do I need to perform an inner join of sorts to make the categories strings? Or modify the lab_tests_results table to use a text identifier for the lab tests?
Thanks for the help, all. Much appreciated.
Edit: Got it figured out with the help of Dmitry. He had the data layout figured out, but I was unclear on what kind of output I needed. I was trying to get the pivot table to be based on batch_id numbers in the lab_tests_results table. Had to hammer out the base query and the data type casts.
SELECT *
FROM crosstab('SELECT lab_tests_results.batch_id, lab_tests.test_name, lab_tests_results.test_result::FLOAT
FROM lab_tests_results, lab_tests
WHERE lab_tests.id=lab_tests_results.lab_test AND (lab_tests.test_name LIKE ''Test Name 1'' OR lab_tests.test_name LIKE ''Test Name 2'')
ORDER BY 1,2'
) AS final_result(batch_id VARCHAR, test_name_1 FLOAT, test_name_2 FLOAT);
This provides a pivot table from the lab_tests_results table like below:
batch_id |test_name_1 |test_name_2
---------------------------------------
batch1 | result1 | <null>
batch2 | result2 | result3
If I understand correctly your tables look something like this:
CREATE TABLE lab_tests (
id INTEGER PRIMARY KEY,
name VARCHAR(500)
);
CREATE TABLE lab_tests_results (
id INTEGER PRIMARY KEY,
lab_tests_id INTEGER REFERENCES lab_tests (id),
result TEXT
);
And your data looks something like this:
INSERT INTO lab_tests (id, name)
VALUES (1, 'test1'),
(2, 'test2');
INSERT INTO lab_tests_results (id, lab_tests_id, result)
VALUES (1,1,'result1'),
(2,1,'result2'),
(3,2,'result3'),
(4,2,'result4'),
(5,2,'result5');
First of all, crosstab is part of the tablefunc extension, so you need to enable it:
CREATE EXTENSION tablefunc;
You need to run this once per database, as explained in this answer.
The final query will look like this:
SELECT *
FROM crosstab(
'SELECT lt.name::TEXT, lt.id, ltr.result
FROM lab_tests AS lt
JOIN lab_tests_results ltr ON ltr.lab_tests_id = lt.id'
) AS ct(test_name text, result_1 text, result_2 text, result_3 text);
Explanation:
The crosstab() function takes the text of a query which should return 3 columns: (1) a row name, (2) a category, (3) the value. The wrapping query just selects everything that crosstab() returns and defines the list of output columns (the part after AS). First comes the row name (test_name) and then the value columns (result_1, result_2, result_3). In my query I'll get up to 3 results per row. If there are more than 3 results, I won't see them; if there are fewer than 3, I'll get nulls.
The result for this query is:
test_name |result_1 |result_2 |result_3
---------------------------------------
test1 |result1 |result2 |<null>
test2 |result3 |result4 |result5
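
One caveat with the one-argument form: values are filled into the output columns left to right, so if a batch is missing "Test Name 1", its "Test Name 2" result would land in the wrong column. The two-argument form crosstab(source_sql, category_sql) matches values against an explicit category list instead. A sketch based on the question's own query, reusing its batch_id, lab_test and test_result columns:

SELECT *
FROM crosstab(
    'SELECT ltr.batch_id, lt.test_name, ltr.test_result::FLOAT
     FROM lab_tests_results ltr
     JOIN lab_tests lt ON lt.id = ltr.lab_test
     WHERE lt.test_name IN (''Test Name 1'', ''Test Name 2'')
     ORDER BY 1, 2',
    'VALUES (''Test Name 1''), (''Test Name 2'')'
) AS final_result(batch_id VARCHAR, test_name_1 FLOAT, test_name_2 FLOAT);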