How to query a nested JSONB data column in PostgreSQL?

I have gene expression data in a jsonb column, in multiple rows for different samples, as shown below:
Sample Gexp
Sample A {"data": [{"pval": 0.0154, "Protein": "A0A0B4J2D5", "FoldChange": 1.3534, "MinusLog10p": 0.1334, "Significance": "Non-significant"}, {"pval": 0.0689, "Protein": "A0FGR8", "FoldChange": 2.5448, "MinusLog10p": 1.1615, "Significance": "Significant"}]}
Sample B {"data": [{"pval": 0.0824, "Protein": "A0A0B4J2D5", "FoldChange": -0.1676, "MinusLog10p": 0.1084, "Significance": "Non-significant"}, {"pval": 0.0219, "Protein": "A0FGR8", "FoldChange": 2.3448, "MinusLog10p": 1.1615, "Significance": "Significant"}]}
I need to query across the column containing multiple records where a certain protein has a pval or FoldChange in a certain range. I tried multiple solutions provided in this forum (Search in nested Postgresql JSONB column, Postgresql query for objects in nested JSONB field, Query simplified JSONB form JSONB column containing nested JSON from a Postgresql database?, How to query nested array with heterogeneous elements in PostgreSQL JSONB column, etc.), with no luck. Can someone help me?

The conditions for selecting the data were not described precisely (unambiguously) in the question. For example, when we are looking for the A0FGR8 protein with pval in the range 0.02 to 0.03, the query might look like this:
select sample, value
from my_table
cross join jsonb_array_elements(gexp->'data')
where value->>'Protein' = 'A0FGR8'
and (value->>'pval')::numeric between 0.02 and 0.03
Test the query in Db<>fiddle.
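The same pattern extends to other keys such as FoldChange. A minimal sketch, assuming the same my_table/gexp names as above, that returns only the matching array elements:
select sample, elem
from my_table
cross join lateral jsonb_array_elements(gexp->'data') as elem
where elem->>'Protein' = 'A0FGR8'
  and (elem->>'pval')::numeric between 0.02 and 0.03
  and (elem->>'FoldChange')::numeric >= 2.0;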

Related

Query Importrange with concat - adding 2 of the returned columns together

I am trying to import specified data using QUERY with IMPORTRANGE, but at the same time I want to reduce the need for additional calculation columns by using CONCAT or something similar to add 2 columns together with a space in between, i.e. first name 'bob' and last name 'smith' return 'bob smith' in 1 column.
=QUERY({IMPORTRANGE("https://docs.google.com/spreadsheets/d/1oaZP3-p1cI4d1QyLQ2qM5sMwnVGz8S0bhe29W4QqH6g/edit#gid=1908577977","Sheet7!A2:c"),"select Col1&" "&Col2,Col3",0})
I've tried the above, but it returns a formula parse error.
https://docs.google.com/spreadsheets/d/1oaZP3-p1cI4d1QyLQ2qM5sMwnVGz8S0bhe29W4QqH6g/edit?usp=sharing
In post-IMPORTRANGE, you can join two columns only like this:
=FLATTEN(QUERY(TRANSPOSE(QUERY(
IMPORTRANGE("13Ptmj3sejlOADvwhgfBPxRy_H-RGCxLX4r2jecbceIE", "Sheet7!A2:C"),
"select Col1,Col2", )),,9^9))
so for 3 columns:
={FLATTEN(QUERY(TRANSPOSE(QUERY(
IMPORTRANGE("13Ptmj3sejlOADvwhgfBPxRy_H-RGCxLX4r2jecbceIE", "Sheet7!A2:C"),
"select Col1,Col2", )),,9^9)),
IMPORTRANGE("13Ptmj3sejlOADvwhgfBPxRy_H-RGCxLX4r2jecbceIE", "Sheet7!C2:C")}

Convert jsonb column to a user-defined type

I'm trying to convert each row in a jsonb column to a type that I've defined, and I can't quite seem to get there.
I have an app that scrapes articles from The Guardian Open Platform and dumps the responses (as jsonb) in an ingestion table, into a column called 'body'. Other columns are a sequential ID, and a timestamp extracted from the response payload that helps my app only scrape new data.
I'd like to move the response dump data into a properly-defined table, and as I know the schema of the response, I've defined a type (my_type).
I've been referring to section 9.16, JSON Functions and Operators, in the Postgres docs. I can get a single record as my type:
select * from jsonb_populate_record(null::my_type, (select body from data_ingestion limit 1));
produces
     id     |     type     |     sectionId      | ...
------------+--------------+--------------------+-----
 example_id | example_type | example_section_id | ...
(abbreviated for concision)
If I remove the limit, I get an error, which makes sense: the subquery would be providing multiple rows to jsonb_populate_record which only expects one.
I can get it to do multiple rows, but the result isn't broken into columns:
select jsonb_populate_record(null::my_type, body) from reviews_ingestion limit 3;
produces:
jsonb_populate_record
(example_id_1,example_type_1,example_section_id_1,...)
(example_id_2,example_type_2,example_section_id_2,...)
(example_id_3,example_type_3,example_section_id_3,...)
This is a bit odd; I would have expected to see column names, as that after all is the point of providing the type.
I'm aware I can do this by using Postgres JSON querying functionality, e.g.
select
body -> 'id' as id,
body -> 'type' as type,
body -> 'sectionId' as section_id,
...
from reviews_ingestion;
This works but it seems quite inelegant. Plus I lose datatypes.
I've also considered aggregating all rows in the body column into a JSON array, so as to be able to supply this to jsonb_populate_recordset but this seems a bit of a silly approach, and unlikely to be performant.
Is there a way to achieve what I want, using Postgres functions?
Maybe you need this, to break the my_type record into columns:
select (jsonb_populate_record(null::my_type, body)).*
from reviews_ingestion
limit 3;
-- or whatever other query clauses here
i.e. select all from these my_type records. All column names and types are in place.
Here is an illustration. My custom type is delmet and the CTE t remotely mimics data_ingestion.
create type delmet as (x integer, y text, z boolean);
with t(i, j, k) as
(
values
(1, '{"x":10, "y":"Nope", "z":true}'::jsonb, 'cats'),
(2, '{"x":11, "y":"Yep", "z":false}', 'dogs'),
(3, '{"x":12, "y":null, "z":true}', 'parrots')
)
select i, (jsonb_populate_record(null::delmet, j)).*, k
from t;
Result:
 i | x  |  y   |   z   |    k
---+----+------+-------+---------
 1 | 10 | Nope | true  | cats
 2 | 11 | Yep  | false | dogs
 3 | 12 |      | true  | parrots
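The same result can also be spelled with the function in the FROM clause; this is just a sketch reusing the delmet type and the same sample values, and on older Postgres versions it avoids the (...).* form being expanded into one function call per output column:
with t(i, j, k) as
(
values
(1, '{"x":10, "y":"Nope", "z":true}'::jsonb, 'cats'),
(2, '{"x":11, "y":"Yep", "z":false}', 'dogs'),
(3, '{"x":12, "y":null, "z":true}', 'parrots')
)
select i, r.*, k
from t
cross join lateral jsonb_populate_record(null::delmet, j) as r;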

postgres remove specific element from jsonb array

I am using postgres 10
I have a JsonArray in a jsonb column named boards.
I have a GIN index on the jsonb column.
The column values look like this:
[{"id": "7beacefa-9ac8-4fc6-9ee6-8ff6ab1a097f"},
{"id": "1bc91c1c-b023-4338-bc68-026d86b0a140"}]
I want to delete in all the rows in the column the element
{"id": "7beacefa-9ac8-4fc6-9ee6-8ff6ab1a097f"} if such exists(update the column).
I saw that it is possible to delete an element by position with the #- operator (e.g. #- '{1}'), and I know you can get the position of an element using WITH ORDINALITY, but I can't manage to combine the two things.
How can I update the JSON array?
One option would be an update statement containing a query that selects all the sub-elements except {"id": "7beacefa-9ac8-4fc6-9ee6-8ff6ab1a097f"} through an inequality, and then applies the jsonb_agg() function to aggregate those sub-elements:
UPDATE user_boards
SET boards = (SELECT jsonb_agg(j.elm)
FROM user_boards u
CROSS JOIN jsonb_array_elements(boards) j(elm)
WHERE j.elm->>'id' != '7beacefa-9ac8-4fc6-9ee6-8ff6ab1a097f'
AND u.ID = user_boards.ID
GROUP BY ID)
where ID is an assumed identity (unique) column of the table.
Demo
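Since the question also asks how to combine the #- operator with WITH ORDINALITY, here is a hedged sketch of that variant; it assumes the same ID column, removes a single matching element per row, and only touches rows that actually contain it:
UPDATE user_boards u
SET boards = u.boards #- ARRAY[(p.pos - 1)::text]
FROM (SELECT id, pos
      FROM user_boards,
           jsonb_array_elements(boards) WITH ORDINALITY AS e(elm, pos)
      WHERE elm->>'id' = '7beacefa-9ac8-4fc6-9ee6-8ff6ab1a097f') p
WHERE u.id = p.id;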

Deep search within jsonb field PostgreSQL

A sample of my data looks something like this:
{"city": "NY",
"skills": [
{"soft_skills": "Analysis"},
{"soft_skills": "Procrastination"},
{"soft_skills": "Presentation"}
],
"areas_of_training": [
{"areas of training": "Visio"},
{"areas of training": "Office"},
{"areas of training": "Risk Assesment"}
]}
I would like to run a query to find users with soft_skills Analysis and maybe run another one to find users whose area of training is Visio and Risk Assesment
My column type is jsonb. How can I implement a search query on these deeply nested objects? A query on level one for city works using SELECT * FROM mydata WHERE content::json->>'city'='NY';
How can I also run a match using the LIKE keyword or string matching for deeply nested values?
1)
SELECT * FROM mydata
WHERE content->'skills' @> '[{"soft_skills": "Analysis"}]';
2)
SELECT * FROM mydata
WHERE content->'areas_of_training' @> '[{"areas of training": "Visio"},{"areas of training": "Risk Assesment"}]';
About JSON(B) operators
PS: And be ready for extremely slow queries. I highly recommend thinking about data normalization.
Update for LIKE
For your example data it could be:
SELECT * FROM mydata
WHERE EXISTS (
SELECT *
FROM jsonb_array_elements(content->'areas_of_training') as a
WHERE a->>'areas of training' ilike '%vi%');
But the query depends heavily on the actual JSON structure.
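If you stay with the jsonb layout, a GIN expression index can at least support the two containment queries above (it will not help the ILIKE search). A sketch only, with assumed index names:
CREATE INDEX mydata_skills_gin
    ON mydata USING gin ((content->'skills') jsonb_path_ops);
CREATE INDEX mydata_training_gin
    ON mydata USING gin ((content->'areas_of_training') jsonb_path_ops);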
Use jsonb_array_elements() to get the values of nested elements, examples:
select d.*
from mydata d,
jsonb_array_elements(content->'skills')
where value->>'soft_skills' ilike '%analysis%';
select d.*
from mydata d,
jsonb_array_elements(content->'areas_of_training')
where value->>'areas of training' ~* 'visio|office';
It is possible that the query yields duplicate rows, so it is reasonable to use select distinct on (id), where id is a primary key.
Note that the function jsonb_array_elements() is costly and, in contrast to Abelisto's solution, you cannot use indexes. However, you have to use it if you want access to the values of nested json elements.
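A sketch of the distinct on variant mentioned above, assuming the primary key column is literally named id:
select distinct on (d.id) d.*
from mydata d
cross join jsonb_array_elements(content->'areas_of_training') as a(value)
where a.value->>'areas of training' ~* 'visio|office';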

filtering on a range of values in a db column with sqlalchemy orm

I have a PostgreSQL database with one particular table that has many rows. One column in this table, called data, is a float array, REAL[], and gets filled with an array of ~4500 elements. I want to access this table through some query via SQLAlchemy and the ORM.
How do I select all rows in the table where a subset of this column satisfies some condition, e.g. contains a range of values? For example, I want to select all rows where the data contains values >= 10, or values between >= 10 and <= 20.
Can I do this with a straight session query like
rows = session.query(Table).filter(Table.data.(some conditional)).all()
where my conditional is something like "VALUES >= 10 and VALUES <= 20"?
Or do I need to define some special methods, or setup, when I'm defining my SQLAlchemy table class? For example, I have my table set up as
class Table(Base):
    __tablename__ = 'table'
    __table_args__ = {'autoload': True, 'schema': 'testdb', 'extend_existing': True}

    data = deferred(Column(ARRAY(Float)))

    def __repr__(self):
        return '<Table (pk={0})>'.format(self.pk)
Ideally I'd like to set it up so I can just do simple filtering in my session.query calls. Is this possible? I'm not super familiar with the ORM, so maybe it is?
I've had a look at the ARRAY Comparator sqlalchemy docs but those only seem to work on exact values. My data is precise to 6 sigfigs, and I don't know the exact values ahead of time.
What's the best way to do this? Thanks.
EDIT:
Based on the below comment, here is the code I'm using in attempting to select all rows (out of 1000) that have data (from 1 column) >= 1.0. There should be 537 rows.
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0,operator=operators.le)).all()
This gives the correct subset number: len(rows) = 537. However, I don't understand the logic of this operator; to select data >= 1.0, I use the le operator? Also, along those same lines, there should be 234 rows that have data with values >= 1.0 and <= 1.2, but this statement fails to give the correct subset:
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0,operator=operators.le)).filter(datadb.Table.data.any(1.2,operator=operators.ge)).all()
EDIT 2:
Here's an example of my database Table with a few rows. pk is an integer, and data is a real[].
db: datadb
schema: Table
pk   data
0 [0.0,0.0,0.5,0.3,1.3,1.9,0.3,0.0,0.0]
1 [0.1,0.0,1.0,0.7,1.1,1.5,1.2,0.3,1.4]
2 [0.0,0.6,0.4,0.3,1.6,1.7,0.4,1.3,0.0]
3 [0.0,0.1,0.2,0.4,1.0,1.1,1.2,0.9,0.0]
4 [0.0,0.0,0.5,0.3,0.2,0.1,0.7,0.3,0.1]
I have 5 rows, 4 of them have data with values >= 1.0, while just 2 have values in the range >= 1.0 and <= 1.2. The query I would do to grab the rows is in the first case
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0,operator=operators.le)).all()
This should return the 4 rows, at pk=0,1,2,3. This query does what I expect. The second case
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0,operator=operators.le)).filter(datadb.Table.data.any(1.2,operator=operators.ge)).all()
and should return the 2 rows at pk=1,3. However this query just returns the 4 rows from the first query. For the second query, I also tried
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0,operator=operators.le),datadb.Table.data.any(1.2,operator=operators.ge)).all()
which also didn't work.
Please read the documentation on ARRAY.Comparator, according to which you should be able to do the following:
rows = (session.query(Table)
        .filter(Table.data.any(10, operator=operators.le))
        .filter(Table.data.any(20, operator=operators.ge))
        .all())
EDIT:
# combined filter does not work,
# but applying one or the other is still useful as it reduces the result set
q = (session.query(MyTable)
     .filter(MyTable.data.any(1.0, operator=operators.le))
     # .filter(MyTable.data.any(1.2, operator=operators.ge))
     )

# filter in memory
items = [_row for _row in q.all()
         if any(1.0 <= item <= 1.2 for item in _row.data)]

for item in items:
    print(item)
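For reference, the per-element range condition that the two combined .any() filters cannot express (they check the two bounds against possibly different elements) corresponds to raw SQL roughly like the sketch below; the testdb schema and "table" name are taken from the class definition above, and the statement could be run through session.execute() if needed:
SELECT t.*
FROM testdb."table" AS t
WHERE EXISTS (
    SELECT 1
    FROM unnest(t.data) AS v
    WHERE v BETWEEN 1.0 AND 1.2
);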