Extract fields from Postgres jsonb - postgresql

I'm trying to find an efficient way to extract specific fields from a Postgres jsonb column.
CREATE TABLE foo (
    id integer,
    data jsonb
);
"data" contains a row with:
{
    "firstname": "bob",
    "lastname": "smith",
    "tags": ["tag0", "tag1"]
}
I want to extract a large number of fields from the data column. This SELECT statement works, but it's cumbersome with large numbers of fields, yields really long SQL statements, and I also don't know whether it traverses the jsonb repeatedly for each column:
SELECT data->>'firstname', data->'tags' FROM foo
I tried this:
SELECT jsonb_path_query(data, '$.[firstname,tags]') FROM foo
but it gave an error message: syntax error, unexpected '['. This syntax is, in fact, valid JSONPath per https://jsonpath.com/, but it appears that Postgres doesn't implement it.
Is there a way to extract jsonb fields efficiently, both in terms of execution speed and compactness of the SQL query command?

Yes, your query will read the complete data column for all rows of foo.
Even if you normalize the data model and turn your JSON attributes into regular columns, it will read the table row by row, but then your query becomes cheaper if you only access the first couple of columns in the table.
What you are looking for is a column store, but PostgreSQL doesn't have that built in.
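For the compactness side of the question, one option (a sketch, not tested against your data) is to expand each document once per row with jsonb_to_record and list the wanted fields as typed output columns:
-- one expansion of data per row; the AS clause lists only the keys you want, with their types
SELECT r.firstname, r.tags
FROM foo
CROSS JOIN LATERAL jsonb_to_record(foo.data) AS r(firstname text, tags jsonb);
Adding more fields then means adding entries to one column-definition list rather than repeating data->> for every column.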

Related

SQLAlchemy: Efficiently substitute integer code for string name when inserting data

What is the most efficient way to substitute an integer key from a lookup table for a string in my input data?
For example, let's say I have a lookup table of country names in string format, with its primary key appearing as a foreign key column on a second table, "cities". I have a list of tuples containing data for the "cities" table, and one of those fields is the string name of the country.
So each time I insert a row for a new city, I must select the PK from the lookup table where the "country_name" string column equals the input string. Then the integer PK for that country row needs to be put into the FK "country_id" column of the row being added to "cities".
Is there a canonical way to do this in SQLAlchemy? The most obvious way would be to write a function that gets the appropriate PK with something like select(Country.country_id).where(Country.country_name == 'Ruritania').
But I wonder if SQLAlchemy has a more efficient way to do it, especially for the bulk insertion of records.
"Association Proxies" sound like what I want, but I don't understand them well enough to know how to use them in the context of bulk inserts. From what I have gathered so far, an ENUM data type would be too constraining as it cannot be updated easily, but I would consider such a solution if there is a way around that caveat.
Are there ways to make sure that values are not repeatedly read from the lookup table in a batch of operations?
I am using Postgres for my database.
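Not SQLAlchemy-specific, but one way to avoid a per-row lookup round trip is to let Postgres do the substitution in a single INSERT ... SELECT. A sketch with hypothetical names ("countries", "city_name", 'Example City'; only "cities", "country_name" and "country_id" come from the description above):
-- join the incoming values against the lookup table so the database resolves the FK itself
INSERT INTO cities (city_name, country_id)
SELECT v.city_name, c.country_id
FROM (VALUES ('Example City', 'Ruritania')) AS v(city_name, country_name)
JOIN countries AS c ON c.country_name = v.country_name;
In SQLAlchemy Core this shape can be expressed with insert(...).from_select(...), so a whole batch is substituted in one statement rather than one lookup per row.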

Handling the output of jsonb_populate_record

I'm a real beginner when it comes to SQL and I'm currently trying to build a database using postgres. I have a lot of data I want to put into my database in JSON files, but I have trouble converting it into tables. The JSON is nested and contains many variables, but the behavior of jsonb_populate_record allows me to ignore the structure I don't want to deal with right now. So far I have:
CREATE TABLE raw (records JSONB);
COPY raw FROM 'home/myuser/mydocuments/mydata/data.txt';
create type jsonb_type as (time text, id numeric);
create table test as (
    select jsonb_populate_record(null::jsonb_type, raw.records) from raw
);
When running the select statement only (without the create table) the data looks great in the GUI I use (DBeaver). However it does not seem to be an actual table as I cannot run select statements like
select time from test;
or similar. The column in my table 'test' is also called 'jsonb_populate_record(jsonb_type)' in the GUI, so something seems to be going wrong there. I do not know how to fix it. I've read about people using lateral joins with json_populate_record, but due to my limited SQL knowledge I can't understand or replicate what they are doing.
jsonb_populate_record() returns a single column (which is a record).
If you want to get multiple columns, you need to expand the record:
create table test
as
select (jsonb_populate_record(null::jsonb_type, raw.records)).*
from raw;
A "record" is a data type (that's why you need create type to create one), but one that can contain multiple fields. So if a column in a table (or a result) has that type, the column in turn contains the fields of the record type. The (...).* then expands the fields of that record.
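If you prefer the lateral join style mentioned in the question, a sketch of an equivalent query (same raw table and jsonb_type as above):
-- the function call in FROM yields one expanded record per input row
create table test as
select r.*
from raw
cross join lateral jsonb_populate_record(null::jsonb_type, raw.records) as r;
Either way, test ends up with real time and id columns, so select time from test; then works.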

How do I map tables with n columns to a database?

We are currently using PostgreSQL and now have to save some tables in a database. The tables are never updated once created, but they may be filtered.
The tables are dynamic in nature, as there may be n columns,
so a table would be:
|------|--------|--------|
| NAME | DATA 1 | DATA 2 |
|------|--------|--------|
another table would be:
|------|--------|--------|--------|--------|--------|
| NAME | DATA 1 | DATA 2 | DATA 3 | DATA 4 | DATA 5 |
|------|--------|--------|--------|--------|--------|
The data is not normalized because that hurts when dealing with n rows, as all rows are read at once.
These are the solutions I have come up with:
Save the table as JSON in a JSON Type or HStore pairs.
Save the table as CSV data in a Text Field
What are the alternative methods to store the above data? Can NoSQL databases handle this data?
I see nothing in your question that would keep you from using plain tables with the corresponding number of data columns. That's the most efficient form of storage by far: smallest storage size, fastest queries.
Tables that are "never updated once created, but may be filtered" are hardly "dynamic". Unless you are withholding essential details, that's all there is.
And unless there can be more than several hundred columns. See:
What is the maximum number of columns in a PostgreSQL select query
(But you later commented a maximum of 12, which is no problem at all.)
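A minimal sketch of the plain-table approach for the two example layouts (the data types are an assumption; use whatever matches your values):
-- one plain table per layout, with exactly the columns it needs
CREATE TABLE table_a (
    name  text PRIMARY KEY,
    data1 numeric,
    data2 numeric
);

CREATE TABLE table_b (
    name  text PRIMARY KEY,
    data1 numeric,
    data2 numeric,
    data3 numeric,
    data4 numeric,
    data5 numeric
);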
From what you've described, it sounds like a job for jsonb. Assuming name is unique within a given sub-table, I can imagine something like this:
create table test (
    tableId integer,
    name text,
    data jsonb,
    constraint pk primary key (tableId, name)
);
insert into test values (1, 'movie1', '{"rating": 10, "name": "test"}');
insert into test values (1, 'movie2', '{"rating": 9, "name": "test2"}');
insert into test values (2, 'book1', '{"rank": 100, "name": "test", "price": 10}');
insert into test values (2, 'book2', '{"rank": 10, "name": "test", "price": 12}');
Basically the idea is to use tableId to identify each sub-table and store rows of the subtables in this one db table.
This opens some possibilities:
create a separate table to store metadata about each sub-table; for example, the schema of each sub-table could be stored there for application-layer validation (a sketch follows after this list)
partial index on large/hot sub-tables: create index test_1_movie_name on test ((data->>'name')) where tableid = 1
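A minimal sketch of such a metadata table (hypothetical names; storing the column names and types as a jsonb map is just one possible convention):
-- one row of metadata per sub-table, keyed by the same tableId used above
create table subtable_meta (
    tableId     integer primary key,
    column_defs jsonb
);
insert into subtable_meta values (1, '{"rating": "integer", "name": "text"}');
insert into subtable_meta values (2, '{"rank": "integer", "name": "text", "price": "numeric"}');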
Dynamic columns suggest that a schemaless store is the option to look for; MongoDB is preferred. Are we storing the data as JSON? If so, Mongo will help: manipulating, extracting and reporting on the data becomes easier.
If you are not familiar with NoSQL: from SQL Server 2016 onwards, JSON storage in a column is supported as varchar(MAX), and SQL Server provides functions to deal with JSON data. Even though such a column only gets a text-based index by default, SQL Server supports indexing on computed columns, which helps with lookups on JSON elements. Any number of non-clustered indexes on computed columns is allowed, which eases indexing of JSON data.
SQL Server 2019 has more support for JSON.
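A sketch of that computed-column pattern in T-SQL (hypothetical table and column names, assuming the JSON sits in a column called data):
-- expose one JSON property as a computed column, then index it
ALTER TABLE dbo.test
    ADD json_name AS JSON_VALUE(data, '$.name');

CREATE NONCLUSTERED INDEX ix_test_json_name ON dbo.test (json_name);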

Select all columns except two in a q/kdb historical database

In the output I want to select all columns except two from a table in a q/kdb historical database.
I tried running the query below, but it does not work on the hdb:
delete colid,coltime from table where date=.z.d-1
but it fails with the error below:
ERROR: 'par
(trying to update a physically partitioned table)
I referred to https://code.kx.com/wiki/Cookbook/ProgrammingIdioms#How_do_I_select_all_the_columns_of_a_table_except_one.3F but it did not help.
How can we display all columns except for two in kdb historical database?
The reason you are getting the 'par error is that the table is partitioned.
The error is documented here
trying to update a partitioned table
You cannot directly update or delete anything on a partitioned table (there is a separate DB maintenance script for that).
The query you have used as a fix works because it first selects the data into memory (temporarily) and then deletes the columns:
delete colid,coltime from select from table where date=.z.d-1
You can try the following functional form:
c:cols[t] except `p
?[t;enlist(=;`date;2015.01.01);0b;c!c]
You could try a functional select:
?[table;enlist(=;`date;.z.d);0b;{x!x}cols[table]except`colid`coltime]
Here the last argument is a dictionary mapping result column names to the columns to extract, which tells the query what to return. Instead of deleting the columns you specified, this selects all but those two, which is more or less the same query.
To see what the functional form of a query is you can run something like:
parse"select colid,coltime from table where date=.z.d"
And it will output the arguments to the functional select.
You can read more on functional selects at code.kx.com.
Only select queries work on partitioned tables; you resolved that by structuring your query so that it first selects the table into memory and then deletes the columns you do not want.
If you have a large number of columns and don't want to create a bulky select query you could use a functional select.
?[table;();0b;{x!x}((cols table) except `colid`coltime)]
This shows all columns except the given subset. The column clause expects a dictionary, hence the function {x!x} to convert the list of column names into a dictionary. See more information here:
https://code.kx.com/q/ref/funsql/
As nyi mentioned, if you want to permanently delete columns from an historical database you can use the deleteCol function in the dbmaint tools https://github.com/KxSystems/kdb/blob/master/utils/dbmaint.md

Redshift COPY csv array field to separate rows

I have a relatively large MongoDB collection that I'm migrating into Redshift. It's ~600 million documents, so I want the copy to be as efficient as possible.
The problem is, I have an array field in my Mongo collection, but I'd like to insert each value from the array into separate rows in Redshift.
Mongo:
{
    id: 123,
    names: ["market", "fashion", "food"]
}
In Redshift, I want columns for "id" and "names", where the primary key is (id, name). So I should get 3 new Redshift rows from that one mongo document.
Is it possible to do that with a Redshift COPY command? I can export my data as either a csv or json into s3, but I don't want to have to do any additional processing on the data due to how long it takes to do that many documents.
You can probably do it on COPY with triggers, but it'd be quite awkward and the performance would be miserable (since you can't just transform the row and would need to do INSERTs from the trigger function).
It's a trivial transform, though; why not just pass it through any scripting language on export?
You can also import as-is, and transform afterwards (should be pretty fast on Redshift):
CREATE TABLE mydata_load (
    id int4,
    names text[]
);
-- do the copy
CREATE TABLE mydata AS SELECT id, unnest(names) as name FROM mydata_load;
Redshift does not have support for Arrays as PostgreSQL does, so you cannot just insert the data as is.
However, MongoDB has a simple aggregation stage, $unwind, which unwinds arrays exactly as you want, keeping the other fields alongside each array element. So I'd export the result of that as JSON, and then load it into Redshift using a JSONPaths file.
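Once the export is flat (one id/name pair per document), the load itself is a plain COPY with a JSONPaths file. A sketch with hypothetical bucket paths and IAM role (mydata is the target table from the answer above):
-- jsonpaths.json in S3 would contain: {"jsonpaths": ["$.id", "$.names"]}
-- its entries map by position to the columns listed in the COPY
COPY mydata (id, name)
FROM 's3://my-bucket/unwound/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS JSON 's3://my-bucket/jsonpaths.json';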