How do I map tables with n columns to a database? - postgresql

We are currently using PostgreSQL and now have to save some tables in a database. The tables are never updated once created, but they may be filtered.
The tables are dynamic in nature, as each may have n columns.
So one table would be:
|------|--------|--------|
| NAME | DATA 1 | DATA 2 |
|------|--------|--------|
another table would be:
|------|--------|--------|--------|--------|--------|
| NAME | DATA 1 | DATA 2 | DATA 3 | DATA 4 | DATA 5 |
|------|--------|--------|--------|--------|--------|
The data is not normalized, because normalization hurts here: all rows are read at once whenever a table is used.
These are the solutions I have come up with:
Save the table as JSON in a json column, or as hstore key/value pairs.
Save the table as CSV data in a text column.
What are the alternative methods to store the above data? Can NoSQL databases handle this data?

I see nothing in your question that would keep you from using plain tables with the corresponding number of data columns. That's the most efficient form of storage by far: smallest storage size, fastest queries.
Tables that are "never updated once created, but may be filtered" are hardly "dynamic". Unless you are withholding essential details, that's all there is to it.
And unless there can be more than several hundred columns. See:
What is the maximum number of columns in a PostgreSQL select query
(But you later commented that the maximum is 12 columns, which is no problem at all.)
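For example, each dataset simply gets its own plain table with as many data columns as it needs. A minimal sketch (table and column names are placeholders, not from the question):
CREATE TABLE dataset_a (
    name  text PRIMARY KEY,
    data1 numeric,
    data2 numeric
);

CREATE TABLE dataset_b (
    name  text PRIMARY KEY,
    data1 numeric,
    data2 numeric,
    data3 numeric,
    data4 numeric,
    data5 numeric
);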

From what you've described, it sounds like a job for jsonb. Assuming name is unique within a given table, I can imagine something like this:
create table test (
    tableid integer,
    name text,
    data jsonb,
    constraint pk primary key (tableid, name)
);
insert into test values (1, 'movie1', '{"rating": 10, "name": "test"}');
insert into test values (1, 'movie2', '{"rating": 9, "name": "test2"}');
insert into test values (2, 'book1', '{"rank": 100, "name": "test", "price": 10}');
insert into test values (2, 'book2', '{"rank": 10, "name": "test", "price": 12}');
Basically the idea is to use tableid to identify each sub-table and store the rows of all sub-tables in this one database table.
This opens up some possibilities:
Create a separate table to store metadata about each sub-table. For example, the schema of each sub-table could be stored there for application-layer validation (see the sketch below).
Create a partial index on large/hot sub-tables: create index test_1_movie_name on test ((data->>'name')) where tableid = 1
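As a rough sketch (the column choices are assumptions, not part of the original answer), the metadata table and a query that can use the partial index might look like this:
create table test_meta (
    tableid integer primary key,
    table_name text,
    json_schema jsonb  -- expected shape of data, validated in the application layer
);

-- the filter matches the partial index's WHERE clause, so the index can be used
select name, data
from test
where tableid = 1
  and data->>'name' = 'test';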

Dynamic columns suggest a schema-less design, so a NoSQL database is an option to look at; MongoDB is a common choice. If you are storing the data as JSON, MongoDB makes manipulating, extracting and reporting on that data easier.
If you are not familiar with NoSQL: SQL Server 2016 and later support storing JSON in an ordinary nvarchar(max) column, and SQL Server provides built-in functions for working with JSON data. Even though the column is just text, SQL Server supports indexes on computed columns, which helps with lookups of individual JSON elements. Any number of non-clustered indexes on computed columns is allowed, which makes indexing JSON data manageable.
SQL Server 2019 adds further JSON support.
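A minimal sketch of that computed-column pattern in SQL Server 2016+ (table and column names here are hypothetical, not from the question):
-- JSON stored as plain text, optionally validated with ISJSON
CREATE TABLE dbo.Movies (
    Id   int IDENTITY PRIMARY KEY,
    Data nvarchar(max) CHECK (ISJSON(Data) = 1)
);

-- computed column exposing one JSON element...
ALTER TABLE dbo.Movies
    ADD MovieName AS JSON_VALUE(Data, '$.name');

-- ...backed by a non-clustered index for fast lookups
CREATE INDEX IX_Movies_MovieName ON dbo.Movies (MovieName);

-- filters written with the same expression can use the index
SELECT Id, Data
FROM dbo.Movies
WHERE JSON_VALUE(Data, '$.name') = 'test';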

Related

Extract fields from Postgres jsonb

I'm trying to find an efficient way to extract specific fields from a Postgres jsonb column.
CREATE TABLE foo (
id integer,
data jsonb
)
"data" contains a row with:
{
"firstname": "bob",
"lastname": "smith,
"tags": ["tag0","tag1"]
}
I want to extract a large number of fields from the data column. This select statement works, but it's cumbersome with large numbers of fields, yields really long SQL statements, and also I don't know if it is traversing the jsonb repeatedly for each column:
SELECT data->>'firstname', data->'tags' FROM foo
I tried this:
SELECT jsonb_path_query(data, '$.[firstname,tags]') FROM foo
but it gave an error message: syntax error, unexpected '['. This syntax is, in fact, valid JSONPath per https://jsonpath.com/, but it appears that Postgres doesn't implement it.
Is there a way to extract jsonb fields efficiently, both in terms of execution speed and compactness of the SQL query command?
Yes, your query will read the complete data column for all rows of foo.
Even if you normalize the data model and turn your JSON attributes into regular columns, it will read the table row by row, but then your query becomes cheaper if you only access the first couple of columns in the table.
What you are looking for is a column store, but PostgreSQL doesn't have that built in.
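For illustration, the normalized model mentioned above might look roughly like this (a sketch only; the column types are assumptions based on the sample document):
CREATE TABLE foo_normalized (
    id        integer PRIMARY KEY,
    firstname text,
    lastname  text,
    tags      text[]  -- the JSON array mapped to a PostgreSQL array
);

-- selecting a few leading columns is then cheaper than parsing the full jsonb value
SELECT firstname, tags FROM foo_normalized;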

Best performance method for getting records by large collection of IDs

I am writing a query with code to select all records from a table where a column value is contained in a CSV. I found a suggestion that the best way to do this is to use the ARRAY functionality in PostgreSQL.
I have a table price_mapping and it has a primary key of id and a column customer_id of type bigint.
I want to return all records that have a customer ID in the array I will generate from csv.
I tried this:
select * from price_mapping
where ARRAY[customer_id] <@ ARRAY[5,7,10]::bigint[]
(the 5,7,10 part would actually be a csv inserted by my app)
But I am not sure that is right. In the application the array could contain tens of thousands of IDs, so I want to make sure I am using the method with the best performance.
Is this the right way in PostgreSQL to retrieve a large collection of records by a pre-defined set of column values?
Thanks
Generally this is done with the SQL standard in operator.
select *
from price_mapping
where customer_id in (5,7,10)
I don't see any reason using ARRAY would be faster. It might be slower given it has to build arrays, though it might have been optimized.
In the past this was more optimal:
select *
from price_mapping
where customer_id = ANY(VALUES (5), (7), (10))
But new-ish versions of Postgres should optimize this for you.
Passing in tens of thousands of IDs might run up against a query size limit either in Postgres or your database driver, so you may wish to batch this a few thousand at a time.
As for the best performance, the answer is to not search for tens of thousands of IDs. Find something which relates them together, index that column, and search by that.
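If the IDs do arrive from the application as a CSV string, one way to keep the statement small is to pass the whole list as a single parameter and expand it server-side. A sketch (the :csv placeholder is hypothetical; use whatever bind syntax your driver provides):
-- :csv is a text parameter such as '5,7,10' supplied by the application
select *
from price_mapping
where customer_id = any (string_to_array(:csv, ',')::bigint[]);

-- or bind a bigint[] array directly from the driver:
-- select * from price_mapping where customer_id = any ($1::bigint[]);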
If your data is big enough, try this:
Read your CSV using an FDW (foreign data wrapper).
If you need this data often, you might build a materialized view from it, holding only the needed columns, and refresh it whenever a new CSV is created.
Join your table against this foreign table or materialized view, as sketched below.
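A rough sketch with file_fdw (the file path and column list are assumptions; the CSV must be readable by the database server):
CREATE EXTENSION IF NOT EXISTS file_fdw;
CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;

CREATE FOREIGN TABLE customer_ids_csv (
    customer_id bigint
) SERVER csv_files
  OPTIONS (filename '/path/to/ids.csv', format 'csv');

-- optionally materialize it and refresh when a new CSV arrives
CREATE MATERIALIZED VIEW customer_ids AS
    SELECT customer_id FROM customer_ids_csv;
-- REFRESH MATERIALIZED VIEW customer_ids;

SELECT p.*
FROM price_mapping p
JOIN customer_ids c ON c.customer_id = p.customer_id;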

SQL Server DB Design - Single table with 150 Columns in one table or dynamic Pivot

I'm recreating a DB, and I have a table with 150 columns that currently has 700 rows (a small dataset). It will likely take 10 more years to get to 1,000 rows.
My question:
Most of my data is normalized. About 125 fields contain a single numeric value (hours, currency, decimals, and integers). There are 10 or so columns that can have multiple values.
Do I continue to use the single table with 150 columns?
Or
Do I create cross-reference tables and use a pivot query to turn my rows into columns? Something like this:
**c_FieldNames**
id int identity (PK)
FieldName nvarchar
Decimals nvarchar(2)

**cx_FieldValues**
id int identity(1,1)
fkProjectID int
FieldNameID int (FK to c_FieldNames.id)
FieldValue numeric(16,2)

**Project**
ProjID int (PK)
ProjectName
The Decimals column would tell me how many decimal places a given field needs; I'd like to incorporate that into my query, though I'm not sure if that's possible.
For each of my 125 numeric fields, I would create a row in the c_FieldNames table, which would get an ID. That ID would be used in FieldNameID as a foreign key.
I would then create a view with a pivot query that turns those 125 rows back into columns dynamically, alongside my standard columns, so the result looks like the table with 150 columns.
I'm pretty sure I will be able to use a pivot query to turn my rows into columns (dynamically display rows as columns), roughly as sketched below.
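For illustration only, with placeholder field names (a real version would build the IN list dynamically from c_FieldNames):
SELECT ProjID, [Hours], [Budget]
FROM (
    SELECT v.fkProjectID AS ProjID, n.FieldName, v.FieldValue
    FROM cx_FieldValues v
    JOIN c_FieldNames n ON n.id = v.FieldNameID
) AS src
PIVOT (
    MAX(FieldValue) FOR FieldName IN ([Hours], [Budget])
) AS p;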
Benefits:
I could create a table for reports that would have all the "columns" I need for that report and then filter to them and just pull those fields dynamically.
Reports
ReportID int
FieldID int
The FieldIDs would be based on the c_FieldNames IDs, and I could turn all required field names (which are stored as rows) into headers and run the vast majority of reports with dynamic SQL generated from those field names. The same applies to all the structured data... [Edit from Author] The more I think about this, I could do this with either table structure, which negates the benefit I saw here; I am adding complexity for no good reason, as pointed out in the comments.
My thought is that it will save me a lot of development time, as I can use a pivot query to generate reports and pull data on the fly without much trouble. Updating data will be a bit of a chore, but not much more than normal. I am creating a C#.NET website with Visual Studio (hosted on Azure) to allow users to view and update the data and run reports on it. Are there any major drawbacks to this structure? Is this a good idea? Are 125 columns in a pivot too many? Thanks in advance!

Redshift COPY csv array field to separate rows

I have a relatively large MongoDB collection that I'm migrating into Redshift. It's ~600 million documents, so I want the copy to be as efficient as possible.
The problem is that I have an array field in my Mongo collection, and I'd like to insert each value from the array into a separate row in Redshift.
Mongo:
{
id: 123,
names: ["market", "fashion", "food"]
}
In Redshift, I want columns for "id" and "names", where the primary key is (id, name). So I should get 3 new Redshift rows from that one mongo document.
Is it possible to do that with a Redshift COPY command? I can export my data as either a csv or json into s3, but I don't want to have to do any additional processing on the data due to how long it takes to do that many documents.
You can probably do it on COPY with triggers, but it would be quite awkward and the performance would be miserable (since you can't just transform the row and would need to do INSERTs from the trigger function).
It's a trivial transform, though; why not just pass the data through any scripting language on export?
You can also import as-is, and transform afterwards (should be pretty fast on Redshift):
CREATE TABLE mydata_load (
id int4,
names text[]
);
Do the COPY, then transform:
CREATE TABLE mydata AS SELECT id, unnest(names) as name FROM mydata_load;
Redshift does not support arrays the way PostgreSQL does, so you cannot just insert the data as is.
However, MongoDB's aggregation framework has a simple stage ($unwind) which lets you unwind arrays exactly as you want, repeating the other fields for each array element. So I'd export the result of that into JSON, and then load it into Redshift using a JSONPaths file.
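The load step could then look roughly like this (the bucket paths and IAM role are placeholders, and the jsonpaths layout assumes the unwound documents carry id and name fields):
-- target table: one row per (id, name) pair
CREATE TABLE mydata (
    id   int8,
    name varchar(256)
);

-- load the pre-unwound JSON documents from S3 via a JSONPaths file
COPY mydata
FROM 's3://my-bucket/unwound/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
JSON 's3://my-bucket/jsonpaths.json';

-- where jsonpaths.json maps document fields to table columns, e.g.:
-- { "jsonpaths": ["$.id", "$.name"] }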

Redshift select * vs select single column

I'm having the following Redshift performance issue:
I have a table with ~ 2 billion rows, which has ~100 varchar columns and one int8 column (intCol). The table is relatively sparse, although there are columns which have values in each row.
The following query:
select colA from tableA where intCol = '111111';
returns approximately 30 rows and runs relatively quickly (~2 mins)
However, the query:
select * from tableA where intCol = '111111';
takes an undetermined amount of time (gave up after 60 mins).
I know pruning the columns in the projection is usually better but this application needs the full row.
Questions:
Is this just a fundamentally bad thing to do in Redshift?
If not, why is this particular query taking so long? Is it related to the structure of the table somehow? Is there some Redshift knob to tweak to make it faster? I haven't yet messed with the distkey and sortkey on the table, but it's not clear that those should matter in this case.
The main reason the first query is faster is that Redshift is a columnar database. A columnar database stores table data per column, writing the data for a given column into the same blocks on storage. This is different from a row-based database like MySQL or PostgreSQL. Because of this, since the first query selects only the colA column, Redshift does not need to touch the other columns at all, while the second query reads every column, causing a huge amount of disk access.
To improve the performance of the second query, you may want to set the sort key to the intCol column (the one used in the filter). When a column is the sort key, the table data is stored on disk sorted by that column, which reduces the disk access needed to fetch records with a condition on that column.
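A sketch of what that could look like (a reduced version of tableA for illustration; the real table has ~100 varchar columns):
CREATE TABLE tableA (
    intCol int8,
    colA   varchar(256)
)
SORTKEY (intCol);

-- on an existing table, the sort key can also be changed in place:
-- ALTER TABLE tableA ALTER SORTKEY (intCol);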