create JSONB array grouped from column values with incrementing integers - postgresql

For a PostgreSQL table, suppose the following data is in table A:
key_path        | key     | value
----------------+---------+------------
foo[1]__scrog   | scrog   | apple
foo[2]__scrog   | scrog   | orange
bar             | bar     | peach
baz[1]__biscuit | biscuit | watermelon
The goal is to group rows whose key_path values are identical except for an incrementing index.
For context, key_path is a JSON key path and key is the leaf key. The desired outcome would be:
key_path_group                 | key     | values
-------------------------------+---------+-----------------
[foo[1]__scrog, foo[2]__scrog] | scrog   | [apple, orange]
bar                            | bar     | peach
[baz[1]__biscuit]              | biscuit | [watermelon]
Note also that baz[1]__biscuit, even though it has only a single indexed entry, is still turned into an array of length 1.
Any tips or suggestions much appreciated!

May have answered my own question (sometimes just typing it out helps). The following gets very close, if not exactly, what I'm looking for:
select
  regexp_replace(key_path, '(.*)\[(\d+)\](.*)', '\1[x]\3') as key_path_group,
  key,
  jsonb_agg(value) as "values"   -- "values" is a reserved word, so quote it
from A
group by key_path_group, key;
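If the original key paths should also come back as an array (as in the desired output above), one possible refinement is the following sketch against the same table A. It aggregates key_path as well, and only array-wraps groups that actually carry an index:
-- Sketch: indexed key_paths are grouped and aggregated; non-indexed rows stay scalar.
select
  case when key_path ~ '\[\d+\]'
       then to_jsonb(array_agg(key_path order by key_path))
       else to_jsonb(min(key_path))
  end as key_path_group,
  key,
  case when key_path ~ '\[\d+\]'
       then jsonb_agg(value order by key_path)
       else to_jsonb(min(value))
  end as "values"
from A
group by regexp_replace(key_path, '\[\d+\]', '[x]'), key, key_path ~ '\[\d+\]';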

Related

Excel: Select the newest date from a list that contains multiple rows with the same ID

In Excel, I have a list with multiple rows of the same ID (column A), each with various dates recorded (Column B). I need to extract one row for each ID that contains the newest date. See below for example:
| Column A | Column B   |
| (ID)     | (Date)     |
|----------|------------|
| 00001    | 01/01/2022 |
| 00001    | 02/01/2022 |
| 00001    | 03/01/2022 | <-- I Need this one
| 00002    | 01/02/2022 |
| 00002    | 02/02/2022 |
| 00002    | 03/02/2022 | <-- I Need this one
| 00003    | 01/03/2022 |
| 00003    | 02/03/2022 |
| 00003    | 03/03/2022 | <-- I Need this one
| 00004    | 01/04/2022 |
| 00004    | 02/04/2022 |
| 00004    | 03/04/2022 | <-- I Need this one
| 00005    | 01/05/2022 |
| 00005    | 02/05/2022 |
| 00005    | 03/05/2022 | <-- I Need this one
I need to extract the above rows, where the row with the newest date is extracted for each unique ID. It needs to look like this:
| Column A | Column B   |
| (ID)     | (Date)     |
|----------|------------|
| 00001    | 03/01/2022 |
| 00002    | 03/02/2022 |
| 00003    | 03/03/2022 |
| 00004    | 03/04/2022 |
| 00005    | 03/05/2022 |
I'm totally stumped and I can't seem to find the right answer (probably because of how I'm wording the question!)
Thank you!
Google searches for the answer - no joy. I don't know where to start in Excel with this; I thought perhaps DISTINCT or something similar...
Assuming you have an Office 365 compatible version of Excel, you could do something like this (screenshot refers):
=INDEX(SORTBY(A2:B11,B2#,-1),SEQUENCE(1,1,1,1),SEQUENCE(1,2,1,1))
Part of this formula is superfluous, albeit convenient - you don't really require the first SEQUENCE (there's only one row being returned). However, as you can see in the screenshot, the self-same formula with a leading 2 in the first argument of that SEQUENCE returns the top two dates (in descending order), and so forth.
For those with Office 365, you could also do something like this:
=LARGE(B2#+(ROW(B2#)-ROW(B2))/1000,1)
i.e. adding a "little bit" to the dates that we can subtract later and use as a unique reference (the row number in the original, unsorted list).
As mentioned, reverse engineer that fraction, throw it into an INDEX, and voila:
=INDEX(A2:A11,ROUND((H2-ROUND(H2,0))*1000,6))
Caveats:
the ROUND(<>,6) is purely to work around Excel's irritating floating-point precision issues.
this can also work if you're looking up text strings (i.e. attempting to sort alphabetically), EXCEPT that LARGE doesn't work with strings - no problem, just use UNICODE, but good luck expanding the string out with MID(<>,ROW(A1:OFFSET(A1,LEN(<>)-1)),1)... ☺

Postgres database: how to model multiple attributes that can have also multiple value, and have relations to other two entities

I have three entities, Items, Categories, and Attributes.
An Item can be in one or multiple Categories, so there is an N:M relation.
Item            ItemCategories            Categories
id | name       item_id | category_id     id | name
1  | alfa       1       | 1               1  | chipset
                1       | 2               2  | interface
An Item can have multiple Attributes depending on the 'Categories' they are in.
For example, the items in Category 'chipset' can have as attributes: 'interface', 'memory', 'tech'.
These attributes have a set of predefined values that don't change often, but they can change.
For example: 'memory' can only be ddr2, ddr3, ddr4.
Attributes                             CategoryAttributes
id | name   | values                   category_id | attribute_id
1  | memory | {ddr2, ddr3, ddr4}       1           | 1
An Item that is in the 'chipset' Category has access to the Attribute and can only have Null or the predefined value of the attribute.
I thought to use an Enum or JSON for Attribute values, but I have a few other conditions:
ItemAttributes
item_id | attribute_id | value
1       | 1            | {ddr2, ddr4}
1) If an Attribute appears in two Categories and an Item is in both Categories, the Attribute should be shown only once.
2) I need to rank by the values: if two matching attribute values are present for an item, its rank should be higher than when only one matches or the value is absent.
3) Creating a separate table per Attribute is not an option, because their number is not fixed and can be large.
So I don't know what the best database design options are for constraining the values and for using them in rank ordering.
The problem you are describing is a typical open-schema or vertical database, which is a classic use case for some kind of EAV model.
EAV is a complex yet powerful paradigm that allows a potentially open schema while respecting the database normal forms, and it gives you exactly what you need: a variable number of attributes depending on the specific instance of the same entity.
That is what typically happens in e-commerce on a relational database, since different products have different attributes (e.g. a lipstick has a color, but for a hard drive you don't care about color, you care about capacity). It doesn't make sense to have one column per attribute, because the number is not fixed and can be big, and for most rows there would be a lot of NULL values (that is the mathematical notion of a sparse matrix, which looks very ugly in a DB table).
You can take a look at the Magento DB model, a true reference for pure EAV at scale, or at Wikipedia, but you can probably do that later; for now you just need the basics:
The basic idea is to store attributes and their corresponding values as rows, instead of columns, in a single table.
In the simplest implementation the table has at least three columns: entity (usually a foreign key to an entity, or an entity type/category), attribute (this can be a string, or a foreign key in more complex systems), and value.
In my previous example, oversimplifying, we could have a table like this, which lists attribute names and their values for each item:
Item table                  Attributes table
+----+--------------+       +---------+------------+-------+
| id | name         |       | item_id | attribute  | value |
+----+--------------+       +---------+------------+-------+
| 1  | "hard drive" |       | 2       | "color"    | "red" |
+----+--------------+       +---------+------------+-------+
| 2  | "lipstick"   |       | 2       | "price"    | 10    |
+----+--------------+       +---------+------------+-------+
                            | 1       | "capacity" | "1TB" |
                            +---------+------------+-------+
                            | 1       | "price"    | 200   |
                            +---------+------------+-------+
So for every item, you can have a list of attributes.
Since your model is more complex and has a few more constraints, we need to adapt this basic model:
Since you want to limit the possible values, you will need a table for values.
Since you will have a values table, each value has to refer to an attribute, so attributes need an id, which means you will have an attributes table.
To make it explicit and strict which categories have which attributes, you need a category-attribute table.
With this, you end up with something like:
Categories table
List of categories ids and names
+------+--------------+
| id | name |
+------+--------------+
| 1 | "chipset" |
+------+--------------+
| 2 | "interface" |
+------+--------------+
Attributes table
List of attribute ids and their name
+------+--------------+
| id | name |
+------+--------------+
| 1 | "interface" |
+------+--------------+
| 2 | "memory" |
+------+--------------+
| 3 | "tech" |
+------+--------------+
| 4 | "price" |
+------+--------------+
Category-Attribute table
What category has what attributes. Note that one attribute (e.g. 4) can belong to two categories.
+--------------+--------------+
| attribute_id | category_id |
+--------------+--------------+
| 1 | 1 |
+--------------+--------------+
| 2 | 1 |
+--------------+--------------+
| 3 | 1 |
+--------------+--------------+
| 4 | 1 |
+--------------+--------------+
| 4 | 2 |
+--------------+--------------+
Value table
List of possible values for every attribute
+----------+--------------+----------+
| value_id | attribute_id | value    |
+----------+--------------+----------+
| 1        | 2            | "ddr2"   |
+----------+--------------+----------+
| 2        | 2            | "ddr3"   |
+----------+--------------+----------+
| 3        | 2            | "ddr4"   |
+----------+--------------+----------+
| 4        | 3            | "tech_1" |
+----------+--------------+----------+
| 5        | 3            | "tech_2" |
+----------+--------------+----------+
| 6        | ...          | ...      |
+----------+--------------+----------+
| 7        | ...          | ...      |
+----------+--------------+----------+
And finally, as you can imagine, the Item-Attribute table lists one attribute value per row:
+---------+--------------+-------+
| item_id | attribute_id | value |
+---------+--------------+-------+
| 1       | 2            | 1     |
+---------+--------------+-------+
| 1       | 2            | 3     |
+---------+--------------+-------+
Meaning that item 1, for attribute 2 (`memory`), has the values with ids 1 and 3 (`ddr2` and `ddr4`).
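A minimal PostgreSQL DDL sketch of the layout above might look like the following. All table and column names here are illustrative, not a prescribed schema, and the item and item-category tables from the question are included only so the sketch stands on its own:
-- Illustrative DDL for the EAV layout described above; names are assumptions.
CREATE TABLE item (
    id   bigint PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    name text NOT NULL
);

CREATE TABLE category (
    id   bigint PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    name text NOT NULL UNIQUE
);

CREATE TABLE item_category (
    item_id     bigint NOT NULL REFERENCES item,
    category_id bigint NOT NULL REFERENCES category,
    PRIMARY KEY (item_id, category_id)
);

CREATE TABLE attribute (
    id   bigint PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    name text NOT NULL UNIQUE
);

CREATE TABLE category_attribute (
    category_id  bigint NOT NULL REFERENCES category,
    attribute_id bigint NOT NULL REFERENCES attribute,
    PRIMARY KEY (category_id, attribute_id)
);

-- The "Value table": allowed values per attribute.
CREATE TABLE attribute_value (
    id           bigint PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    attribute_id bigint NOT NULL REFERENCES attribute,
    value        text NOT NULL,
    UNIQUE (attribute_id, value)
);

-- The "Item-Attribute table": one attribute value per row.
CREATE TABLE item_attribute (
    item_id      bigint NOT NULL REFERENCES item,
    attribute_id bigint NOT NULL REFERENCES attribute,
    value_id     bigint NOT NULL REFERENCES attribute_value,
    PRIMARY KEY (item_id, value_id)
);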
This will cover all your conditions:
Number of attributes is unlimited, as big as needed and not fixed
You can define clearly what category has what attributes
Two categories can have the same attribute
If one item belongs to two categories that have the same attribute, you can show it only once (i.e. SELECT * from Category-Attribute where category_id in (SELECT category_id from ItemCategories where item_id = ...), with a DISTINCT on the attribute, will give you the list of eligible attributes, only one of each even if two categories share it); see the sketch after this list
You can do a rank as well. I don't think I have enough info for that exact query, but since this is a fully normalized model you can definitely rank; you have here pretty much the full model, so surely you can figure out the query (a rough illustration also follows right after this list)
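A rough illustration of the last two points, using the table names from the DDL sketch above (query shapes only, since the exact ranking rule isn't fully specified in the question):
-- Eligible attributes for an item: each attribute listed once even if it
-- comes from several of the item's categories.
SELECT DISTINCT ca.attribute_id
FROM item_category ic
JOIN category_attribute ca ON ca.category_id = ic.category_id
WHERE ic.item_id = 1;

-- One possible reading of the ranking requirement: rank items by how many
-- values of a given attribute they carry, so two matches outrank one or none.
SELECT ia.item_id,
       count(*) AS matching_values,
       rank() OVER (ORDER BY count(*) DESC) AS rnk
FROM item_attribute ia
JOIN attribute_value av ON av.id = ia.value_id
WHERE av.attribute_id = 2   -- e.g. the "memory" attribute
GROUP BY ia.item_id;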
This is very similar to the model that Magento uses. It is very powerful but, of course, it can get hard to manage; still, it is the best way if we want to keep the model strict, make sure the constraints are enforced, and keep the full power of SQL available.
For systems less strict, it is always an option to go for a NoSQL database with much more flexible schemas.

Database design with multiple units and multiple attributes

Supposing I have this database design which I have researched.
Table: Products
ProductId | Name     | BaseUnitId
1         | Lab gown | 1
2         | Gloves   | 1
FK: BaseUnitId references Units.UnitId
Table: Units
UnitId | Name
1      | Each / Pieces
2      | Dozen
3      | Box
Table: Unit Conversion
ProdID | BaseUnitID | Factor | ConvertToUnitID
1      | 1          | 12     | 2
2      | 1          | 100    | 3
FK: BaseUnitId references Units.UnitID
FK: ConvertToUnitId references Units.UnitID
Table: Product Attribute
AttribId | Prod_ID | Attribute | Value
1        | 1       | Color     | Blue
2        | 1       | Size      | Large
3        | 2       | Color     | Violet
4        | 2       | Size      | Small
5        | 2       | Size      | Medium
6        | 2       | Size      | Large
7        | 2       | Color     | White
FK: Prod_ID references Product.ProductID
Table: Inventory
Prod_ID | Base Unit Qty | Expiry
1       | 12            | n/a
2       | 100           | 2020-01-01
2       | 100           | 2021-12-31
FK: Prod_ID references Product.ProductID
How can I break down the inventory per unit per attribute?
e.g. How can I get the inventory of SMALL VIOLET GLOVES? LARGE WHITE GLOVES?
Any suggestions? My idea is to create another table which will link product unit, product attribute and quantity.
But I don't know how to link the size attribute and the color attribute to a unit.
Lastly, is there something wrong with this design?
I think it is quite wrong to split off the attributes of a product into a different table. I understand the desire to normalize, but it should be done differently.
I'd handle a product and its attributes like this:
CREATE TABLE product (
    id          bigint PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    name        text NOT NULL,
    baseunit_id bigint NOT NULL REFERENCES unit      -- unit lookup table assumed to exist
);

CREATE TABLE inventory (
    id               bigint PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    product_id       bigint NOT NULL REFERENCES product,
    color            integer REFERENCES product_color,   -- lookup tables for the
    size             integer REFERENCES product_size,    -- common attributes
    other_attributes jsonb                                -- rare attributes
);
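For the snippet above to run on its own, the lookup tables its REFERENCES clauses point at also need to exist; a minimal sketch of them (names inferred from the column definitions, they are not given in the answer) could be:
-- Assumed lookup tables; create them before product and inventory so the
-- foreign key references resolve.
CREATE TABLE unit (
    id   bigint PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    name text NOT NULL
);

CREATE TABLE product_color (
    id   integer PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    name text NOT NULL UNIQUE
);

CREATE TABLE product_size (
    id   integer PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    name text NOT NULL UNIQUE
);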
That also makes sense if you think about it in natural-language terms: “How many dozens of large blue gloves do we have in store?”
Attributes that do not apply to a certain product can be left NULL.
I make a distinction between common and rare attributes. Common attributes get their own column. Rare attributes are bunched together in a jsonb column. I know that the latter is neither normalized nor pretty, but varying attributes are not well suited to a relational model. A GIN index on the column will allow searches to be efficient.
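A sketch of that last point, plus the kind of lookup the question asks for, assuming the tables above and the hypothetical lookup tables sketched earlier (the rare-attribute key is made up):
-- GIN index so containment searches on the rare-attribute jsonb are efficient.
CREATE INDEX inventory_other_attributes_gin
    ON inventory USING gin (other_attributes);

-- Rare-attribute search, e.g. inventory rows tagged as sterile:
SELECT *
FROM inventory
WHERE other_attributes @> '{"sterile": true}';

-- "Inventory of SMALL VIOLET GLOVES" using the common-attribute columns:
SELECT i.*
FROM inventory i
JOIN product p       ON p.id = i.product_id
JOIN product_color c ON c.id = i.color
JOIN product_size s  ON s.id = i.size
WHERE p.name = 'Gloves'
  AND c.name = 'Violet'
  AND s.name = 'Small';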

Sane way to store different data types within same column in postgres?

I'm currently attempting to modify an existing API that interacts with a postgres database. Long story short, it essentially stores descriptors/metadata to determine where an actual 'asset' (typically a file of some sort) is stored on the server's hard disk.
Currently, it's possible to 'tag' these 'assets' with any number of undefined key-value pairs (i.e. uploadedBy, addedOn, assetType, etc.). These tags are stored in a separate table with a structure similar to the following:
+---------------+----------------+-------------+
|assetid (text) | tagid(integer) | value(text) |
|---------------+----------------+-------------|
|someStringValue| 1234 | someValue |
|---------------+----------------+-------------|
|aDiffStringKey | 1235 | a username |
|---------------+----------------+-------------|
|aDiffStrKey | 1236 | Nov 5, 1605 |
+---------------+----------------+-------------+
assetid and tagid are foreign keys from other tables. Think of the assetid representing a file and the tagid/value pair is a map of descriptors.
Right now, the API (which is in Java) creates all these key-value pairs as a Map object. This includes things like timestamps/dates. What we'd like to do is to somehow be able to store different types of data for the value in the key-value pair, or at least store it differently within the database, so that if we needed to, we could run queries checking date ranges and the like on these tags. However, if they're stored as text items in the db, then we'd have to a) know that a given item is actually a date/time/timestamp, and b) convert it into something that we could actually run such a query on.
There is only one idea I could think of thus far without completely changing the layout of the db.
It is to expand the assettag table (shown above) with additional columns for the various types (numeric, text, timestamp), allow them to be null, and then on insert check the corresponding 'key' to figure out what type of data it really is. However, I can see a lot of problems with that sort of implementation.
Can any PostgreSQL-Ninjas out there offer a suggestion on how to approach this problem? I'm only recently getting thrown back into the deep-end of database interactions, so I admit I'm a bit rusty.
You've basically got two choices:
Option 1: A sparse table
Have one column for each data type, but only use the column that matches the data type you want to store. Of course this leads to most columns being null - a waste of space, but the purists like it because of the strong typing. It's a bit clunky having to check each column for null to figure out which data type applies. Also, too bad if you actually want to store a null - then you must choose a specific value that "means null" - more clunkiness.
Option 2: Two columns - one for content, one for type
Everything can be expressed as text, so have a text column for the value, and another column (int or text) for the type, so your app code can restore the value into an object of the correct type. The good things are that you don't have lots of nulls, and, importantly, you can easily extend the types beyond SQL data types to application classes by storing their value as JSON and their type as the class name.
I have used option 2 several times in my career and it was always very successful.
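For what it's worth, a minimal sketch of option 2 against the kind of tag table in the question (the column names are invented; the CASE keeps the cast from ever running on rows of another type):
-- Option 2 sketch: value stored as text plus a type discriminator.
CREATE TABLE asset_tag (
    assetid    text    NOT NULL,
    tagid      integer NOT NULL,
    value      text,
    value_type text    NOT NULL    -- e.g. 'text', 'numeric', 'timestamp'
);

-- Date-range query over tags declared as timestamps.
SELECT assetid
FROM asset_tag
WHERE CASE WHEN value_type = 'timestamp'
           THEN value::timestamptz BETWEEN '2019-01-01' AND '2020-01-01'
           ELSE false
      END;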
Another option, depending on what you're doing, could be to have just one value column but store some JSON around the value...
This could look something like:
{
  "type": "datetime",
  "value": "2019-05-31 13:51:36"
}
That could even go a step further, using a JSON or XML column.
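A sketch of that jsonb-column variant (table, column and tag values are invented for illustration; the CASE avoids casting rows whose declared type isn't datetime):
-- Each tag value wrapped in a small jsonb object carrying its type.
CREATE TABLE asset_tag_json (
    assetid text    NOT NULL,
    tagid   integer NOT NULL,
    value   jsonb   NOT NULL
);

INSERT INTO asset_tag_json (assetid, tagid, value)
VALUES ('aDiffStrKey', 1236,
        jsonb_build_object('type', 'datetime', 'value', '2019-05-31 13:51:36'));

-- Date-range check over tags whose declared type is datetime.
SELECT assetid
FROM asset_tag_json
WHERE CASE WHEN value->>'type' = 'datetime'
           THEN (value->>'value')::timestamp
                BETWEEN '2019-01-01' AND '2020-01-01'
           ELSE false
      END;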
I'm not in any way a PostgreSQL ninja, but I think that instead of two columns (one for the content and one for the type) you could look at the hstore data type:
a data type for storing sets of key/value pairs within a single PostgreSQL value. This can be useful in various scenarios, such as rows with many attributes that are rarely examined, or semi-structured data. Keys and values are simply text strings.
Of course, you have to check how dates/timestamps convert into and out of this type and see if that is good enough for you.
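A small sketch of the hstore idea (the table name is invented; since everything in hstore is text, dates still need an explicit cast when read back):
CREATE EXTENSION IF NOT EXISTS hstore;

-- One hstore column holding all tags for an asset.
CREATE TABLE asset_tags_hstore (
    assetid text PRIMARY KEY,
    tags    hstore
);

INSERT INTO asset_tags_hstore (assetid, tags)
VALUES ('someStringValue',
        'uploadedBy => "a username", addedOn => "2019-05-31"');

-- Pull a tag back out and cast it; the ? operator checks that the key exists.
SELECT assetid, (tags -> 'addedOn')::date AS added_on
FROM asset_tags_hstore
WHERE tags ? 'addedOn';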
You can use two different techniques:
If the value type can vary for a given tagid:
Define the target table name and row ID for every tagid-assetid combination, plus the actual data tables:
maintable:
+---------------+----------------+-----------------+---------------+
|assetid (text) | tagid(integer) | tablename(text) | table_id(int) |
|---------------+----------------+-----------------+---------------|
|someStringValue| 1234 | tablebool | 123 |
|---------------+----------------+-----------------+---------------|
|aDiffStringKey | 1235 | tablefloat | 123 |
|---------------+----------------+-----------------+---------------|
|aDiffStrKey | 1236 | tablestring | 123 |
+---------------+----------------+-----------------+---------------+
tablebool
+-------------+-------------+
| id(integer) | value(bool) |
|-------------+-------------|
| 123 | False |
+-------------+-------------+
tablefloat
+-------------+--------------+
| id(integer) | value(float) |
|-------------+--------------|
| 123 | 12.345 |
+-------------+--------------+
tablestring
+-------------+---------------+
| id(integer) | value(string) |
|-------------+---------------|
| 123 | 'text' |
+-------------+---------------+
In case every tagid has a fixed type:
Create a tagid description table
tag descriptors
+---------------+----------------+-----------------+
|assetid (text) | tagid(integer) | tablename(text) |
|---------------+----------------+-----------------|
|someStringValue| 1234 | tablebool |
|---------------+----------------+-----------------|
|aDiffStringKey | 1235 | tablefloat |
|---------------+----------------+-----------------|
|aDiffStrKey | 1236 | tablestring |
+---------------+----------------+-----------------+
and the corresponding data tables
tablebool
+-------------+----------------+-------------+
| id(integer) | tagid(integer) | value(bool) |
|-------------+----------------+-------------|
| 123 | 1234 | False |
+-------------+----------------+-------------+
tablefloat
+-------------+----------------+--------------+
| id(integer) | tagid(integer) | value(float) |
|-------------+----------------+--------------|
| 123 | 1235 | 12.345 |
+-------------+----------------+--------------+
tablestring
+-------------+----------------+---------------+
| id(integer) | tagid(integer) | value(string) |
|-------------+----------------+---------------|
| 123 | 1236 | 'text' |
+-------------+----------------+---------------+
All this is just the general idea; you should adapt it to your needs.
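As a sketch of how the first technique might be read back out (table and column names follow the ASCII tables above; this is only an illustration):
-- Resolve each tag's value from whichever typed table the maintable points at.
SELECT m.assetid,
       m.tagid,
       COALESCE(b.value::text, f.value::text, s.value) AS value_as_text
FROM maintable m
LEFT JOIN tablebool   b ON m.tablename = 'tablebool'   AND b.id = m.table_id
LEFT JOIN tablefloat  f ON m.tablename = 'tablefloat'  AND f.id = m.table_id
LEFT JOIN tablestring s ON m.tablename = 'tablestring' AND s.id = m.table_id;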

Query join result appears to be incorrect

I have no idea what's going on here. Maybe I've been staring at this code for too long.
The query I have is as follows:
CREATE VIEW v_sku_best_before AS
SELECT
sw.sku_id,
sw.sku_warehouse_id "A",
sbb.sku_warehouse_id "B",
sbb.best_before,
sbb.quantity
FROM SKU_WAREHOUSE sw
LEFT OUTER JOIN SKU_BEST_BEFORE sbb
ON sbb.sku_warehouse_id = sw.warehouse_id
ORDER BY sbb.best_before
I can post the table definitions if that helps, but I'm not sure it will. Suffice to say that SKU_WAREHOUSE.sku_warehouse_id is an identity column, and SKU_BEST_BEFORE.sku_warehouse_id is a child that uses that identity as a foreign key.
Here's the result when I run the query:
+--------+-----+----+-------------+----------+
| sku_id | A | B | best_before | quantity |
+--------+-----+----+-------------+----------+
| 20251 | 643 | 11 | <<null>> | 140 |
+--------+-----+----+-------------+----------+
(1 row)
The join specifies that the sku_warehouse_id columns have to be equal, but when I pull the ID from each table (labelled as A and B) they're different.
What am I doing wrong?
Perhaps just sw.sku_warehouse_id instead of sw.warehouse_id?
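In other words, a sketch of the view with just that one change applied (everything else as in the question):
CREATE VIEW v_sku_best_before AS
SELECT
    sw.sku_id,
    sw.sku_warehouse_id "A",
    sbb.sku_warehouse_id "B",
    sbb.best_before,
    sbb.quantity
FROM SKU_WAREHOUSE sw
LEFT OUTER JOIN SKU_BEST_BEFORE sbb
    ON sbb.sku_warehouse_id = sw.sku_warehouse_id   -- was sw.warehouse_id
ORDER BY sbb.best_before;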