Query on multiple postgres hstores combined with or - postgresql

This is a hardcoded example of what I'm trying to achieve:
SELECT id FROM places
WHERE metadata->'route'='Route 23'
OR metadata->'route'='Route 22'
OR metadata->'region'='Northwest'
OR metadata->'territory'='Territory A';
Metadata column is an hstore column and I'm wanting to build up the WHERE clause dynamically based on another query from a different table. The table could either be:
id | metadata
---------+----------------------------
1647 | "region"=>"Northwest"
1648 | "route"=>"Route 23"
1649 | "route"=>"Route 22"
1650 | "territory"=>"Territory A"
or
id | key | value
----+-------------+-------+---
1 | route | Route 23
2 | route | Route 22
3 | region | Northwest
4 | territory | Territory A
Doesnt really matter, just whatever works to build up that where clause. It could potentially have 1 to n number of OR's in it based on the other query.

Ended up with a solution using the 2nd table (distribution table):
id | metadata
---------+----------------------------
1647 | "region"=>"Northwest"
1648 | "route"=>"Route 23"
1649 | "route"=>"Route 22"
1650 | "territory"=>"Territory A"
Used the following join, which the #> sees if the places.metadata contains the distributions.metadata
SELECT places.id, places.metadata
FROM places INNER JOIN distributions
ON places.metadata #> distributions.metadata
WHERE distributions.some_other_column = something;

Related

What exactly is a wide column store?

Googling for a definition either returns results for a column oriented DB or gives very vague definitions.
My understanding is that wide column stores consist of column families which consist of rows and columns. Each row within said family is stored together on disk. This sounds like how row oriented databases store their data. Which brings me to my first question:
How are wide column stores different from a regular relational DB table? This is the way I see it:
* column family -> table
* column family column -> table column
* column family row -> table row
This image from Database Internals simply looks like two regular tables:
The guess I have as to what is different comes from the fact that "multi-dimensional map" is mentioned along side wide column stores. So here is my second question:
Are wide column stores sorted from left to right? Meaning, in the above example, are the rows sorted first by Row Key, then by Timestamp, and finally by Qualifier?
Let's start with the definition of a wide column database.
Its architecture uses (a) persistent, sparse matrix, multi-dimensional
mapping (row-value, column-value, and timestamp) in a tabular format
meant for massive scalability (over and above the petabyte scale).
A relational database is designed to maintain the relationship between the entity and the columns that describe the entity. A good example is a Customer table. The columns hold values describing the Customer's name, address, and contact information. All of this information is the same for each and every customer.
A wide column database is one type of NoSQL database.
Maybe this is a better image of four wide column databases.
My understanding is that the first image at the top, the Column model, is what we called an entity/attribute/value table. It's an attribute/value table within a particular entity (column).
For Customer information, the first wide-area database example might look like this.
Customer ID Attribute Value
----------- --------- ---------------
100001 name John Smith
100001 address 1 10 Victory Lane
100001 address 3 Pittsburgh, PA 15120
Yes, we could have modeled this for a relational database. The power of the attribute/value table comes with the more unusual attributes.
Customer ID Attribute Value
----------- --------- ---------------
100001 fav color blue
100001 fav shirt golf shirt
Any attribute that a marketer can dream up can be captured and stored in an attribute/value table. Different customers can have different attributes.
The Super Column model keeps the same information in a different format.
Customer ID: 100001
Attribute Value
--------- --------------
fav color blue
fav shirt golf shirt
You can have as many Super Column models as you have entities. They can be in separate NoSQL tables or put together as a Super Column family.
The Column Family and Super Column family simply gives a row id to the first two models in the picture for quicker retrieval of information.
Most (if not all) Wide-column stores are indeed row-oriented stores in that every parts of a record are stored together. You can see that as a 2-dimensional key-value store. The first part of the key is used to distribute the data across servers, the second part of the key lets you quickly find the data on the target server.
Wide-column stores will have different features and behaviors. However, Apache Cassandra, for example, allows you to define how the data will be sorted. Take this table for example:
| id | country | timestamp | message |
|----+---------+------------+---------|
| 1 | US | 2020-10-01 | "a..." |
| 1 | JP | 2020-11-01 | "b..." |
| 1 | US | 2020-09-01 | "c..." |
| 2 | CA | 2020-10-01 | "d..." |
| 2 | CA | 2019-10-01 | "e..." |
| 2 | CA | 2020-11-01 | "f..." |
| 3 | GB | 2020-09-01 | "g..." |
| 3 | GB | 2020-09-02 | "h..." |
|----+---------+------------+---------|
If your partitioning key is (id) and your clustering key is (country, timestamp), the data will be stored like this:
[Key 1]
1:JP,2020-11-01,"b..." | 1:US,2020-09-01,"c..." | 1:US,2020-10-01,"a..."
[Key2]
2:CA,2019-10-01,"e..." | 2:CA,2020-10-01,"d..." | 2:CA,2020-11-01,"f..."
[Key3]
3:GB,2020-09-01,"g..." | 3:GB,2020-09-02,"h..."
Or in table form:
| id | country | timestamp | message |
|----+---------+------------+---------|
| 1 | JP | 2020-11-01 | "b..." |
| 1 | US | 2020-09-01 | "c..." |
| 1 | US | 2020-10-01 | "a..." |
| 2 | CA | 2019-10-01 | "e..." |
| 2 | CA | 2020-10-01 | "d..." |
| 2 | CA | 2020-11-01 | "f..." |
| 3 | GB | 2020-09-01 | "g..." |
| 3 | GB | 2020-09-02 | "h..." |
|----+---------+------------+---------|
If you change the primary key (composite of partitioning and clustering key) to (id, timestamp) WITH CLUSTERING ORDER BY (timestamp DESC) (id is the partitioning key, timestamp is the clustering key in descending order), the result would be:
[Key 1]
1:US,2020-09-01,"c..." | 1:US,2020-10-01,"a..." | 1:JP,2020-11-01,"b..."
[Key2]
2:CA,2019-10-01,"e..." | 2:CA,2020-10-01,"d..." | 2:CA,2020-11-01,"f..."
[Key3]
3:GB,2020-09-01,"g..." | 3:GB,2020-09-02,"h..."
Or in table form:
| id | country | timestamp | message |
|----+---------+------------+---------|
| 1 | US | 2020-09-01 | "c..." |
| 1 | US | 2020-10-01 | "a..." |
| 1 | JP | 2020-11-01 | "b..." |
| 2 | CA | 2019-10-01 | "e..." |
| 2 | CA | 2020-10-01 | "d..." |
| 2 | CA | 2020-11-01 | "f..." |
| 3 | GB | 2020-09-01 | "g..." |
| 3 | GB | 2020-09-02 | "h..." |
|----+---------+------------+---------|

Selecting value for the latest two distinct columns

I am trying to do an SQL which will return the latest data value of the two distinct columns of my table.
Currently, I select distinct the values of the column and afterwards, I iterate through the columns to get the distinct values selected before then order and limit to 1. These tags can be any number and may not always be posted together (one time only tag 1 can be posted; whereas other times 1, 2, 3 can).
Although it gives the expected outcome, this seems to be inefficient in a lot of ways, and because I don't have enough SQL experience, this was so far the only way I found of performing the task...
--------------------------------------------------
| name | tag | timestamp | data |
--------------------------------------------------
| aa | 1 | 566 | 4659 |
--------------------------------------------------
| ab | 2 | 567 | 4879 |
--------------------------------------------------
| ac | 3 | 568 | 1346 |
--------------------------------------------------
| ad | 1 | 789 | 3164 |
--------------------------------------------------
| ae | 2 | 789 | 1024 |
--------------------------------------------------
| af | 3 | 790 | 3346 |
--------------------------------------------------
Therefore the expected outcome is {3164, 1024, 3346}
Currently what I'm doing is:
"select distinct tag from table"
Then I store all the distinct tag values programmatically and iterate programmatically through these values using
"select data from table where '"+ tags[i] +"' in (tag) order by timestamp desc limit 1"
Thanks,
This comes close, but beware if you have two rows with the same tag share a maximum timestamp you will get duplicates in the result set
select data from table
join (select tag, max(timestamp) maxtimestamp from table t1 group by tag) as latesttags
on table.tag = latesttags.tag and table.timestamp = latesttags.maxtimestamp

Grouping fields from one to many relationship in postgres

I need to group fields in a child table in one query in postgres.
I have following data
Stores:
| id | name |
|----|------|
| 1 | abcd |
Features:
| id | store | name | other |
|----|-------|------|-------|
| 1 | 1 | door | metal |
| 2 | 1 | fork | green |
I've got to this query
SELECT
stores.id,
stores.name,
concate_ws(',', features.id, features.name, features.other)
FROM stores
LEFT JOIN features
ON(features.store=stores.id)
WHERE stores.id =1
GROUP BY stores.id, features.id;
This is best I've got so far but yields 2 tuples
1, abcd, (1,door,metal)
1, abcd, (2,fork,green)
I'd like to be able to get one row with the features '|' concatenated like so
1, abcd ,(1,door,metal|2,fork,green)
Use string_agg():
SELECT stores.id,
stores.name,
string_agg(concate_ws(',', features.id, features.name, features.other), '|')
FROM stores
LEFT JOIN features ON features.store=stores.id
WHERE stores.id =1
GROUP BY stores.id, stores.name;

Sum of the most recent non-null columns (window function with "ignore nulls")

I am using PostgreSQL 9.1.9.
In the project I am working on, some most recent records have null columns because that information was not available when that row was created. I have a view that lists the sum of rows that belongs to the members of a group. As of right now, the view shows the sum of the most recent columns, which uses null values if those are the most recent values. For example,
table1
group_name | member
-------------------
group1 | Andy
group1 | Bob
table2
name | stat_date | col1 | col2 | col 3
--------------------------------------
Andy | 6/19/13 | null | 1 | 2
Andy | 6/18/13 | 100 | 3 | 5
Bob | 6/19/13 | 50 | 9 | 12
Bob | 6/18/13 | 111 | 31 | 51
-- creating view would be something like this...
create view v_grouped as
select table1.group_name, stat_date,
sum(col1) as col1_sum, sum(col2) as col2_sum, sum(col3) as col3_sum
from table1
join table2 on table1.member = table2.name
group by table1.group_name, table2.stat_date;
Current view looks like this:
group_name | stat_date | col1_sum | col2_sum | col3_sum
-------------------------------------------------------
group1 | 6/19/13 | 50 | 10 | 14
group2 | 6/18/13 | 211 | 34 | 56
Instead of 50, 150 would be a closer representation of what the actual group total is, despite lack of data for 6/19. So, I want an output of
group_name | stat_date | col1_sum | col2_sum | col3_sum
-------------------------------------------------------
group1 | 6/19/13 | 150 | 10 | 14
group2 | 6/18/13 | 211 | 34 | 56
I've been looking at first_value() from window functions as a possible function to use. I found that Oracle's first_value() supports the ignore nulls option which I believe will do what I want (http://psoug.org/definition/FIRST_VALUE.htm). According to this page I linked, about PL/SQL's first_value() function:
If the first value in the result set is NULL then the function returns NULL unless you specify IGNORE NULLS.
If you use the IGNORE NULLS parameter then FIRST_VALUE will return the first non-null value found in the result set. (If all
values are null then it will return NULL.)
Example Syntax: FIRST_VALUE(expression [INGORE NULLS]) OVER (analytic_clause)
But PostgreSQL's first_value() does not support such an option. Is there a way to do this in PostgreSql? Thank you in advance!
You can use this custom aggregate as a postgres variant of FIRST_VALUE(expression INGORE NULLS). Or build your own aggregate with desired behavior.
Is this what you are trying to describe?
SELECT sum(col1), sum(col2), sum(col3) FROM table2 WHERE col1 IS NOT NULL
(although I omitted the join on table1; that is an exercise for the reader)

How do I get a single result, from two tables, where the 2nd table contains an updated version of a record from the 1st?

I have two tables, CompanyAddresses & MyCompanyAddresses. (Names changed to protect the guilty).
CompanyAddresses holds a list of default addresses for companies. These records are immutable. The user can change the details of a company address, but those changes are stored MyCompanyAddresses.
How can I produce a single list of addresses from both tables, excluding records from CompanyAddresses where a corresponding record exists in MyCompanyAddresses?
Sample Data
CompanyAddresses
DatabaseId | Id | Code | Name | Street | City | Zip | Maint Date
1 | Guid1 | APL | Apple | 1 Infinite Loop | Cupertino | 95014 | 11/1/2012
2 | Guid2 | MS | Microsoft | One Microsoft Way | Redmond | 98052 | 11/1/2012
MyCompanyAddresses
DatabaseId | Id | Code | Name | Street | City | Zip | Maint Date
5 | Guid3 | APL | Apple | Updated Address | Cupertino | 95014 | 11/6/2012
Desired Results
DatabaseId | Id | Code | Name | Street | City | Zip | Maint Date
2 | Guid2 | MS | Microsoft | One Microsoft Way | Redmond | 98052 | 11/1/2012
5 | Guid3 | APL | Apple | Updated Address | Cupertino | 95014 | 11/6/2012
I've tried various permutations of MS SQL's UNION, EXCEPT & INTERSECT to no avail. Also, I don't believe JOIN's are the answer either, but I'll be happily proven wrong.
The database design can be changed, but it would be preferable if it stayed the same.
Use a LEFT JOIN in combination with COALESCE. If the JOIN finds a match, the COALESCE will select values from the overridden row. If no match is found, the original values are returned.
SELECT ca.DatabaseId,
COALESCE(mca.Id, ca.Id) AS Id,
COALESCE(mca.Name, ca.Name) AS Name,
COALESCE(mca.Street, ca.Street) AS Street,
COALESCE(mca.City, ca.City) AS City,
COALESCE(mca.Zip, ca.Zip) AS Zip,
COALESCE(mca.MaintDate, ca.MaintDate) AS MaintDate,
FROM CompanyAddresses ca
LEFT JOIN MyCompanyAddresses mca
ON ca.Code = mca.Code;