Relational databse design to represent similarity between rows of same table

Relational databse design to represent similarity between rows of same table - postgresql

For background purposes: I'm using PostgreSQL with SQLAlchemy (Python).
Given a table of unique references as such:
references_table
-----------------------
id | reference_code
-----------------------
1 | CODEABCD1
2 | CODEABCD2
3 | CODEWXYZ9
4 | CODEPOIU0
...
In a typical scenario, I would have a separate items table:
items_table
-----------------------
id | item_descr
-----------------------
1 | `Some item A`
2 | `Some item B`
3 | `Some item C`
4 | `Some item D`
...
In such typical scenario, the many-to-many relationship between references and items is set in a junction table:
references_to_items
-----------------------
ref_id (FK) | item_id (FK)
-----------------------
1 | 4
2 | 1
3 | 2
4 | 1
...
In that scenario, it is easy to model and obtain all references that are associated to the same item, for instance item 1 has references 2 and 4 as per table above.
However, in my scenario, there is no items_table. But I would still want to model the fact that some references refer to the same (non-represented) item.
I see a possibility to model that via a many-to-many junction table as such (associating FKs of the references table):
reference_similarities
-----------------------
ref_id (FK) | ref_id_similar (FK)
-----------------------
2 | 4
2 | 8
2 | 9
...
Where references with ID 2, 4, 8 and 9 would be considered 'similar' for the purposes of my data model.
However, the inconvenience here is that such model requires to choose one reference (above id=2) as a 'pivot', to which multiple others can be declared 'similar' in the reference_similarities table. Ref 2 is similar to 4 and ref 2 is similar to 8 ==> thus 4 is similar to 8.
So the question is: is there a better design that doesn't involve having a 'pivot' FK as above?
Ideally, I would store the 'similarity' as an Array of FKs as such:
reference_similarities
------------------------
id | ref_ids (Array of FKs)
------------------------
1 | [2, 4, 8, 9]
2 | [1, 3, 5]
..but I understand from https://dba.stackexchange.com/questions/60132/foreign-key-constraint-on-array-member that it is currently not possible to have foreign keys in PostgreSQL arrays. So I'm trying to figure out a better design for this model.

I can understand that you want to group items in a set, and able to query the set from any of item in it.
You can use a hash function to hash a set, then use the hash as pivot value.
For example you have a set of values (2,4,8,9), it will be hashed like this:
hash = ((((31*1 + 2)*31 + 4)*31 + 8)*31 + 9
you can refer to Arrays.hashCode in Java to know how to hash a list of values.
int result = 1;
for (Object element : a)
result = 31 * result + (element == null ? 0 : element.hashCode());
Table reference_similarities:
reference_similarities
-----------------------
ref_id (FK) | hash_value
-----------------------
2 | hash(2, 4, 8, 9) = 987204
4 | 987204
8 | 987204
9 | 987204
To query the set, you can first query hash_value from ref_id first, then, get all ref_id from hash_value.
The draw back of this solution is every time you add a new value to a set, you have to rehash the set.
Another solution is you can just write a function in Python to produce a unique hash_value when creating a new set.

Related

How to convert nested json to data frame with kdb+

I am trying to get the data from cryptostats like below, it gives me back a nested json. I want it to be in a table format. How do I do that?
query:"https://api.cryptostats.community/api/v1/fees/oneDayTotalFees/2023-02-07";
raw:.Q.hg query;
res:.j.k raw;
To get json file, use https://api.cryptostats.community/api/v1/fees/oneDayTotalFees/2023-02-07
To view json code into a table format, use https://jsongrid.com/json-grid
Final result would be a kdb+ table which has all the cols from nested json output

They are all dictionaries
q)distinct type each res[`data]
,99h
But they do not collapse to a table because they do not all have matching keys
q)distinct key each res[`data]
`id`bundle`results`metadata`errors
`id`bundle`results`metadata
Looking at a row where errors is populated we can see it is a dictionary
q)res[`data;0;`errors]
oneDayTotalFees| "Error executing oneDayTotalFees on compound: Date incomplete"
You can create a prototype dictionary with a blank errors key in it and join , each piece of data onto it. This will result in uniform dictionaries which will be promoted to a table type 98h
q)table:(enlist[`errors]!enlist (`$())!()),/:res`data
q)type table
98h
Row which already had errors is unaffected:
q)table 0
errors | (,`oneDayTotalFees)!,"Error executing oneDayTotalFees..
id | "compound"
bundle | 0n
results | (,`oneDayTotalFees)!,0n
metadata| `source`icon`name`category`description`feeDescription;..
Row which previously did not have errors now has a valid empty dictionary
q)table 1
errors | (`symbol$())!()
id | "swapr-ethereum"
bundle | "swapr"
results | (,`oneDayTotalFees)!,24.78725
metadata| `category`name`icon`bundle`blockchain`description`feeDescription..
https://kx.com/blog/kdb-q-insights-parsing-json-files/
https://code.kx.com/q/ref/join/
https://code.kx.com/q/kb/faq/#construction
https://code.kx.com/q/basics/datatypes/
https://code.kx.com/q/ref/maps/#each-left-and-each-right
If you want to explore nested objects you can index at depth (see blog post linked above). If you have many sparse keys leaving it like this is efficient for storage:
q)select tokenSymbol:metadata[::;`tokenSymbol] from table where not ""~/:metadata[::;`tokenSymbol]
tokenSymbol
-----------
"HNY"
If you do wish to explode a nested field you can run similar to:
q)table:table,'{flip c!flip table[`metadata]#\:(c:distinct raze key each table[`metadata])}[]
q)meta table
c | t f a
----------------| -----
errors |
id | C
bundle | C
results |
metadata |
source | C
icon | C
name | C
category | C
description | C
feeDescription | C
blockchain | C
website | C
tokenTicker | C
tokenCoingecko | C
protocolLaunch | C
tokenLaunch | C
adapter | C
subtitle | C
events | C
shortName | C
protocolShutdown| C
tokenSymbol | C
subcategory | C
tokenticker | C
tokencoingecko | C
Care needs to be taken will filling in nulls and keeping consistent types of data in each column. In this dataset the events tag inside metadata is tabular data:
q)select distinct type each events from table
events
------
10
98
0
This would need to be cleaned similar to:
q)table:update events:count[i]#enlist ([] date:();description:()) from table where not 98h=type each events

The data returned from the API contains dictionaries with two distinct sets of keys:
q)distinct key each res`data
`id`bundle`results`metadata`errors
`id`bundle`results`metadata
One simple way to convert this to a table is to enlist each dictionary first, converting them to tables, then joining with uj:
q)(uj/)enlist each res`data
id bundle results metadata ..
-----------------------------------------------------------------------------..
"compound" 0n (,`oneDayTotalFees)!,0n `source`i..
"swapr-ethereum" "swapr" (,`oneDayTotalFees)!,24.78725 `category..
...
This works as uj generalises the join operator ,, allowing different schemas with common elements to be combined.

Best way to maintain an ordered list in PostgreSQL?

Say I have a table called list, where there are items like these (the ids are random uuids):
id rank text
--- ----- -----
x 0 Hello
x 1 World
x 2 Foo
x 3 Bar
x 4 Baz
I want to maintain the property that rank column always goes from 0 to n-1 (n being the number of rows)---if a client asks to insert an item with rank = 3, then the pg server should push the current 3 and 4 to 4 and 5, respectively:
id rank text
--- ----- -----
x 0 Hello
x 1 World
x 2 Foo
x 3 New Item!
x 4 Bar
x 5 Baz
My current strategy is to have a dedicated insertion function add_item(item) that scans through the table, filter out items with rank equal or greater than that of the item being inserted, and increment those ranks by one. However, I think this approach will run into all sorts of problems---like race conditions.
Is there a more standard practice or more robust approach?
Note: The rank column is completely independent of rest of the columns, and insertion is not the only operation I need to support. Think of it as the back-end of a sortable to-do list, and the user can add/delete/reorder the items on the fly.

Doing verbatim what you suggest might be difficult or not possible at all, but I can suggest a workaround. Maintain a new column ts which stores the time a record is inserted. Then, insert the current time along with rest of the record, i.e.
id rank text ts
--- ----- ----- --------------------
x 0 Hello 2017-12-01 12:34:23
x 1 World 2017-12-03 04:20:01
x 2 Foo ...
x 3 New Item! 2017-12-12 11:26:32
x 3 Bar 2017-12-10 14:05:43
x 4 Baz ...
Now we can easily generate the ordering you want via a query:
SELECT id, rank, text,
ROW_NUMBER() OVER (ORDER BY rank, ts DESC) new_rank
FROM yourTable;
This would generate 0 to 5 ranks in the above sample table. The basic idea is to just use the already existing rank column, but to let the timestamp break the tie in ordering should the same rank appear more than once.

you can wrap it up to function if you think its worth of:
t=# with u as (
update r set rank = rank + 1 where rank >= 3
)
insert into r values('x',3,'New val!')
;
INSERT 0 1
the result:
t=# select * from r;
id | rank | text
----+------+----------
x | 0 | Hello
x | 1 | World
x | 2 | Foo
x | 3 | New val!
x | 4 | Bar
x | 5 | Baz
(6 rows)
also worth of mention you might have concurrency "chasing condition" problem on highly loaded systems. the code above is just a sample

You can have a “computed rank” which is a double precision and a “displayed rank” which is an integer that is computed using the row_number window function on output.
When a row is inserted that should rank between two rows, compute the new rank as the arithmetic mean of the two ranks.
The advantage is that you don't have to update existing rows.
The down side is that you have to calculate the displayed ranks before you can insert a new row so that you know where to insert it.
This solution (like all others) are subject to race conditions.
To deal with these, you can either use table locks or serializable transactions.

The only way to prevent a race condition would be to lock the table
https://www.postgresql.org/docs/current/sql-lock.html
Of course this would slow you down if there are lots of updates and inserts.
If can somehow limit the scope of your updates then you can do a SELECT .... FOR UPDATE on that scope. For example if the records have a parent_id you can do a select for update on the parent record first and any other insert who does the same select for update would have to wait till your transaction is done.
https://www.postgresql.org/docs/current/explicit-locking.html#:~:text=5.-,Advisory%20Locks,application%20to%20use%20them%20correctly.
Read the section on advisory locks to see if you can use those in your application. They are not enforced by the system so you'll need to be careful of how you write your application.

Check if field value is in a list of strings in SSRS report

I'm using SSRS (VS2008) and creating a report of work orders. In the detail line of the report table, I have the following columns (with some fake data)
WONUM | A | B | Hours
ABC123 | 3 | 0 | 3
SPECIAL| 0 | 6 | 6
DEF456 | 5 | 0 | 5
GHI789 | 4 | 0 | 4
OTHER | 0 | 2 | 2
As you can kind of see, all work orders have a work order number (WONUM) as well as a total # of hours (HOURS). I need to put the hours into either column A or column B based on WONUM. I have a list of specifically named work orders (in the example, they would be "SPECIAL" and "OTHER") which would cause the HOURS value to be put in column B. If the WONUM is NOT a special named one, then it goes in column A. Here's what I WANTED to put as the expression for column A and column B:
Column A: =IIF(Fields!WONUM.Value IN ("SPECIAL","OTHER"), 0, Fields!Hours.Value)
Column B: =IIF(Fields!WONUM.Value IN ("SPECIAL","OTHER"), Fields!Hours.Value, 0)
But as you're probably aware, Fields!WONUM.Value IN ("SPECIAL","OTHER") is not a valid method of doing this! What is the best way to make this work? I cannot flag it in the SQL query in any other way for other reasons so it must be done in the table.
Thanks in advance for any and all help!

Try this, (Using InStr() function)
IIF(InStr(Fields!WONUM.Value,"SPECIAL")>0 OR InStr(Fields!WONUM.Value,"OTHER")>0, 0, Fields!Hours.Value)
IIF(InStr(Fields!WONUM.Value,"SPECIAL")>0 OR InStr(Fields!WONUM.Value,"OTHER")>0, Fields!Hours.Value,0)

If it's just the two WONUMs then you can do this:
Column A:
=IIF((Fields!WONUM.Value <> "SPECIAL") AND (Fields!WONUM.Value <> "OTHER"), Fields!Hours.Value, 0)
Column B:
=IIF((Fields!WONUM.Value = "SPECIAL") OR (Fields!WONUM.Value = "OTHER"), Fields!Hours.Value, 0)
or use the same formula in each column for consistency and swap the field/0 at the end.

How to remove column-duplicates from the query result using entity-framework?

On my database table I have
Key | Value
a | 1
a | 2
b | 11
c | 1
d | 2
b | 3
But I just need to get the items which keys are not duplicates of the previous rows. The desired result should be:
Key | Value
a | 1
b | 11
c | 1
d | 2
How could we get the desired result using entity-framework?
Note: we need the first value. Thank you very much.

var q = from e in Context.MyTable
group e by e.Key into g
select new
{
Key = g.Key,
Value = g.OrderBy(v => v.Value).FirstOrDefault()
};

You should look at either writing a View in the database and mapping your entity to that.
Or creating a DefiningQuery in the part of your EDMX (aka the bit that ends up in the SSDL file).
See Tip 34 for more information.
Conceptually both approaches allow you to write a view that excludes the 'duplicate rows'. The difference is just where the view lives.
If you have control over the database - I'd put the view in the database
If not you can put the view in your inside the and then map to that.
Hope this helps
Alex

EF 4 - associations with keys that dont match

We're using POCOs and have 2 entities: Item and ItemContact. There are 1 or more contacts per item.
Item has as a primary key:
ItemID
LanguageCode
ItemContact has:
ItemID
ContactID
We cant add an association with a referrential constraint as they have differing keys. There isnt a strict primary / foreign key as languageCode isnt in ItemContact and ContactID isnt in Item.
How can we go about mapping this with an association for contacts for an item if there isnt a direct link but I still want to see the contacts for an item?
One of the entities originates in a database view so it is not possible to add foreign keys to the database
Thanks
Stephen Ward

In order to create any relationship (in EF or any ORM for that matter) you have to have something to Join on.
Because at the moment your don't, you need to fabricate something...
The only option I can think of is to create a Relationship - using some of the same techniques described in here to create an SSDL view to back the relationship using a <DefiningQuery> based on a cross product join.
So if you have data like this:
ItemID | LanguageCode
1 | a
and this:
ItemID | ContactID
1 | x
1 | y
1 | z
Then your <DefiningQuery> should have T-SQL that produces something like this:
Item_ItemID | Item_LanguageCode | ItemContact_ItemID | ItemContact_ContactID
1 | a | 1 | x
1 | a | 1 | y
1 | a | 1 | z
Now because this is technically an Independent Association - as opposed to an FK association - you should be able to claim in the CSDL that the cardinality is 1 - * even though there is nothing in the SSDL to constrain it - and stop it from being a * - *.
Hope this helps
Alex

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Relational databse design to represent similarity between rows of same table - postgresql

Related

How to convert nested json to data frame with kdb+

Best way to maintain an ordered list in PostgreSQL?

Check if field value is in a list of strings in SSRS report

How to remove column-duplicates from the query result using entity-framework?

EF 4 - associations with keys that dont match

Categories

Resources