Filter and display database audit / changelog (activity stream) - postgresql

I'm developing an application with SQLAlchemy and PostgreSQL. Users of the system modify data in 8 or so tables. Consider this contrived example schema:
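A stand-in sketch (table and column names are made up, based on the entities mentioned later; every table uses a uuid primary key):

-- Purely illustrative; the real application has ~8 tables along these lines.
CREATE TABLE store (
    id   uuid PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE product (
    id       uuid PRIMARY KEY,
    store_id uuid NOT NULL REFERENCES store (id),
    name     text NOT NULL
);

CREATE TABLE customer (
    id   uuid PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE orders (
    id          uuid PRIMARY KEY,
    customer_id uuid NOT NULL REFERENCES customer (id),
    product_id  uuid NOT NULL REFERENCES product (id),
    created     timestamptz NOT NULL DEFAULT now()
);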
I want to add visible logging to the system to record what has changed, but not necessarily how it has changed. For example: "User A modified product Foo", "User A added user B" or "User C purchased product Bar". So basically I want to store:
Who made the change
A message describing the change
Enough information to reference the object that changed, e.g. the product_id and customer_id when an order is placed, so the user can click through to that entity
I want to show each user a list of recent and relevant changes when they log in to the application (a bit like the main timeline in Facebook etc). And I want to store subscriptions, so that users can subscribe to changes, e.g. "tell me when product X is modified", or "tell me when any products in store S are modified".
I have seen the audit trigger recipe, but I'm not sure it's what I want. That audit trigger might do a good job of recording changes, but how can I quickly filter it to show recent, relevant changes to the user? Options that I'm considering:
Have one column per ID type in the log and subscription tables, with an index on each column
Use full text search, combining the ID types as a tsvector
Use an hstore or json column for the IDs, and index the contents somehow
Store references as URIs (strings) without an index, and walk over the logs in reverse date order, using application logic to filter by URI
Any insights appreciated :)
Edit: It seems what I'm talking about is an activity stream. The suggestion in this answer to filter by time first is sounding pretty good.

Since the objects all use uuid for the id field, I think I'll create the activity table like this:
Have a generic reference to the target object, with a uuid column with no foreign key, and an enum column specifying the type of object it refers to.
Have an array column that stores generic uuids (maybe as text[]) of the target object and its parents (e.g. parent categories, store and organisation), and search the array for matching subscriptions. That way a subscription for a parent category can match a child in one step (denormalised).
Put a btree index on the date column, and (maybe) a GIN index on the array UUID column.
I'll probably filter by time first to reduce the amount of searching required. Later, if needed, I'll look at using GIN to index the array column (this partially answers my question "Is there a trick for indexing an hstore in a flexible way?")
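A sketch of what the activity and subscription tables might look like (column names and types here are illustrative, not the final schema; object_ref could equally be text[]):

CREATE TYPE object_type AS ENUM
    ('product', 'store', 'category', 'organisation', 'user', 'order');

CREATE TABLE activity (
    id          uuid PRIMARY KEY,
    created     timestamptz NOT NULL DEFAULT now(),
    actor_id    uuid NOT NULL,          -- who made the change
    message     text NOT NULL,          -- describes the change
    object_type object_type NOT NULL,   -- what kind of object this refers to
    object_id   uuid NOT NULL,          -- generic reference, no foreign key
    object_ref  uuid[] NOT NULL         -- the object plus its parents (category, store, organisation, ...)
);

CREATE TABLE subscription (
    user_id    uuid NOT NULL,
    object_id  uuid NOT NULL,           -- the object (or parent object) being watched
    subscribed boolean NOT NULL DEFAULT true,
    PRIMARY KEY (user_id, object_id)
);

-- Filter by time first ...
CREATE INDEX activity_created_idx ON activity USING btree (created);
-- ... and, if needed later, index the ancestor array for containment searches
CREATE INDEX activity_object_ref_idx ON activity USING gin (object_ref);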
Update: This is working well. The SQL to fetch a timeline looks something like this:
SELECT *
FROM (
    SELECT DISTINCT ON (activity.created, activity.id)
        *
    FROM activity
    LEFT OUTER JOIN unnest(activity.object_ref) WITH ORDINALITY
        AS act_ref (act_ref, ordinality)
        ON true
    LEFT OUTER JOIN subscription
        ON subscription.object_id = act_ref.act_ref
    WHERE activity.created BETWEEN :lower_date AND :upper_date
      AND subscription.user_id = :user_id
    ORDER BY activity.created DESC,
             activity.id,
             act_ref.ordinality DESC
) AS sub
WHERE sub.subscribed = true;
Joining with unnest(...) WITH ORDINALITY, ordering by ordinality, and selecting distinct on the activity ID filters out activities that have been unsubscribed from at a deeper level. If you don't need to do that, then you could avoid the unnest and just use the array containment @> operator, with no subquery:
SELECT *
FROM activity
JOIN subscription
  ON activity.object_ref @> ARRAY[subscription.object_id]
WHERE subscription.user_id = :user_id
  AND activity.created BETWEEN :lower_date AND :upper_date
ORDER BY activity.created DESC;
You could also join with the other object tables to get the object titles - but instead, I decided to add a title column to the activity table. This is denormalised, but it doesn't require a complex join with many tables, and it tolerates objects being deleted (which might be the action that triggered the activity logging).

Related

Sqlalchemy way to handle select based on outdated data

Assume I have a table sets with a field filters containing an array of key-value mappings, a table items to which a select query is applied to extract rows based on these filters, and an association table for the M:M relation that links each set with each item. I am looking for a method or mechanism to cancel the select query if sets.filters was updated in the meantime; otherwise the M:M relation will be built from stale, not-yet-refreshed filters and end up invalid.
The concrete scenario where the problem occurs is:
1. Receive a file with items data, parse it, and insert into items, returning the new relevant ids (primary keys here);
2. After insertion, select the filters from the relevant sets;
3. Take the item ids and select from items using those filters;
4. Update the M:M association table for all the items returned at step 3.
So, unfortunately, between steps 3 and 4 (or even earlier), an API call updates one of the sets rows, changing its filters. As a result the M:M table is invalid, because one filter changed (let's say the filters contained a weight <= 100 kilos expression, but after the update it became weight <= 50 kilos, so any new items heavier than 50 should obviously not have their ids in the M:M table).
Is there an efficient way to cancel the select query from items during the transaction? Or maybe there is a stronger query to use. My idea is to roll back the changes after the fact by checking the sets.modified_at column, but that seems like extra work that wastes disk and CPU time.
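A rough sketch of that modified_at idea (the exact statements are assumptions; only the tables and columns come from the description above):

BEGIN;

-- Step 2: read the filters and remember when they were last modified
SELECT filters, modified_at
FROM sets
WHERE id = :set_id;   -- the application keeps this modified_at value

-- Steps 3 and 4: select the matching items and build the M:M rows ...

-- Before committing, check whether the filters changed underneath us
SELECT modified_at
FROM sets
WHERE id = :set_id;

-- If this modified_at differs from the value read at step 2,
-- issue ROLLBACK instead of COMMIT and redo the work.
COMMIT;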

Track history of ManyToMany Relationship table with extra fields

I have a many to many relationship table (named UserLabel) in the Postgres db with some extra fields. I want to be able to track the history of changes to this many to many table. I came up with the following structure. I'd like to know if there's any better way of implementing it.
User
    id
Label
    id
UserLabel
    id
    user_id
    label_id
    label_info (jsonb)
    is_deleted (true or false)
UserLabel can contain more than one record with the same user_id and label_id but with different label_info. At any point in time, if I want to query for all the labels for a given user, I can do that using this table. Now, updates could occur on this table on the label_id, label_info or is_deleted fields. I want to be able to know, at any given point in time, what the labels and label info of a user were. For this, I'm using the table below.
UserLabelEvent
    id
    user_label_id
    user_id
    label_id
    label_info
    change_type (value will be one of (create, update, delete))
    created_timestamp
If I want to check the user labels for any user at any time, I just have to query on user_id and created_timestamp, order the records by created_timestamp, and loop over them to construct the user labels at that point in time.
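For example, a query along these lines (the exact column list is an assumption) would return the events to replay:

-- Events up to the point in time of interest, in order;
-- the application then replays create/update/delete to rebuild the labels.
SELECT user_label_id, label_id, label_info, change_type, created_timestamp
FROM UserLabelEvent
WHERE user_id = :user_id
  AND created_timestamp <= :as_of
ORDER BY created_timestamp;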
The problems in my current approach:
By default, anyone looking at the schema of the UserLabel table would assume there cannot be more than one record with the same user_id and label_id.
By looking at UserLabelEvent, it's not obvious how that table works.
I need to do some post processing to find out the user labels at any given time. By post processing, I mean looping over the query results and constructing the user labels.
Please do suggest any other problems you find with this approach. I will update the post with new inputs.

Multiple optional query parameters with PostgreSQL

I use PostgreSQL 10 and I want to build queries that have multiple optional parameters.
A user must input an area name, but it is optional to pick none or any combination of the following: event, event date, category, category date, style.
So a full query could be "all the banks (category), constructed in 1990 (category date) with modern architecture (style), that got renovated in 1992 (event and event date) in the area of NYC (area) ".
My problem is that all those are in different tables, connected by many-to-many tables, so I cannot do something like
SELECT * FROM mytable
WHERE (Event IS NULL OR Event = event)
I don't know if any good will come of just joining all four tables.
I can easily find the area id, since it is required, but I don't know what the user chose besides that.
Any suggestions on how to approach this with Postgres?
Thanks
It might be optimal to build the entire query dynamically and only join in tables that you know you're going to need in order to apply the user's filters, but it's impractical. You're better off creating a view on the full set of tables. Use LEFT OUTER JOINs to ensure that you don't accidentally filter out valid combinations and index your tables to ensure that the query planner can navigate the table graph quickly. Then query the view with a WHERE clause reflecting only the filters you want to apply.
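As a sketch, with made-up table and column names (the question doesn't show the real schema):

-- A view over the full graph of tables; every name here is illustrative.
CREATE VIEW building_search AS
SELECT b.id,
       b.area_id,
       e.name        AS event,
       e.event_date,
       c.name        AS category,
       bc.category_date,
       s.name        AS style
FROM building b
LEFT OUTER JOIN building_event be    ON be.building_id = b.id
LEFT OUTER JOIN event e              ON e.id = be.event_id
LEFT OUTER JOIN building_category bc ON bc.building_id = b.id
LEFT OUTER JOIN category c           ON c.id = bc.category_id
LEFT OUTER JOIN building_style bs    ON bs.building_id = b.id
LEFT OUTER JOIN style s              ON s.id = bs.style_id;

-- Optional parameters: pass NULL for anything the user didn't pick.
SELECT *
FROM building_search
WHERE area_id = :area_id
  AND (:event         IS NULL OR event = :event)
  AND (:event_date    IS NULL OR event_date = :event_date)
  AND (:category      IS NULL OR category = :category)
  AND (:category_date IS NULL OR category_date = :category_date)
  AND (:style         IS NULL OR style = :style);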
If performance becomes a concern and you don't mind having non-realtime data, you could use a materialized view to cache the results. Materialized views can be indexed directly, but this is a pretty radical change so don't do this unless you have to.

TSQL - Deleting with Inner Joins and multiple conditions

My question is a variation on one already asked and answered (TSQL Delete Using Inner Joins) but I have a different level of complexity and I couldn't see a solution to it.
My requirement is to delete Special Prices which haven't been accessed in 90 days. Special Prices are keyed on Customer ID and Product ID, and the products have to be matched to a Customer Order Detail table which also contains a Customer ID and a Product ID. I want to write one function that will look at the Special Price table for each Customer, compare each Product for that Customer with the Customer Order Detail table and, if the Maximum Order Date is more than 90 days earlier than today, delete it from the Special Price table.
I know I can use a CURSOR (slow but effective) but would prefer to have a single query like the one in the TSQL Delete Using Inner Joins example. Any ideas and/or is more information required?
I can't dig deeper into the specifics of your system, but if it's an option for you, check out the MERGE statement; it might help as an alternative to using cursors. See the documentation on MERGE.
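Alternatively, a single-statement DELETE along the lines the question describes might look roughly like this (the table and column names are guesses based on the description):

-- Delete special prices whose most recent matching order is more than 90 days old.
DELETE sp
FROM SpecialPrice AS sp
INNER JOIN (
    SELECT CustomerID, ProductID, MAX(OrderDate) AS LastOrderDate
    FROM CustomerOrderDetail
    GROUP BY CustomerID, ProductID
) AS lastOrder
    ON lastOrder.CustomerID = sp.CustomerID
   AND lastOrder.ProductID = sp.ProductID
WHERE lastOrder.LastOrderDate < DATEADD(DAY, -90, GETDATE());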

Most efficient database schema for counting keywords

I'm working on an iPhone app with a GAE backend. I currently have a database of ~8000 products and each product has 5 keywords, mined from reviews, that are the words used most often to describe the product. Once I deploy the app, I'd like to allow users to add new products, and add their 5 keywords to existing products. So, when "reviewing" an existing product, they would add their 5 words, and these would be reflected in the Top 5 words if they push a word over into the Top 5. These keywords will be selected via a large whitelist with indirect selection so I can control the user input. I'd like the application to scale to thousands of users without hitting my backend too hard.
My question is:
What's the most efficient database schema for keeping track of all the words for a product and calculating the top 5 for each product once it's updated?
My two ideas (which may be terrible):
Have a "words" column which contains a 2d array, one dimension is the word, the other is the count for that word. They would then be incremented/decremented as needed.
Have a database with each word as a column and each product as a row and the corresponding row/column would contain the count.
The easiest way to do this would be to have a 'tags' kind, defined something like this (you haven't specified a backend language, so I'm assuming Python):
from google.appengine.ext import db

class Tag(db.Model):
    # Tags should be child entities of Products and have key name based on the tag
    # eg, created with Tag(parent=a_product, key_name='awesome', ...)
    count = db.IntegerProperty(required=True, default=0)

    @classmethod
    def increment_tags(cls, product, tag_names):
        def _tx():
            tags = cls.get_by_key_name(tag_names, parent=product)
            for i, tag in enumerate(tags):
                if tag is None:
                    # New tag
                    tags[i] = tag = cls(key_name=tag_names[i], parent=product)
                tag.count += 1
            db.put(tags)
        return db.run_in_transaction(_tx)

    @classmethod
    def get_top_product_tags(cls, product, num=5):
        return [x.key().name() for x
                in cls.all().ancestor(product).order('-count').fetch(num)]
The increment_tags method increments the count property on all the relevant tags. Since they all have the same parent entity, they're in the same entity group, so it can do this in a single transaction.
The get_top_product_tags method does a simple datastore query to find the num top ranked tags for a product.
You should use a normalized schema and let SQL and the database engine be your friend. Have a single table with a design like this:
create table KeywordUse
( AppID int
, UserID int
, Sequence int
, Word varchar(50) -- or whatever makes sense
)
You can also have an identity primary key if you like, but AppID + UserID + Sequence is a candidate key (i.e. the combination of these three must be unique).
To find the top 5 keywords for any app, do a SQL query like this:
select top 5
count(AppID) as Frequency -- If you have an identity PK count that instead.
, Word
from KeywordUse
where AppID = @AppIDVariable...
group by Word, AppID
order by count(AppID) desc
If you are really, really worried about performance you could denormalize the results of this query into a table that shows the words for each app. Then you'd have to work out how often to refresh that snapshot.
REVISED ANSWER:
As Nick Johnson so generously pointed out, aggregate functions are not available in GQL. However, the philosophy of my answer remains unchanged. Let the database engine do its job.
The table should be AppID, Word, and Frequency. (AppID and Word are the PK.) Each use of a word would then be added to its count as it is applied. When you want to know the top five words for an app, you select by AppID = @Value and order by Frequency (descending) with a LIMIT of 5.
You would need a separate table to track user keywords if that is important.
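Sketched in PostgreSQL syntax to make the revised design concrete (the table name KeywordCount and the upsert form are illustrative assumptions):

-- One row per (app, word); Frequency is incremented on each use.
create table KeywordCount
( AppID     int         not null
, Word      varchar(50) not null
, Frequency int         not null
, primary key (AppID, Word)
);

-- Record one use of a word.
insert into KeywordCount (AppID, Word, Frequency)
values (:app_id, :word, 1)
on conflict (AppID, Word) do update
    set Frequency = KeywordCount.Frequency + 1;

-- Top five words for an app.
select Word, Frequency
from KeywordCount
where AppID = :app_id
order by Frequency desc
limit 5;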