Can I obtain a PostgreSQL lock that prevents concurrent inserts for a portion of a table?

I have a join table attendees of rooms and users with an extra column that represents a randomly assigned string (from within a pool of strings).
I want to prevent two users entering at the same time from accidentally getting assigned the same string -- so I'd want User B to wait to look up the previous users in the room (and their assigned strings) until User A has completed inserting -- but I don't want to do a table lock, since a row insertion for, say, room_id = 12 doesn't affect an insertion for room_id = 77.
I can't use a unique constraint to solve this, since duplicate strings are possible in the case of a large number of users in a single room (the strings get reassigned evenly once all of them have been used once).*
My guess is doing something like
SELECT room_id, user_id, random_string FROM attendees WHERE room_id = ? FOR UPDATE
isn't going to help because it's not going to prevent User B from doing an insert -- and even if that SELECT FOR UPDATE prevented user B from doing the same call to read the rows corresponding to that room, what happens if both User A and User B are the first ones to join (and there aren't any rows for that room_id to lock)?
Would I use an advisory lock that's keyed on the room_id? Would that still help if I had multiple concurrent writes (e.g. User A should finish first, then User B, then User C)?
*Here's the example of why the strings aren't necessarily unique: say the pool of strings is "red", "blue", "green" -- the first three users that enter are each assigned one of them randomly, then the pool resets. The next three users are also assigned them randomly, so for the six users in the room, exactly two would have "red", two "blue", and two "green".

After a frustrating few days on SO I ended up getting a helpful response on Reddit that solves the problem:
Use a SELECT FOR UPDATE query on the rooms table, locking the row that corresponds to the room until the first user’s transaction completes.
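A minimal sketch of that approach, assuming a rooms table with primary key id alongside the attendees join table from the question:

BEGIN;

-- Lock the room's row itself; a concurrent transaction for the same room
-- blocks here until this one commits, while other rooms are unaffected.
SELECT id FROM rooms WHERE id = :room_id FOR UPDATE;

-- Now it's safe to read the strings already assigned in this room,
-- pick a free one in application code, and insert.
SELECT user_id, random_string FROM attendees WHERE room_id = :room_id;

INSERT INTO attendees (room_id, user_id, random_string)
VALUES (:room_id, :user_id, :chosen_string);

COMMIT;

This also handles the empty-room case: the lock is taken on the rooms row, which always exists, rather than on the (possibly absent) attendees rows. A transaction-scoped advisory lock keyed on the room id (pg_advisory_xact_lock) would serialize joiners in much the same way.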

Related

Is there a way to include a column from one table in many other tables (while maintaining consistency) in PostgreSQL?

I'm trying to build a database (in PostgreSQL 9.6.6) that allows for one "master column" (items.id) to be replicated into many (automatically generated) tables (e.g. rank1.id, rank2.id, rank3.id, ...). Only items will have INSERTs (or DELETEs) performed, and when they are, the newly added ids should also show up in (or be removed from) the rankX table(s). To be more concrete:
items:
id | name | description
rank1:
id | rank
rank2:
id | rank
...
Where the ids are always the same, and there is always the same number of rows in each of the tables. The rankX.rank values, however, will be different (imagine users ranking how funny a series of images is -- the images all have the same ids, but different users might rank them differently).
What I was thinking was that when a new user was added and a new rankX table created I would do the following:
Have rankX.id referencing a foreign key items.id (with ON DELETE CASCADE)
Copy any items.id that already exist
Auto-generate a trigger function that mirrors INSERTs on items to the rankX table (sketched below)
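For one generated table, step 3 might look something like this sketch (the names are illustrative; PostgreSQL 9.6 uses EXECUTE PROCEDURE):

CREATE FUNCTION mirror_items_to_rank1() RETURNS trigger AS $$
BEGIN
    -- mirror the newly inserted items.id into the generated table
    INSERT INTO rank1 (id) VALUES (NEW.id);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER items_to_rank1
AFTER INSERT ON items
FOR EACH ROW EXECUTE PROCEDURE mirror_items_to_rank1();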
This seems cumbersome and wasteful of space, since all of the rankX.id columns are identical, and I will end up with hundreds or thousands of trigger functions. As someone new to relational databases, I was hoping there was an easier way to achieve this.
So, I have a few questions:
Is there a more efficient way to define my tables such that all of this copying isn't necessary?
If this is the best way, can you give an example of how you would set up the triggers (and associated functions)?
Do I need to worry about running out of space on the server as I create (potentially many) sets of triggers of this type?

Filter and display database audit / changelog (activity stream)

I'm developing an application with SQLAlchemy and PostgreSQL. Users of the system modify data in 8 or so tables. Consider this contrived example schema:
I want to add visible logging to the system to record what has changed, but not necessarily how it has changed. For example: "User A modified product Foo", "User A added user B" or "User C purchased product Bar". So basically I want to store:
Who made the change
A message describing the change
Enough information to reference the object that changed, e.g. the product_id and customer_id when an order is placed, so the user can click through to that entity
I want to show each user a list of recent and relevant changes when they log in to the application (a bit like the main timeline in Facebook etc). And I want to store subscriptions, so that users can subscribe to changes, e.g. "tell me when product X is modified", or "tell me when any products in store S are modified".
I have seen the audit trigger recipe, but I'm not sure it's what I want. That audit trigger might do a good job of recording changes, but how can I quickly filter it to show recent, relevant changes to the user? Options that I'm considering:
Have one column per ID type in the log and subscription tables, with an index on each column
Use full text search, combining the ID types as a tsvector
Use an hstore or json column for the IDs, and index the contents somehow
Store references as URIs (strings) without an index, and walk over the logs in reverse date order, using application logic to filter by URI
Any insights appreciated :)
Edit: It seems what I'm talking about is an activity stream. The suggestion in this answer to filter by time first is sounding pretty good.
Since the objects all use uuid for the id field, I think I'll create the activity table like this:
Have a generic reference to the target object, with a uuid column with no foreign key, and an enum column specifying the type of object it refers to.
Have an array column that stores generic uuids (maybe as text[]) of the target object and its parents (e.g. parent categories, store and organisation), and search the array for matching subscriptions. That way a subscription for a parent category can match a child in one step (denormalised).
Put a btree index on the date column, and (maybe) a GIN index on the array UUID column.
I'll probably filter by time first to reduce the amount of searching required. Later, if needed, I'll look at using GIN to index the array column (this partially answers my question "Is there a trick for indexing an hstore in a flexible way?")
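As a sketch, the tables described above might look like this (the names and types are assumptions, not the actual schema):

CREATE TABLE activity (
    id          uuid PRIMARY KEY,
    created     timestamptz NOT NULL,
    object_type text NOT NULL,     -- stands in for the enum discriminator
    object_ref  text[] NOT NULL,   -- target uuid plus its parents' uuids
    title       text               -- denormalised title, discussed below
);

CREATE INDEX activity_created_idx ON activity (created);
CREATE INDEX activity_object_ref_idx ON activity USING gin (object_ref);

CREATE TABLE subscription (
    user_id    uuid NOT NULL,
    object_id  text NOT NULL,      -- matched against entries of activity.object_ref
    subscribed boolean NOT NULL DEFAULT true
);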
Update: this is working well. The SQL to fetch a timeline looks something like this:
SELECT *
FROM (
    SELECT DISTINCT ON (activity.created, activity.id) *
    FROM activity
    LEFT OUTER JOIN unnest(activity.object_ref) WITH ORDINALITY AS act_ref
        ON true
    LEFT OUTER JOIN subscription
        ON subscription.object_id = act_ref.act_ref
    WHERE activity.created BETWEEN :lower_date AND :upper_date
        AND subscription.user_id = :user_id
    ORDER BY activity.created DESC,
        activity.id,
        act_ref.ordinality DESC
) AS sub
WHERE sub.subscribed = true;
Joining with unnest(...) WITH ORDINALITY, ordering by ordinality, and selecting DISTINCT ON the activity ID filters out activities that have been unsubscribed from at a deeper level. If you don't need to do that, then you could avoid the unnest and just use the array containment operator @>, with no subquery:
SELECT *
FROM activity
JOIN subscription
    ON activity.object_ref @> ARRAY[subscription.object_id]
WHERE subscription.user_id = :user_id
    AND activity.created BETWEEN :lower_date AND :upper_date
ORDER BY activity.created DESC;
You could also join with the other object tables to get the object titles - but instead, I decided to add a title column to the activity table. This is denormalised, but it doesn't require a complex join with many tables, and it tolerates objects being deleted (which might be the action that triggered the activity logging).

NoSQL Table Schema

I'm trying to plan a NOSQL table schema. There are relationships in my data, but they are mostly what would be N:N in a relational db; there are very few normal 1:N relationships.
So in this case, I'm trying to create implicit relationships that will allow me to browse from both ends of the relationship. I'm using Azure Table Storage, so I understand that full-text searching isn't available; I can only retrieve an "object" by its Partition Key + Row Key combination.
So imagine I have a table called "People" and a table called "Hamburgers" and each object in the tables can be related to multiple objects in the other table. Hamburgers are eaten by many people, people each eat many hamburgers.
Since the relationship is probably weighted to the people side (i.e. there are more people per hamburger than vice versa), I would handle this in the tables like this:
Hamburger Table
Partition Key: Only 1 partition
Row Key: Unique ID
People Table
Partition Key: Only 1 partition
Row Key: Unique ID
"Columns": an extra value for every hamburger the person eats
Hamburger-People Table
Partition Key: Hamburger Row Key
Row Key: People Row Key
This way, if I'm looking at a hamburger and want to see all the people that eat it, I can go to the Hamburger-People table and use my Hamburger's Row Key to get the partition of all the people that eat the hamburger.
If I'm at a person and want to see all the hamburgers he/she eats, I have the extra values with the Row Keys of the hamburgers the person eats.
When inserting data into the tables, if the data involves a hamburger/person relationship, I would insert both values in the proper tables, then create the Hamburger-People entry. If I was trying to keep a duplicate-free list of hamburgers, I would need to search the Hamburger table first to make sure the hamburger wasn't already in there (like "Whopper" -- if it's in there, I wouldn't insert it again). Then I would need to insert a row into the hamburger's existing partition in the Hamburger-People table.
But for the most part, the no-duplicate requirement doesn't exist.
Is this a good best-practices approach to NOSQL schema, or am I going to run into problems later?
UPDATE
Also, I would like to be able to partition the data tables later, but I'm not sure how to do so with this structure; adding a 2nd partition to the Hamburger table would require me to store an extra value in the Hamburger-People table, and I'm not sure if that would start to be too complex.
OK, nice questions, and I think most of them are the ones every RDBMS developer faces as soon as they hit the NoSQL world:
1. How to group the partitions?
To get the best out of partitions, you need to remember that the load on your database should be distributed across your servers. Let's see what happens with your approach:
A person with key "A" enters the restaurant; you save the person and their burger, which is a Classic Tasty (key "T"). The person record goes to server X and the burger goes to server Y. Now a new customer enters with key "B" and wants something different, a burger "W"; again the person goes to server X, but this time the burger goes to server X too. Server X is now taking all the load: if you repeat this, you'll see that server X becomes a bottleneck, because 75% of the records go there (all the people and 50% of the burgers). And the problem gets worse when you try to query, because all the queries will hit server X.
To solve this you could use the key of the person as part of the partition key for the relationship, so the person will be partitioned on the same server as their burger relationships. This way your workload will be balanced, and you won't have any surprises if one of the servers goes down (the person and their hamburgers will be "lost" together) -- a consistent "inconsistency".
2. Should I use a "relationship" in a NoSQL database?
Remember that NoSQL means you are free to duplicate information whenever your problem requires it, to avoid "over-querying". If you store together the information that will commonly be queried together, you avoid round trips to the database. So if you store a "transaction" instead of "person and burgers", you will get better performance and avoid some hits to the database. Let's work through an example with real data, comparing your approach with "mine":
Joe Black comes to the restaurant and asks for a Tasty. Here you will do the following transactions:
Create a Joe Black record
Create a burger transaction record
If you want to list your daily transactions, you will need to:
Get all the records for the day from the person-burger "table", then go to the person "table" and retrieve the customers' names, and then go to the hamburger records and retrieve their names. (You won't be able to do cross-table queries, because some records could be on one server and others on the second one.)
OK, what if you create a "transactions" table and store the following JSON in it:
{
    "custid": "AAABCCC",
    "name": "Joe",
    "lastName": "Black",
    "date": "2012/07/07",
    "order": {
        "code": "Burger0001",
        "name": "Tasty",
        "price": 3.5
    }
}
I know you will have several records with the same "Tasty" description; that's denormalization, which is very useful when you apply NoSQL solutions to these types of problems. Now, how many writes did you need to store the information in the database? Just one! Wow... and how many queries will you need to retrieve the information at the end of the day? Again... just one. It will create some problems, but it will save you a lot of work too. Like... could you reprint the order easily? (Yes you can!) What if the name of the customer changes? Is that even possible?
I hope this helps you in some way.
I'm the creator of http://djondb.com, so I think that having inside knowledge gives me a different approach to these problems, based on what the database will be able to do. I'm not aware of how Azure will handle queries if you are not able to query the document values, just the row keys, but anyway I hope this gives you an insight.

How do you store and display if a user has voted or not on something?

I'm working on a voting site and I'm wondering how I should handle votes.
For example, on SO, when you vote on a question (or answer) your vote is stored, and each time I go back to the page I can see that I already voted on it, because the up/down buttons are colored.
How do you do that? I mean, I have several ideas, but I'm wondering if it won't be a heavy load for the database.
Here are my ideas:
Write a helper which checks, for every question, whether a vote has been cast.
That means the number of queries depends on the number of items displayed on the page (usually ~20).
Loop over my items to collect the ids, and write one query per page which returns, for each item, whether a vote has been cast (or NULL).
Looks OK, because it's only one query no matter how many items are on the page, but it may break some MVC/domain-model design, dunno.
When a user logs in (or a guest, for whom an anonymous user is created), retrieve all their votes, store them in the session, and if a new vote is cast, just add it to the session.
Looks nice because no queries are needed at all except the first one; however, depending on the number of votes cast (maybe a bunch for each user), this can increase the size of the session for each user and potentially make authentication slow.
How do you do it? Any other ideas?
For example: let's assume you have a table to store votes and the user who cast them.
Let's assume you keep votes in user_votes when a vote is cast, with a table structure something like the one below (a DDL sketch follows the column list).
id: int, auto-increment
user_id: int, foreign key referencing the users table
question_id: int, foreign key referencing the questions table
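A DDL sketch of that structure (MySQL-style syntax assumed; the unique constraint is my assumption, to enforce one vote per user per question):

CREATE TABLE user_votes (
    id          INT AUTO_INCREMENT PRIMARY KEY,
    user_id     INT NOT NULL REFERENCES users (id),
    question_id INT NOT NULL REFERENCES questions (id),
    UNIQUE (user_id, question_id)   -- one vote per user per question
);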
Now, since the user will be logged in, when you fetch the questions, do a left join on user_id against the user_votes table.
Something like
SELECT q.id, q.question, uv.id
FROM questions AS q
LEFT JOIN user_votes AS uv
    ON uv.question_id = q.id
    AND uv.user_id = <logged_in_user_id>
WHERE <your criteria>
In the view you can check whether uv.id is present: if so, mark the question as voted; otherwise not.
You may need to change the fields of your questions table and so on. I am assuming you store questions in a questions table and users in a users table, each having the primary key id.
Thanks
You could use a combination of your suggested strategies.
Retrieve all the votes made by the logged in user for recent/active questions only and store them in the session.
You then have the ones that are more likely to be needed while still reducing the amount you need to store in the session.
In the less likely event that you need other results, query for just those as and when you need to.
This strategy will reduce the amount you need to store in the session and also reduce the number of calls you make to your database.
Just based on the information that you've given so far, I would take the second approach: get the IDs of all the items on the page, and then do a single query to get all the user's votes for that list of item IDs. Then pass the collection of the user's item votes to your view, so it can render items differently when the user has voted for them.
The other two approaches seem like they would tend to be less efficient, if I understood you correctly. Using a view helper to initiate an individual query for each item to check if the user has voted on it could lead to a lot of unnecessary queries. And preloading all the user's voting history at login seems to add unnecessary overhead, getting data that isn't always needed and adding the burden of keeping it up to date for the duration of the session.
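A sketch of that single query, using the same placeholder style as the earlier answer; the id list is whatever items the page is rendering:

SELECT question_id
FROM user_votes
WHERE user_id = <logged_in_user_id>
AND question_id IN (<ids_of_items_on_the_page>);

The resulting set of question ids goes to the view, which marks an item as voted when its id is in the set.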

Sybase select variable logic

Ok, I have a question relating to an issue I've previously had. I know how to fix it, but we are having problems trying to reproduce the error.
We have a series of procedures that create records based on other records. The records are linked to the primary record by way of a link_id. In a procedure that grabs this link_id, the query is
select #p_link_id = id --of the parent
from table
where thingy_id = (blah)
Now, there are multiple rows in the table for the activity. Some can be cancelled. The code I have doesn't exclude cancelled rows in the select statement, so if there are previously cancelled rows, those ids will appear in the select. There is always going to be exactly one 'open' record, which is the one selected if I exclude cancelled rows (append where status != 'C').
This solves this issue. However, I need to be able to reproduce the issue in our development environment.
I've gone through a process where I've entered a whole heap of data (opening, cancelling, etc.) to try to get this select statement to return an invalid id. However, whenever I run the select, the ids are in order (sequence-generated), but in the case where this error occurred, the select statement returned what seems to be the first value into the variable.
For example.
ID Status
1 Cancelled
2 Cancelled
3 Cancelled
4 Open
Given the above, if I do a select for the ID I want, I want to get '4'. In the error case, the result is 1. However, even if I enter 10 cancelled records, I still get the last one in the select.
In Oracle, I know that if you select into a variable and more than one record is returned, you get an error (I think). Sybase, apparently, can assign multiple values into a variable without erroring.
I'm thinking that either it's something to do with how the data is selected from the table, where ids without a sort order don't come back in ascending order, or there's a dboption where a select into a variable will save the first or last value queried.
Edit: it looks like we can reproduce this error by rolling back stored procedure changes. However, the procs don't go anywhere near this link_id column. Is it possible that changes to the database architecture could break an index or something?
If more than one row is returned, the value that is stored will be the last value in the list, according to this.
If you haven't specified an order for retrieval via ORDER BY, then the order returned will be at the convenience of the database engine. It may very well vary by the database instance. It may be in the order created, or even appear "random" because of where the data is placed within the database block structure.
The moral of the story:
Always make singleton SELECTs return a single row
When #1 can't be done, use an ORDER BY to make sure the one you care about comes last
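Putting both morals together with the fix described in the question, the select might become something like this (names follow the question's snippet; note that standard Sybase T-SQL would use an @-prefixed variable):

select #p_link_id = id
from table
where thingy_id = (blah)
and status != 'C'
order by id asc  -- if several rows still matched, the last (highest) id would win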